I. Problems with Information Retrieval
II. What Do We Want?
III. Standard Techniques
IV. Self-Generating Text Network
V. Basic Operations
VI. Associations
VII. Topical Facets
VIII. Real World Example
IX. Document Retrieval
X. Retrieval Output
XI. Serendipical Search
XII. Computational Complexity
XIII. Final Remarks
 If you don’t know the data, you can’t even start to
search
 Knowing a search word is not enough: it may return
an overload of out-of-context documents
 Having a word without knowing its synonyms and
derivations results in overlooking potentially
relevant data
 Documents that are only conceptually related are
not retrieved
The practice of committing huge data repositories
to manually tuned trees or tables with standard
taxonomies/ontologies:
 is unaffordable due to its high cost in money and time
 has the undesirable side effect of curtailing the
creativity of its users (they search for what they
expect to find).
 The resulting dimensions can be difficult to interpret.
For example, in:
{(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}
the component (1.3452 * car + 0.2828 * truck) could be
interpreted as "vehicle". However, cases also occur such as:
{(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)}
which can be justified mathematically but has no
meaning in natural language.
 The probabilistic model of LSA does not match the
observed data: LSA assumes that words and documents
form a Gaussian model (ergodic hypothesis).
One or more of the following:
 Eliminate search altogether - as far as possible
 Enable unsupervised discovery of unknown things
 Let the system organize the documents (no tagging,
less paper shuffling)
 Enrich the available structured information
automatically
 …
III. Standard Classification Techniques: Bag-Of-Words model
[Diagram: Latent Semantic Analysis, k-nearest neighbor, Support Vector Machine, Radial Basis Function kernel]
Diagram by: Cardoso-Cachopo, A., & Oliveira, A. L. (2003). An Empirical Comparison of Text
Categorization Methods. Paper presented at the 10th International Symposium on String Processing
and Information Retrieval (SPIRE), Manaus (BR).
 Ignoring the word position in the document
 Ignoring the ordering of words in the document
 Language production is not Gaussian
 Ignoring the information contained in HTML tags
 Multilingual documents
 Word separation/morphology may be tricky (e.g.,
German, Chinese, etc.)
 No comprehensive evaluation on large non-English
corpora
Use of token-types in five consecutive days

Token-type used on   New York Times (Oct. 1-5, 1998)   CNN (Dec. 21-25, 1998)
one day (used once)  67%                               71%
two days             14%                               14%
three days           7%                                6%
four days            5%                                4%
five days            7%                                5%
Human language is not a finite state machine that allows
the prediction of the transitions of a finite set of elements
with a finite set of states
The curse of dimensionality
[Figure: 215 steps; 100+ dimensions? The relationship 'ash - volcanos' is accessible
in four steps in a network.]
III. Semantic Information Discovery with Topical Facets
[Diagram: retrieval models arranged along two axes.
Properties of the model: without term-interdependencies; with immanent term-interdependencies; with transcendent term-interdependencies; real term-interdependencies.
Mathematical basis: set-theoretic; algebraic; probabilistic; graph-theoretic/tensor algebra.
Models shown: Standard Boolean, Extended Boolean, Fuzzy Set, Vector Space, Generalized Vector Space, Topic-Based Vector Space, Balanced Topic-Based Vector Space, Latent Semantic, Binary Interdependence, Inference Network, Belief Network, Language Models, Retrieval by Logical Imaging, Back Propagation Neuronal Network, Spreading Activation Neuronal Network, Topical Facets.]
Diagram adapted from: Kuropka, D. (2004). Modelle zur Repräsentation natürlichsprachlicher Dokumente.
Ontologie-basiertes Information-Filtering und -Retrieval mit relationalen Datenbanken (Vol. 10). Berlin (DE):
Logos Verlag.
[Diagram labels: community of readers; domain of interest; collections ca … cy; documents da, db, dc … dn, do, dm; texts; source sa; timestamp t; tokens ti, tj; vocabulary; words wa, wb, … wx]
IV. Self-Generating Text Network – Components
 A (labeled) graph G is a 4-tuple G = (V, A, α, β)
where:
 V is a set of nodes (vertices), A ⊆ V × V is a set of
links (arcs) connecting the nodes, α is a function
labeling the nodes and β is a function labeling the
arcs.
 Graph size: |G| = |V| + |A|
 One node for each unique term
 If word B follows word A, there is a directed link (arc)
from A to B (different from B →A)
 Punctuation marks are ignored
 Stop words are not removed
 Graph size limits itself (Zipf’s law)
 No stemming
 Alternate forms of the same term are conflated using
a Directed Acyclic Word Graph (DAWG)
 Graph-based approach can outperform traditional
vector-based methods
 Each arc is labeled with a relative frequency metric
 A normalized value in [0,1] is assigned by dividing
each link frequency value by the maximum link
frequency value over the collection subgraph
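The construction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_text_network` and the tokenizing regular expression are hypothetical, the normalization runs over the single input text rather than over a collection subgraph, and the DAWG conflation of alternate word forms is left out.

```python
import re
from collections import Counter

def build_text_network(text):
    """Return (nodes, arcs); arcs maps (a, b) to a frequency normalized to [0, 1]."""
    # Punctuation is ignored; stop words are kept; no stemming is applied.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # One directed arc A -> B each time word B follows word A.
    counts = Counter(zip(tokens, tokens[1:]))
    # Divide every arc frequency by the maximum arc frequency.
    top = max(counts.values(), default=1)
    arcs = {arc: freq / top for arc, freq in counts.items()}
    return set(tokens), arcs

nodes, arcs = build_text_network("The cat saw the dog. The dog saw the cat!")
graph_size = len(nodes) + len(arcs)  # |G| = |V| + |A|
```

Because arcs are directed, ("the", "cat") and ("cat", "the") are different keys, and only the first one occurs in this sample text.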
 These networks have low diameter (like uniform
random networks), but also have the property that
many of the neighbors of a node are themselves
neighbors (unlike uniform random networks).
 A simple underlying structure explains the
presence of most edges, but a few edges are
produced by a random process that does not
respect this structure.
 Many social networks display these properties.
The World Wide Web is the most famous. Natural
language exhibits similar characteristics.
IV. Self-Generating Text Network – The Small-World Model
A self-generating text network is:
 A set of related token-types
 A catalogue of relations
 A fixed series of operations on relations
 Unrestricted: no predetermined constructs
Adding new data is efficient:
 Creation of new relations and nodes at a
diminishing rate (log n)
 No problem with sparse data
 A relation is relevant from the first instance
[Figure: Two sets of common words (D, E, F and A, B, C) identified in six
documents from two collections (c1, c2)]
V. Basic Operations
[Figure: The similarity of dissimilar words (labels: Q, c1, c2)]
V. Basic Operations
 The informative value of a token is based on:
◦ the links of the token with other tokens in a
document (link frequency)
◦ an inverse relation with the link frequency in a
collection
◦ the notion of collection: texts having the same source
and the same timestamp (or any other meta
criterion) are expected to differ in
content
 Finding informative structures: sequences of
at least two informative
tokens
V. Basic Operations: Information Value Spectrum
Cluster from NYT Oct. 2, 1998
Informative value: 0.7987
Words with this value: 20
adaptation, astor, baltar, bandoneon, composer's, el,
extended, ferrer's, goblin, instrumental,
instrumentations, instruments, kremer's, unfardo,
plata, poet, splendid, surreal, tangos, vocalists
 Informative n-gram
 Kind of summary component
 Found inside a document
 Found inside a collection
 No human intervention or manual clean-up needed
 Survives outside the collection and document as a
generalization of local knowledge
19981001 NYT 0501:
tobacco-plantation clan
nat turner
turn-of-the-century victorian
good-hearted squalor
darrell larson
birdlike redhead
andie macdowell
bridget terry
neurotic perfectionist

19981001 APW 0538:
tel aviv suburb
flame retardant
dangerous chemicals aboard
190 liters
dimethyl methylphosphonate
nerve gas
nrc handelsblad
intended recipient
shaul yahalom
nes ziona
19981002 PRI 2000.3309 19981002 NBC 1830.1603
holding addressed simpson
producer minus
overkill sit
sydney’s dance
hinder pastels candidacy
multifaceted characters
des bouquets
pop charts
sensible basics
musical finale
art collector
opposition third
always collected
gay drag
simon hunt
spoiler mendes
scheck raised
completely intimately
young fans rarely questioned
league baseball
bounced along
great horse champion roy
popular phone store
nod yes
best sellers
gene autry
always considered myself
star audrey
producer criticized
million records
businessman owning
los angeles
tv stations
sings mama
recording artist
order gets
VI. Associations
Content analysis of the data leads to a computer-generated
document description called a Topical
Facet.
 Topical because it connects two or more
documents based on a semantic agreement
 Facet because the accord is highly fractional. The
semantic agreement in question is a collection of
shared phrases composed of significant words.
The fact of being fractional relaxes the
requirement of some knowledge-representation
systems to allocate a single topic or a single
concept to a document or a document subdivision.
Assume six documents:
 d1 is about Bill Clinton, Hamas, a bomb factory, Israel, and
Madeleine Albright (BC, HM, BF, IS, and MA)
 d2 is about Bill Clinton, Jonathan Pollard, and Madeleine
Albright (BC, JP, and MA)
 d3 is about Israel, Jonathan Pollard, and the US Navy
(IS, JP, and UN)
 d4 is about Israel, Hamas and a bomb factory
(IS, HM and BF)
 d5 is about the US Navy (UN)
 d6 is about a bomb factory (BF)
Document relations based on shared phrases
    BC       IS          HM       JP       MA       UN       BF
d1  √        √           √                 √                 √
d2  √                             √        √
d3           √                    √                 √
d4           √           √                                   √
d5                                                  √
d6                                                           √
S   {d1,d2}  {d1,d3,d4}  {d1,d4}  {d2,d3}  {d1,d2}  {d3,d5}  {d1,d4,d6}
[Figure: the document sets {d1,d2}, {d1,d3,d4}, {d1,d4}, {d2,d3}, {d3,d5} and {d1,d4,d6}
grouped into the facets {IS, HM}, {BC, MA}, {JP}, {UN} and {HM, BF}]
 Every subclass of a set that is defined by a predicate
P(x) is itself a set by the axiom schema of
comprehension:
$$\forall A\; \exists B\; \forall C:\; C \in B \iff \bigl(C \in A \land P(C)\bigr)$$
In words: given a set A and a predicate P, there is a
subset B of A whose members are precisely the
members of A that satisfy P
 A characteristic function defined on a set D indicates
the membership of an element in a subset A of D
$$i_A(d) = \begin{cases} 1 & \text{if } d \in A \\ 0 & \text{if } d \notin A \end{cases}$$
In words: a document d is a member of a set A if it has
one or more n-grams (arcs) in common with the other
documents of that set.
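The row S of the table and the characteristic function can be reproduced mechanically. The sketch below uses the six documents from the example; the function names `shared_phrase_sets` and `indicator` are hypothetical, and grouping phrases by identical document sets is an illustrative assumption, not necessarily how the facets in the figure were derived.

```python
from collections import defaultdict

documents = {
    "d1": {"BC", "HM", "BF", "IS", "MA"},
    "d2": {"BC", "JP", "MA"},
    "d3": {"IS", "JP", "UN"},
    "d4": {"IS", "HM", "BF"},
    "d5": {"UN"},
    "d6": {"BF"},
}

def shared_phrase_sets(docs):
    """Map each phrase to the set of documents containing it (row S)."""
    index = defaultdict(set)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    # Only phrases shared by two or more documents can connect documents.
    return {p: ds for p, ds in index.items() if len(ds) >= 2}

def indicator(subset, d):
    """Characteristic function i_A(d): 1 if d is in A, otherwise 0."""
    return 1 if d in subset else 0

S = shared_phrase_sets(documents)

# Assumed grouping step: phrases that connect exactly the same documents.
groups = defaultdict(set)
for phrase, doc_set in S.items():
    groups[frozenset(doc_set)].add(phrase)
```

With this data, BC and MA end up in the same group because both connect exactly d1 and d2.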
Query → Topical Facets → Documents → Information
Query: “Hamas”
Topical facets: {IS, HM}, {HM, BF}
Documents: {d1, d3, d4}, {d1, d4, d6}
d1 {BC, IS, HM, MA}
d3 {IS, JP, UN}
d4 {IS, HM}
d6 {HM, BF}
Core information from the Topical Facets: HM, BF, IS
Additional information from related documents: JP, UN; BC, MA
Answers to a query about “Hamas”
Core Information → Hamas is related to:
- bomb factory
- Israel
Additional facts → Israel is related to:
- Bill Clinton and Madeleine Albright
- Jonathan Pollard and the US Navy
No more information on the bomb factory.
VII. Towards Topical Facets
Context: Also in 1998, many economies in Latin America, Russia
and Asia faced monetary difficulties, and more than one bailout
was set up by national and international financial institutions.
Suppose a user is interested in information on:
“A bailout package passed by the Japanese parliament to save
the banking system”
Assume that this query points to three topical facets containing
components known by the system:
 banking system
 Japanese parliament
 bailout package
[Figure: topic TP formed by partial agreement of three topical facets. Activated components
characterizing the topic: bailout, banking system, Japanese parliament. The remaining
topical facet components are deactivated.]
VIII. Real World Example: Topic Construction
We’re looking for documents defined by the
intersection of three topical facets
 A complete topic is composed of the activated
elements of several topical facets.
 The same topical facet can be used in more than
one topic
 A topical facet is a semantic component shared
by more than one document.
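Topic construction as an intersection of topical facets can be sketched as plain set algebra. The facet names follow the bailout example, but the document ids (`dA` … `dF`) and the function name `documents_for_topic` are hypothetical placeholders, not corpus data.

```python
# Hypothetical facet -> document-id sets (illustrative ids, not corpus data).
facet_docs = {
    "bailout package": {"dA", "dB", "dC", "dD"},
    "banking system": {"dC", "dD", "dE"},
    "japanese parliament": {"dD", "dF"},
}

def documents_for_topic(active_facets, facet_docs):
    """Documents shared by every activated facet of the topic."""
    sets = [facet_docs[f] for f in active_facets]
    return set.intersection(*sets) if sets else set()

topic = ["bailout package", "banking system", "japanese parliament"]
hits = documents_for_topic(topic, facet_docs)

# The same facet can take part in another, broader topic.
broader = documents_for_topic(["bailout package", "banking system"], facet_docs)
```

Dropping one facet from the topic widens the result, which mirrors the overshoot that occurs when only the bailout facets are used.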
 Topical facet # 179 containing:
[monetary fund, bailout package, latin america’s, northern
portugal, 30 billion, world bank, american nations].
Related documents: 19
 Topical facet # 184 containing:
[interest rates, bailout package, highest level, discount rate,
sustained economic, financial institutions, rates late].
Related documents: 36.
 Topical facet # 229 containing:
[banking system, market rally].
Related documents: 7
 Topical facet # 401 containing:
[banking sector, japan’s upper, upper house]
Related documents: 3.
Retrieving all the documents that are connected by the
bailout facets would overshoot the question: it brings in documents
on Thailand, Russia and Brazil, but not on Japan:
 Document 4897: 19981016_APW0453 – extract
(…) Thailand’s interest rates, although now coming down
slightly, remain among the highest in the region, the legacy of
a tight monetary policy dictated by the terms of a dlrs 17.2
billion International Monetary Fund economic bailout package.
 Document 5140: 19981016_NYT0286 – extract
(…) Since the economic collapse of Russia in August, investors
and economists have sharpened their focus on Brazil. While
President Clinton lobbied Congress for $18 billion to restock
the International Monetary Fund this week, with an eye
toward assisting Brazil, the team that returned here from the
IMF’s annual meeting in Washington earlier this month
scoured government accounts for politically feasible sources of
savings and reform. (…)
Example d5192: 19981016_VOA0600.0197 – full text ASR, as is.
japan’s upper house of parliament has given final approval to a
package of laws the clinton and japan’s ailing banking sector the
new laws also that they five hundred twenty billion dollar fund
of taxpayer dollars to bail out a week but solvent banks earlier
parliament passed a supplemental budget to fund the reforms
japan is that under intense international pressure to find a
banking center there’s been crippled by a series as speculative
and loans made during the nineteen eighty
Example d4859: 19981016_ABC1830.0936 – full text ASR, as is.
still on the money tonight two steps today to address the most
obvious weak spots in the global economy that japanese
government has given final approval for public funds to failing
banks international monetary fund knows it will be getting that
eighteen billion extra dollar from the u. s. when congress and the
president sign the budget agreement next week
 New York Times Newswire Service
 Associated Press Worldstream Service
 CNN Cable News Network
 American Broadcasting Company
 National Broadcasting Company
 MS-NBC
 Public Radio International
 Voice of America
Tested on the 15-million-word, partially ASR-transcribed
TDT-3 corpus (DARPA)
Topic Detection & Tracking Task
 Algorithms are allowed to employ only the content of
the data plus information about source, date, and time.
 The topic detection task requires systems to group
incoming stories into unsupervised topic clusters,
creating new clusters (topics) as needed, without look
ahead and without deferring the decision.
Example:
“Former Chilean dictator General Augusto Pinochet, who ruled
Chile from 1973-1990, is arrested in a London hospital on a
warrant issued by Spanish Judge Baltasar Garzon on charges of
genocide and torture during his reign”
Term – Facet Dictionary
T1 {Fa, Fb, Fc, …}
T2 {Fb, Fd, Fe, …}
Tn {Fa, Fc, Fe, …}
Facet – Document Dictionary
F1 {Da, Db, Dc, …}
F2 {Dd, Df, Dg, …}
Fn {Db, Dg, Dq, …}
Query Term / Query by Proxy
Query-by-proxy uses a true prototype. It is the single
document from a body of data that best represents the
query in the sense that this document would retrieve
the query.
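A query term can be resolved through the two dictionaries in two hops: term to facets, then facets to documents. The entries below are hypothetical, and ranking by the number of matching facets is an assumption made for this sketch.

```python
# Hypothetical dictionary contents, for illustration only.
term_facets = {"pinochet": {"F1", "F2"}, "garzon": {"F2"}}
facet_documents = {"F1": {"Da", "Db"}, "F2": {"Db", "Dc"}}

def retrieve(term):
    """Return (document, matching facet count) pairs, best match first."""
    hits = {}
    for facet in term_facets.get(term, ()):
        for doc in facet_documents.get(facet, ()):
            hits[doc] = hits.get(doc, 0) + 1
    # Sort by descending count, then by document id for a stable order.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))

result = retrieve("pinochet")
```

In a query-by-proxy setting, the top-ranked document could then serve as the prototype that stands in for the query.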
[Figure: Pinochet output, with regions marked "on-topic but overlooked by human annotator"
and "off-topic with thematic resemblance"]
IX. Document Retrieval Example
Pinochet - First Off Topic Document d10587
The text is about the alleged torturing of Coptic Christians
by the Egyptian police and reactions of local and
international human rights groups.
The document is off-topic, but because of the partial
agreement it shows a thematic resemblance at the border of the
on-topic region:
 The alleged involvement of the authorities in arresting
and torturing citizens
 The international attention of human rights groups
 Similar documents share similar topical facets
 Ranking based on shared facets
 More shared facets means better content similarity
 Three steps:
◦ Core extraction
◦ Document-by-document similarity
◦ Semantic Preference Clustering
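Ranking by shared facets can be illustrated with four of the twelve TDT documents used in this deck. The facet ids are taken from the document table; the function name `rank_by_shared_facets` and the choice of a plain overlap count as the score are assumptions of this sketch.

```python
doc_facets = {
    "d269": {1321, 1397},
    "d290": {1241, 1397, 2727, 2856},
    "d311": {1321},
    "d632": {1241, 2727},
}

def rank_by_shared_facets(query_doc, doc_facets):
    """Rank the other documents by how many topical facets they share."""
    q = doc_facets[query_doc]
    scores = {d: len(q & f) for d, f in doc_facets.items() if d != query_doc}
    ranked = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

ranking = rank_by_shared_facets("d290", doc_facets)
```

Documents without any shared facet (here d311) do not appear in the ranking at all.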
X. Retrieval Output: Graph based similarity
Graph G1 = (V1, A1, α1, β1) and graph G2 = (V2, A2, α2, β2)
are isomorphic, expressed as G1 ≅ G2, if there exists a
bijective function
f : V1 → V2 such that
α1(x) = α2(f(x)) ∀x ∈ V1 and
β1(x, y) = β2(f(x), f(y)) ∀(x, y) ∈ V1 × V1
[Figure: two isomorphic graphs G1 and G2, each with nodes A, B, C, D and arcs labeled W, X, Y, Z]
Let G, G1 and G2 be graphs. The graph G is a
common subgraph of G1 and G2 if there exist
subgraph isomorphisms from G to G1 and from G
to G2
[Figure: graphs G1 (nodes A, B, C, D) and G2 (nodes A, B, E, F) with their common subgraph G
(nodes A and B joined by arc X)]
The graph G is a maximum common subgraph if G
is a common subgraph of G1 and G2 and there exists
no other common subgraph G’ of G1 and G2 such
that |G’| > |G|
[Figure: graphs G1 and G2 with their maximum common subgraph G (nodes A and B joined by arc X)]
|G| = |V| + |A| = 2 + 1 = 3
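When every node label is unique, as in a text network with one node per term, a maximum common subgraph can be found by intersecting node sets and arc sets instead of searching for isomorphisms. The sketch below relies on that assumption; the arc directions and the arcs outside A → B are hypothetical, since the figure does not survive in the text.

```python
def max_common_subgraph(g1, g2):
    """g = (nodes, arcs); arcs maps (source, target) -> arc label.

    Assumes node labels are unique, so nodes can be matched by name.
    """
    nodes = g1[0] & g2[0]
    # An arc is common when both graphs hold it with the same label.
    arcs = {a: lab for a, lab in g1[1].items() if g2[1].get(a) == lab}
    return nodes, arcs

# Hypothetical arc layout, loosely following the labels in the figure.
G1 = ({"A", "B", "C", "D"},
      {("A", "B"): "X", ("B", "C"): "Y", ("C", "D"): "Z", ("D", "A"): "W"})
G2 = ({"A", "B", "E", "F"},
      {("A", "B"): "X", ("B", "E"): "R", ("E", "F"): "P", ("F", "A"): "Q"})

V, A = max_common_subgraph(G1, G2)
size = len(V) + len(A)  # |G| = |V| + |A|
```

For general graphs without unique labels this shortcut does not apply, and the maximum common subgraph problem is much harder.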
Doc #  Document label         Length (tokens)  Topical facets in the document
45     19981001_APW0580       252              903, 1321, 1399, 2727, 3561, 3566
62     19981001_APW0855       363              1321, 1429, 2727, 3640
268    19981001_VOA1700.0226  181              1321, 1365, 1399, 2727, 3324
269    19981001_VOA1700.0293  73               1321, 1397
290    19981001_VOA1700.1985  155              1241, 1397, 2727, 2856
311    19981001_VOA1800.0303  47               1321
367    19981002_APW0564       318              1110, 1321, 1421, 2558, 2727
393    19981002_APW1025       315              903, 1241, 1321, 1397, 1399, 1429, 2727, 3566
408    19981002_APW1076       573              903, 1110, 1241, 1321, 1365, 1397, 1399, 1421, 1429, 2558, 2727, 3566
609    19981002_VOA1700.0249  172              1241, 1321, 1365, 3247, 3400
632    19981002_VOA1700.2128  46               1241, 2727
672    19981002_VOA1800.2520  371              1241, 1321, 1365, 1397, 1399, 2727, 3054, 3247, 3400
X. Retrieval Output with twelve TDT documents
<DOC>
<DOCNO> VOA19981002.1700.2128 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> ASRTEXT </TXTTYPE>
<TEXT>
ISRAEL HAS SEALED ITS BORDERS WITH THE WEST
BANK AND GAZA STRIP AMID WARNINGS THE
MILITANT PALESTINIAN GROUP HAMAS IS PLANNING A
MAJOR ATTACK ON ISRAEL A CLOSURE REMAINS IN
EFFECT UNTIL AT LEAST TUESDAY
AND BANDS PALESTINIAN WORKERS FROM GOING TO
THEIR JOBS IN ISRAEL
</TEXT>
</DOC>
Example
[Figure: text graphs of documents d632, d311 and d609]
‘bomb factory’ from
TF1321 links d45, d393,
d408 and d672 but is
not directly related to
any of the three graphs
shown here.
‘horrific attack’ and ‘yitzhak mordechai’ from
TF1241 and TF1321 link d609 with d672 (not
shown).
‘government spokesman’
from TF1321 links d609
with d268, d408 and
d672 (not shown).
Topical facet – document matrix (19 facets, 12 documents)
Documents: d45 d62 d268 d269 d290 d311 d367 d393 d408 d609 d632 d672

Topical facet   Total docs
903             3
1110            2
1241            6
1321            10
1365            5
1397            4
1399            5
1421            2
1429            3
2558            2
2727            9
2856            1
3054            1
3247            2
3324            1
3400            2
3561            1
3566            3
3640            1

Total facets per document (d45 … d672): 6 4 5 2 4 1 5 8 12 5 2 9
X. Retrieval Output - Bipartite Rendering of the Facet
Document Relations
X. Retrieval Output - Unipartite Transformation
[Figure: unipartite network of the topical facets]
 Let G = (V, L) be a graph. V is the set of vertices
and L is the set of arcs.
 Let n = |V| and m = |L|
 A subgraph Hk = (W, L|W) induced by the set W
is a k-core or a core of order k iff
$$\forall v \in W:\ \deg_{H_k}(v) \ge k$$
and Hk is the maximum subgraph with this
property. The core of maximum order is also
called the main core.
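The k-core definition can be turned into a small peeling procedure: repeatedly remove the vertex of lowest remaining degree and record the highest degree level reached so far. The sketch below is a simple quadratic-time version on a toy undirected graph; the function name `core_numbers` and the toy data are hypothetical and not taken from the facet network.

```python
def core_numbers(adj):
    """adj: node -> set of neighbours (undirected). Returns node -> core number."""
    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off the vertex with the lowest remaining degree.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# Toy graph: a triangle a-b-c with a pendant vertex d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cores = core_numbers(toy)
main_core = {v for v, k in cores.items() if k == max(cores.values())}
```

The main core of the toy graph is the triangle; the pendant vertex only belongs to the 1-core.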
spy
citizenship
troops
security
intelligence-gathering body
governments disavowed jonathan pollard
many
daily yediot ahronot reported
recently
shadowy
granted
free
territories
pollard’s fate
palestinian
militants
militant group
leader yasser arafat
militant groups
bomb factory hamas activist
yitzhak mordechai
moshe fogel
early release
madeleine albright
joseph ralston
horrific attack
mideast summit
middle east
government spokesman
intelligence analyst
clinton agreed
naval intelligence
military documents
passing secret
rogue operation
possible release
arrested outside officials including cabinet
israeli
X. Retrieval Output – Result of the core extraction
k     Frequency   Frequency %   Topical facets
3     2           10.5          2856, 3640
4     1           5.3           3324
5     1           5.3           3561
8     3           15.8          3054, 3247, 3400
9     12          63.2          all other
Sum   19          100.00
High k: interesting facets
X. Retrieval Output - Core Extraction
$$\text{similarity} = \alpha \cdot Vsim + \beta \cdot Asim$$
[Figure: text graphs T1 and T2 and their intersection T1 ∩ T2]
$$Vsim = \frac{n(vT_i)}{\sqrt{n(vT_1)\, n(vT_2)}}$$
where T_i = T1 ∩ T2, n counts vertices and m counts arcs.
The vertex similarity (Vsim) is a cosine coefficient
that expresses how many vertices two text graphs
have in common.
The arc similarity (Asim) tells how many vertices are
weakly connected in the original graphs. A ‘weakly’
homomorphic subgraph conveys more information than
a strict one.
$$Asim = \frac{m(aT_i)}{\sqrt{m(aT_1)\, m(aT_2)}}$$
$$\text{similarity} = \alpha\, \frac{n(vT_i)}{\sqrt{n(vT_1)\, n(vT_2)}} + \beta\, \frac{m(aT_i)}{\sqrt{m(aT_1)\, m(aT_2)}}$$
The values of α and β depend on the structure of the
text graphs T1 and T2, namely on the
degree of connection of the shared elements in the
graphs T1 and T2:
$$\alpha + \beta = 1$$
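Reading Vsim and Asim as cosine coefficients over shared vertices and shared arcs, the combined score can be sketched as follows. Several things here are assumptions: the square-root (cosine) denominator, the fixed weight `alpha` with the second weight set to 1 - alpha, and the toy graphs. The deck states that the weights depend on graph structure, which this sketch does not model.

```python
from math import sqrt

def cosine_overlap(a, b):
    """Cosine coefficient of two sets: |a ∩ b| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def graph_similarity(t1, t2, alpha=0.5):
    """t = (vertices, arcs). Weighted sum of vertex and arc similarity."""
    v1, a1 = t1
    v2, a2 = t2
    vsim = cosine_overlap(v1, v2)
    asim = cosine_overlap(a1, a2)
    return alpha * vsim + (1 - alpha) * asim

# Hypothetical text graphs sharing vertices A, B, C and two arcs.
T1 = ({"A", "B", "C", "D"}, {("A", "B"), ("B", "C"), ("C", "D")})
T2 = ({"A", "B", "C", "E"}, {("A", "B"), ("B", "C"), ("C", "E")})

score = graph_similarity(T1, T2)
```

With these toy graphs the vertex similarity is 3/4 and the arc similarity is 2/3, so equal weights give their mean.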
A set D of documents analyzed on their mutual
similarity is stored in a symmetric matrix.
Each node in the matrix contains a similarity value v.
$D_{i,j}$ stands for the similarity score between vi and vj
vj is a nearest neighbor to vi if and only if
$$D_{i,j} = \max_{h} D_{i,h}$$
where h = 1…d and h ≠ i
A semantic preference cluster is the symmetric
transitive closure of the nearest neighbor relation.
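The symmetric transitive closure of the nearest neighbor relation amounts to merging every document with its nearest neighbor and taking the resulting connected groups. A minimal union-find sketch, assuming that ties for the maximum all count as nearest neighbors and using a hypothetical 4 × 4 similarity matrix:

```python
def semantic_preference_clusters(sim):
    """sim: symmetric similarity matrix D (list of lists). Returns index clusters."""
    d = len(sim)
    parent = list(range(d))

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(d):
        others = [h for h in range(d) if h != i]
        if not others:
            continue
        best = max(sim[i][h] for h in others)
        for j in others:
            if sim[i][j] == best:          # vj is a nearest neighbor of vi
                parent[find(i)] = find(j)  # merge: symmetric and transitive

    clusters = {}
    for i in range(d):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical similarity matrix for four documents.
D = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters = semantic_preference_clusters(D)
```

No similarity threshold or cluster count is needed: the clusters follow from each document's own preference.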
X. Retrieval Output – Semantic Preference Clustering
Serendipical search or circumstantial roaming is
expecting to face a chance encounter. The user
cannot query without knowing the exact terminology.
Navigating through the content neighborhood offers a
broad view on what plausible answers are available.
Scope: 16/10/1998 – 17/10/1998
583 documents from 8 US news sources (ABC – APW – CNN – MNB –
NBC – NYT – PRI – VOA)
511 topical facets; largest set: 13 documents, smallest: 2
Average facets per document: 2.8
A document is seen in 3.1 topical facets on average.
Topical facet # 4 - 3 documents. Informative weight: 16.5.
General content indication: [ predominantly roman ]
Doc 4892: sinn fein - mairead corrigan - appropriate
laureates - belfast's frigid - ira's terrorist - ulster unionist
- roman catholic
Doc 4905: sinn fein - appropriate laureates - ira's terrorist -
belfast's frigid - mairead corrigan - ulster unionist -
roman catholic
Doc 5007: hardliner coup moderate - quite severe british -
rewarding jerry - congratulate fischer's - ulster unionist -
bitter enemies - tirade cease
Topical facet # 5 - 5 documents. Informative weight: 72.3.
General content indication: [ bertie ahern ]
Doc 4892: sinn fein - mairead corrigan - appropriate
laureates - belfast's frigid - ira's terrorist - ulster
unionist - roman catholic
Doc 4905: sinn fein - appropriate laureates - ira's terrorist
- belfast's frigid - mairead corrigan - ulster unionist -
roman catholic
Doc 5104: fire-breathing speeches condemning - gunman's
getaway - secretive five-man - notable similarities -
violent tribal - fein's entry - dissuade irish-americans
Doc 5154: fire-breathing speeches condemning -
fulminating preacher-politician - canary wharf - notable
similarities - gunman's getaway - downtown portadown
- shimon peres
Doc 5342: restored victorian splendor overlooking - irish-
american alumnus donald - notre dame - program's
classrooms - rapidly expanding cities - co inc - frankly
thrilling
Topical facet # 6 - 5 documents. Informative weight: 72.3.
General content indication: [ sinn fein ]
Doc 4892: sinn fein - mairead corrigan - appropriate
laureates - belfast's frigid - ira's terrorist - ulster
unionist - roman catholic
Doc 4905: sinn fein - appropriate laureates - ira's terrorist
- belfast's frigid - mairead corrigan - ulster unionist -
roman catholic
Doc 5104: fire-breathing speeches condemning - gunman's
getaway - secretive five-man - notable similarities -
violent tribal - fein's entry - dissuade irish-americans
Doc 5154: fire-breathing speeches condemning -
fulminating preacher-politician - canary wharf - notable
similarities - gunman's getaway - downtown portadown
- shimon peres
Doc 5245: politician seamus - sinn fein - moline moderate
Topical facet # 7 - 4 documents. Informative weight: 16.4.
General content indication: [ tony blair ]
Doc 398: gr7 harrier jump jets - gioia del colle -
deployment shows
Doc 4892: sinn fein - mairead corrigan - appropriate
laureates - belfast's frigid - ira's terrorist - ulster
unionist - roman catholic
Doc 4905: sinn fein - appropriate laureates - ira's
terrorist - belfast's frigid - mairead corrigan - ulster
unionist - roman catholic
Doc 5007: hardliner coup moderate - quite severe british -
rewarding jerry - congratulate fischer's - ulster
unionist - bitter enemies - tirade cease
Why is Tony Blair here and
why the Harrier Jets?
Topical facet # 3 - 2 documents. Informative weight: 0.9.
General content indication: [ football association ]
Doc 398: gr7 harrier jump jets - gioia del colle - deployment
shows
Reconstructed text for Doc 398
britain said friday it will send four more fighter-bombers to a nato
base in southern italy as part of the buildup over the serbian
province of kosovo defense secretary george robertson said the
gr7 harrier jump jets will fly monday from laarbruch a british
base in germany to gioia del colle
(…) england canceled a soccer match with yugoslavia which was
to have been played in london nov 18 the english football
association said it acted now to end uncertainty over the fixture
(…) prime minister tony blair again warned of military reprisals if
attacks against ethnic albanians blamed on serbs continued in
kosovo
Serendipical insight is obtained as a consequence of
finding a link outside a system, able to bridge two
previously unrelated subgraphs pertaining to a problem P
inside that system.
[Figure: problem P with nodes a, q, a’ and q’]
XI. Serendipical & Approximate Search
What if the computer formulates queries of its own?
• Random token generator generates a string
• The random string is submitted as a query
• Most of the time without result
• Sometimes an unexpected relation pops up
XI. Serendipical & Approximate Search
Relation found with a random query generator
NYT19981001.0277
No doubt John Boorman's canny, elegant new film “The General”,
about the notorious Irish thief Martin Cahill, hits unusually close to
home. (…) Cahill's biographer, Paul Williams, maintains that when
Cahill tried to enlist in the Navy at 15, in 1964, and filled out an
application: “Martin chose the Position of bugler”. Unfortunately,
due to his difficulties in school, he misread the word as
“burglar”.
NYT19981003.0145 & NYT19981003.0220
(…) “We're placing a high value on learning another language”,
said Jay Doolan, the director of standards and professional
development with the New Jersey Department of Education. “We feel
it's not only appropriate for competing in a shrinking world, but that
learning another language also helps you do better in your first
language.''
Because only the token-types require processing, the application
uses logarithmic time O(log n), where n is the number of vertices.
The algorithm to calculate the topical facets visits all informative
arcs in constant time. It contributes O(m) at most, where m is the
number of arcs. The total time complexity of the application with
regard to the network and the topical facet layer building is O(m +
log n).
Since it is executed for each arc of the network G at most, the total
time complexity of a topical search is O(max(m, n)).
In a connected network m ≥ n – 1, so O(max(m, n)) = O(m). In real
situations the input size is expected to be md << m.
XII. Computational Complexity
[Figure: growth curve, y-axis 80000 to 330000, x-axis 1 to 29; r² = 0.98]
XII. Computational Complexity – Diminishing growth rate
[Figure: cumulative share of links (0% to 100%) against token-types;
123 (2.6%) different token-types produce 50% of the links]
XII. Computational Complexity – Heavily clustered
 Topical Facets provide a flexible document
description
 Batch operation on bulk data is possible
 Incremental information density
 Time is accounted for: decaying information value,
information lifecycle and information half-life
 Topical Facet layer exists independently from the user
and prior to any request
 New approach: nobody uses an unrestricted text
network for information discovery, retrieval or content
management, yet…
“UTAA (Unstructured Text Acquisition and Analysis) thus
seems poised for rapid entry into the mainstream of business
information systems applications”
Kuechler, W. L. (2007). Business Applications of Unstructured Text. Communications of the ACM,
50(10), 86-93.
UTAA Application                        Textual Data Source
Business Intelligence                   Web, Industry blogs, online databases
Customer relationship management        Customer feedback, help desk reports
Regulatory compliance                   Internally generated electronic documents
Intellectual property management        Web, copyright and patent databases
Call support (help desk applications)   Call documentation, customer feedback, email, online manuals
Accounts payable/receivable analysis    Invoices, customer and vendor correspondence
Legal department support                Legal databases, specific streams of organizational communications
Research support                        Domain specific journals
Securities and asset management         Unstructured economic data
General news monitoring                 News streams
“ Interpretation would be impossible if the life-
expressions were totally alien.
It would be unnecessary if there were nothing alien
about them.
It must therefore lie between the two extremes. ”
Wilhelm Dilthey, 1833-1911
The Ghent police commissioner Steven De Smet had to appear in court because in June 2011
he transferred an internal report about football riots to a CD&V politician. The politician leaked
its contents to the media. Subsequently, a complaint for breach of professional secrecy was
filed. The investigation led to the mailbox of De Smet.
(DS March 31,2012)
