A hierarchical approach for semi structured document indexing and

University of Tunis El Manar, Faculty of sciences of Tunis

A Hierarchical Approach for Semi-Structured Document Indexing and Terminology Extraction

By Ibrahim Bounhas & Yahya Slimani

Outline
• Introduction
• Related work
• The proposed approach
– Logical Structure Extraction
– Hierarchical Indexing
– Logical Structure Mining
– Semantic Relations Inferring
• Experimentation and Results

Introduction
• Structured information retrieval: documents have a structure (e.g. papers, books, encyclopaedias):
1. How to index documents and fragments?
2. How to exploit the structure for terminology extraction and organisation?
3. How to model queries?
4. What to retrieve? A document, a section, a paragraph, etc.
5. How to evaluate results?

Related work
• Indexing methods
– Disjoint units: the structure of documents is ignored
→ Unable to characterize and retrieve internal nodes.
– Path indexing: identifies the terms that discriminate each path
→ Unable to characterize and retrieve internal nodes.
→ Unable to discriminate between nodes that have the same path but different content.
– Bottom-up index propagation: terms are propagated from child nodes to their parents using statistical measures (e.g. tf-idf, entropy)
→ Unable to distinguish the terms that index composite nodes at the same level.
→ Term weighting does not take the levels of text nodes into account.

Types of Structural Information
– Physical structure: physical attributes constitute the context of text nodes, e.g. HTML or XHTML tags
→ Tags are very ambiguous: the same tag may have different usages for different authors.
– Content tags: XML tags describe the logical role of each node
→ Logical tags do not characterize the content semantically.
– Logical structure: the hierarchy of titles of the document is explicitly represented.

Structure-Based Knowledge Extraction
• Existing approaches:
– Extract instances of the concepts of a given ontology (Kuo et al., 2006)
→ Needs an existing ontology and highly structured documents.
– Concept extraction based on HTML tags (Kruschwitz, 2005; Karoui, 2008)
→ No relations between concepts.
– Co-hyponym detection from XHTML pages (Brunzel and Spiliopoulou, 2006)
→ No vertical relations.

1. Logical Structure Extraction (Bounhas and Slimani, 2009a):
– Physical analyser: identifies physical blocks and font styles.
– Macro-logical analyser: assigns a level to each text block.
→ The title of the document gets level M and paragraphs get level 1.

2. Hierarchical Indexing with top-down propagation:
a. The number of occurrences:
$$\mathrm{LogOcc}(t, d) = \sum_i \mathrm{occ}(t, nd_i) \cdot \mathrm{level}(nd_i)$$
b. The frequency:
$$\mathrm{LogFreq}(t, d) = \frac{\mathrm{LogOcc}(t, d)}{\sum_i \mathrm{LogOcc}(t_i, d)}$$
c. Apply TF-IDF for documents:
$$\mathrm{LogWeight}(t, d) = \mathrm{LogFreq}(t, d) \cdot \log(N / N_t)$$
d. Apply the same formulae iteratively to index nodes: define a threshold for index propagation from a node to its children.
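Steps a to c can be sketched in Python as follows. This is only an illustrative reading of the formulae, not the authors' implementation: the representation of a document as a list of (level, tokens) pairs, the function names, and the interpretation of N as the number of documents and N_t as the number of documents containing t (the usual IDF convention) are assumptions. Step d is not covered.

```python
import math
from collections import defaultdict

def log_occ(nodes):
    """LogOcc: each occurrence of a term counts level(node) times.
    nodes is a list of (level, tokens) pairs (hypothetical representation)."""
    scores = defaultdict(float)
    for level, tokens in nodes:
        for tok in tokens:
            scores[tok] += level
    return dict(scores)

def log_freq(occ):
    """LogFreq: LogOcc normalised by the sum of LogOcc over all terms."""
    total = sum(occ.values())
    return {t: v / total for t, v in occ.items()} if total else {}

def log_weight(corpus):
    """LogWeight: LogFreq * log(N / N_t) for every document of the corpus.
    Assumes N = number of documents, N_t = documents containing t."""
    freqs = {d: log_freq(log_occ(nodes)) for d, nodes in corpus.items()}
    n_docs = len(corpus)
    doc_count = defaultdict(int)
    for freq in freqs.values():
        for t in freq:
            doc_count[t] += 1
    return {
        d: {t: v * math.log(n_docs / doc_count[t]) for t, v in freq.items()}
        for d, freq in freqs.items()
    }

# Toy corpus: a level-3 title and a level-1 paragraph, then a level-2 heading
# and a level-1 paragraph.
corpus = {
    "d1": [(3, ["lion"]), (1, ["lion", "sea"])],
    "d2": [(2, ["frog"]), (1, ["sea"])],
}
weights = log_weight(corpus)
```

In this toy corpus, "sea" occurs in both documents, so its weight is zero, while "lion" benefits from its occurrence in the level-3 title.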

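3. Logical Structure Mining → a contextual network of terms
– The Ancestor-child Relation
$$\forall nd_i \in d,\ \forall nd_j \in d,\ \mathrm{path}(nd_i, nd_j),\ \mathrm{level}(nd_i) > \mathrm{level}(nd_j)$$
$$\forall t_i \in nd_i,\ \forall t_j \in nd_j,\ t_i \neq t_j \Rightarrow \mathrm{Sup}(t_i, t_j) = \frac{\mathrm{LogWeight}(t_j, nd_j)}{\mathrm{level}(nd_i) - \mathrm{level}(nd_j)}$$
– The Brother Relation
$$\forall p \in d,\ \mathrm{level}(p) = 1$$
$$\forall t_i \in p,\ \forall t_j \in p,\ t_i \neq t_j \Rightarrow \mathrm{sim}(t_j, t_i) = \frac{\mathrm{weight}(t_i, p) + \mathrm{weight}(t_j, p)}{2}$$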
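The two relations can be illustrated with a small Python sketch. The data layout (parallel lists of levels, parent indices and per-node term weights) and the function names are hypothetical. The slides also do not say how to combine scores when the same term pair is found in several places, so keeping the maximum is an assumption.

```python
from itertools import combinations

def ancestors(idx, parent):
    """Yield the indices of all ancestors of node idx (parent is None at the root)."""
    p = parent[idx]
    while p is not None:
        yield p
        p = parent[p]

def sup_relations(levels, parent, weights):
    """Sup(ti, tj) = LogWeight(tj, ndj) / (level(ndi) - level(ndj)),
    for ti in an ancestor node ndi and tj in a descendant node ndj."""
    sup = {}
    for j in range(len(levels)):
        for i in ancestors(j, parent):
            gap = levels[i] - levels[j]
            if gap <= 0:  # the relation requires level(ndi) > level(ndj)
                continue
            for ti in weights[i]:
                for tj, w in weights[j].items():
                    if ti != tj:
                        # Assumption: keep the highest score for a repeated pair.
                        sup[(ti, tj)] = max(w / gap, sup.get((ti, tj), 0.0))
    return sup

def brother_relations(levels, weights):
    """sim(ti, tj) = mean of the two term weights inside a level-1 node (paragraph)."""
    sim = {}
    for level, w in zip(levels, weights):
        if level != 1:
            continue
        for ti, tj in combinations(sorted(w), 2):
            score = (w[ti] + w[tj]) / 2
            # Assumption: keep the highest score for a repeated pair.
            sim[(ti, tj)] = max(score, sim.get((ti, tj), 0.0))
    return sim

# Toy document: title (level 3) > section heading (level 2) > paragraph (level 1).
levels = [3, 2, 1]
parent = [None, 0, 1]
weights = [{"animal": 0.5}, {"amphibians": 0.4}, {"frog": 0.6, "toad": 0.2}]
sup = sup_relations(levels, parent, weights)
sim = brother_relations(levels, weights)
```

Here "frog" is linked to "amphibians" with a score of 0.6 (one level apart) and to "animal" with a score of 0.3 (two levels apart), which shows how the level difference dampens distant ancestors.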

4. Semantic Relations Inferring
– Taxonomy Construction
• Select the root:
$$\mathrm{RootWeight}(t) = \sum_i \mathrm{Sup}(t, t_i)$$
• Build the other levels by using the “Sup” relation
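A minimal Python sketch of the root selection follows. The RootWeight sum comes from the formula above; the rule used to build the other levels (attach each term to the candidate parent with the highest Sup score) is only one plausible reading, since the slides do not give the exact procedure. The function names and the input dictionary are hypothetical, and no cycle handling is attempted.

```python
from collections import defaultdict

def root_weights(sup):
    """RootWeight(t) = sum of Sup(t, ti) over all terms ti related to t."""
    rw = defaultdict(float)
    for (t, _child), score in sup.items():
        rw[t] += score
    return dict(rw)

def build_taxonomy(sup):
    """Pick the term with the highest RootWeight as root, then attach every
    other term to its best-scoring parent (assumed rule, see lead-in)."""
    rw = root_weights(sup)
    root = max(rw, key=rw.get)
    best = {}
    for (parent, child), score in sup.items():
        if child == root:
            continue
        if child not in best or score > best[child][1]:
            best[child] = (parent, score)
    return root, {child: parent for child, (parent, _score) in best.items()}

# Hypothetical Sup scores keyed by (ancestor term, descendant term).
sup = {
    ("animal", "amphibians"): 0.5,
    ("amphibians", "frog"): 0.6,
    ("amphibians", "toad"): 0.2,
    ("animal", "frog"): 0.3,
    ("animal", "toad"): 0.1,
}
root, parents = build_taxonomy(sup)
```

With these scores "animal" becomes the root, "amphibians" is attached to it, and "frog" and "toad" are attached to "amphibians".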

– Similarity Measuring and Points of View Study
The LLR score
$$\mathrm{LLR}(u, v) = -2 \log \frac{L(O_{11}, C_1, r) \cdot L(O_{12}, C_2, r)}{L(O_{11}, C_1, r_1) \cdot L(O_{12}, C_2, r_2)}$$
where
$$L(k, n, r) = r^k (1 - r)^{n - k}$$
$$r = R_1 / N,\quad r_1 = O_{11} / C_1,\quad r_2 = O_{12} / C_2$$
The contingency table
           t1 = v    t1 ≠ v
t2 = u     O11       O12
t2 ≠ u     O21       O22
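The LLR score can be computed from the four cells of the contingency table, as in the sketch below. It works in log space to avoid underflow for large counts. The slides do not define C1, C2, R1 and N; reading them as the column totals, the first-row total and the grand total of the table is an assumption, as are the function names.

```python
import math

def log_l(k, n, r):
    """log of L(k, n, r) = r**k * (1 - r)**(n - k), taking 0 * log(0) as 0."""
    out = 0.0
    if k > 0:
        out += k * math.log(r)
    if n - k > 0:
        out += (n - k) * math.log(1 - r)
    return out

def llr(o11, o12, o21, o22):
    """Log-likelihood ratio score of a 2x2 contingency table."""
    c1, c2 = o11 + o21, o12 + o22  # column totals (assumed meaning of C1, C2)
    row1 = o11 + o12               # first-row total (assumed meaning of R1)
    n = c1 + c2                    # grand total (assumed meaning of N)
    if c1 == 0 or c2 == 0:
        raise ValueError("both column totals must be non-zero")
    r, r1, r2 = row1 / n, o11 / c1, o12 / c2
    return -2 * (
        log_l(o11, c1, r) + log_l(o12, c2, r)
        - log_l(o11, c1, r1) - log_l(o12, c2, r2)
    )

independent = llr(10, 20, 30, 60)  # same proportion in both columns
associated = llr(30, 5, 5, 60)     # strong co-occurrence
```

When both columns have the same proportion of first-row observations, the numerator and denominator coincide and the score is zero; the more the proportions differ, the higher the score.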

Experimentation and Results
• The corpus: 182 Web pages (11.61 MB)
• Preparing the corpus: lemmatisation, stop-word removal and multi-word term (MWT) handling, using an integrated linguistic toolbox (Bounhas and Slimani, 2009b)

• Evaluation metrics: precision, recall and F-measure in terms of the extracted relations.
• The propagation threshold
[Figure: recall, precision and F-measure curves; vertical axis from 0 to 1.2, horizontal axis from 0 to 0.8]
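As a reminder of how the three metrics are obtained, the sketch below compares a set of extracted relations with a reference set. The function name and the sample relations are hypothetical and are not taken from the experiments.

```python
def evaluate(extracted, reference):
    """Precision, recall and F-measure of extracted relations against a reference."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical (hypernym, hyponym) pairs.
extracted = {("animal", "frog"), ("animal", "toad"), ("insect", "frog")}
reference = {("animal", "frog"), ("animal", "toad"),
             ("animal", "seal"), ("animal", "lion")}
precision, recall, f_measure = evaluate(extracted, reference)
```

Two of the three extracted pairs are correct and two of the four reference pairs are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.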

Experimentation and Results

Hypernyms of “أسد البحر” (sea lion)
Score    Term
5.25     برمائيات (amphibians)
2.08     حيوانات مفترسة (carnivores)
0.02     حيوان (animal)

Hyponyms of “برمائيات” (amphibians)
Score    Term
9.04     سمندر (salamander)
9.04     ضفدع (frog)
9.04     علجوم (toad)
9.04     علجوم أخضر (green toad)
9.04     علجوم سورينام (Surinam toad)
9.04     فيل البحر (elephant seal)
5.26     أسد البحر (sea lion)
5.26     فقمة (seal)

Terms ranked by the RootWeight
RootWeight    Term
293.30        حيوان (animal)
54.49         حبليات (chordates)
50.80         حيوانات مائية (marine animals)
46.77         حشرة (insect)
41            [term missing]
38            حيوانات برية (terrestrial animals)
32            حيوانات عاشبة (herbivores)
25.26         حافريات (ungulates)
25            رأسيات الأرجل (cephalopods)
24.37         فقاريات (vertebrates)
22.69         ثانويات الفم (deuterostomes)
16.26         بقريات (bovids)
15            ذوات القوقعة النابية (tusk shells)
…

Experimentation and Results
Similarity Measuring and Points of View Study

Similarity matrix (row and column labels missing):
166.96    10.18     0        14.45    0
10.18     141.45    0.54     0        0.7
0         0.54      62.66    3.94     0
14.45     0         3.94     44.44    0.2
0         0.7       0        0.2      51.11

• Two possible classifications:
– حيوانات مائية (marine animals), حيوانات برية (terrestrial animals), برمائيات (amphibians)
– حيوانات مفترسة (carnivores), حيوانات عاشبة (herbivores)

Conclusion and future work
• Results:
– A model for semi-structured document analysis and indexing
– An approach for structure-based terminology extraction and organisation
→ A toolbox for document analysis and knowledge extraction
• Future work
– Export knowledge as an ontology
– How to model queries?
– What to retrieve? A document, a section, a paragraph, etc.
– How to evaluate results?
