University of Tunis El Manar, Faculty of Sciences of Tunis

A Hierarchical Approach for Semi-Structured Document Indexing and Terminology Extraction

By Ibrahim Bounhas & Yahya Slimani
Outline
• Introduction
• Related work
• The proposed approach
– Logical Structure Extraction
– Hierarchical Indexing
– Logical Structure Mining
– Semantic Relations Inferring
• Experimentation and Results
Introduction
• Structured information retrieval: documents have a structure (e.g. papers, books, encyclopaedias):
1. How to index documents and fragments?
2. How to exploit the structure for terminology extraction and organisation?
3. How to model queries?
4. What to retrieve: a document, a section, a paragraph, etc.?
5. How to evaluate results?
Related work
• Methods of indexing
– Disjoint units: ignore the structure of documents.
  → Unable to characterize and retrieve internal nodes.
– Path indexing: identifies the terms that discriminate each path.
  → Unable to characterize and retrieve internal nodes.
  → Unable to discriminate nodes having the same path and different content.
– Bottom-up index propagation: terms are propagated from child nodes to their parents using statistical measures (e.g. tf-idf, entropy).
  → Unable to distinguish terms which index composite nodes having the same level.
  → Term weighting does not take into account the levels of text nodes.
Type of Structural Information
• Types of structural information
– Physical structure: physical attributes constitute the context of text nodes (e.g. HTML or XHTML tags).
  → Tags are very ambiguous: the same tag may have different usages for different authors.
– Content tags: XML tags describe the logical role of each node.
  → Logical tags do not characterize the content semantically.
– Logical structure: the hierarchy of titles of the document is explicitly represented.
Structure Based Knowledge Extraction
• Existing approaches:
– Extract instances of the concepts of a given ontology (Kuo et al., 2006).
  → Needs an existing ontology and highly structured documents.
– Concept extraction based on HTML tags (Kruschwitz, 2005; Karoui, 2008).
  → No relations between concepts.
– Co-hyponym detection from XHTML pages (Brunzel and Spiliopoulou, 2006).
  → No vertical relations.
The proposed approach
1. Logical Structure Extraction (Bounhas and Slimani, 2009a):
– Physical analyser: identifies physical blocks and font styles.
– Macro-logical analyser: attributes a level to each text block.
  → The title of the document gets the level M and paragraphs get the level 1.
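The slides do not say how the macro-logical analyser derives the levels. Purely as an illustration, the sketch below assumes that each block carries a font size and that levels follow the rank of that size, so the largest size gets level M and body text gets level 1. The function name `assign_levels` and the `(text, font_size)` block format are hypothetical.

```python
def assign_levels(blocks):
    """Assign a logical level to each (text, font_size) block.

    Assumption (not from the slides): the level is the rank of the font size,
    so the smallest size (paragraphs) gets 1 and the largest (title) gets M.
    """
    sizes = sorted({size for _text, size in blocks})
    rank = {size: i + 1 for i, size in enumerate(sizes)}
    return [(text, rank[size]) for text, size in blocks]


# Hypothetical page: one title, two section headings, two paragraphs.
blocks = [
    ("Amphibians", 24.0),
    ("Frogs", 16.0),
    ("Frogs live near water.", 11.0),
    ("Toads", 16.0),
    ("Toads have dry skin.", 11.0),
]
levels = assign_levels(blocks)
```

Here M equals 3 because three distinct font sizes occur; a real analyser would probably also use font styles and block positions.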
The proposed approach
2. Hierarchical Indexing with top-down propagation:
a. The number of occurrences:
$$\mathrm{LogOcc}(t, d) = \sum_i \mathrm{occ}(t, nd_i) \cdot \mathrm{level}(nd_i)$$
b. The frequency:
$$\mathrm{LogFreq}(t, d) = \frac{\mathrm{LogOcc}(t, d)}{\sum_i \mathrm{LogOcc}(t_i, d)}$$
c. Apply TF-IDF for documents:
$$\mathrm{LogWeight}(t, d) = \mathrm{LogFreq}(t, d) \cdot \log(N / N_t)$$
d. Apply the same formulae iteratively to index nodes: define a threshold for index propagation from a node to its children.
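Steps a to c can be sketched in a few lines of Python. This is a minimal illustration, assuming that a document is a list of `(level, tokens)` nodes and that N_t is the number of documents containing the term; the function names are hypothetical, and step d (threshold-based propagation to child nodes) is left out.

```python
import math
from collections import Counter


def log_occ(nodes):
    """LogOcc: occurrences of each term, weighted by the level of the node."""
    occ = Counter()
    for level, tokens in nodes:
        for term in tokens:
            occ[term] += level
    return occ


def log_freq(occ):
    """LogFreq: LogOcc normalised by the sum of LogOcc over all terms."""
    total = sum(occ.values())
    return {term: value / total for term, value in occ.items()}


def log_weight(docs):
    """LogWeight: LogFreq times log(N / N_t), computed for every document."""
    n_docs = len(docs)
    freqs = [log_freq(log_occ(doc)) for doc in docs]
    doc_freq = Counter(term for freq in freqs for term in freq)
    return [
        {term: f * math.log(n_docs / doc_freq[term]) for term, f in freq.items()}
        for freq in freqs
    ]


# Two toy documents; each node is (level, tokens).
docs = [
    [(3, ["animal"]), (1, ["animal", "frog"])],
    [(2, ["plant"]), (1, ["animal"])],
]
weights = log_weight(docs)
```

In the first toy document "animal" occurs once at level 3 and once at level 1, so its LogOcc is 4. Because it also appears in the second document, its final weight is 0, while "frog" keeps a positive weight.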
The proposed approach
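3. Logical Structure Mining → a contextual network of terms
– The Ancestor-child Relation
$$\forall nd_i \in d,\ \forall nd_j \in d,\ \mathrm{path}(nd_i, nd_j),\ \mathrm{level}(nd_i) > \mathrm{level}(nd_j):$$
$$\forall t_i \in nd_i,\ \forall t_j \in nd_j,\ t_i \neq t_j \Rightarrow \mathrm{Sup}(t_i, t_j) = \frac{\mathrm{LogWeight}(t_j, nd_j)}{\mathrm{level}(nd_i) - \mathrm{level}(nd_j)}$$
– The Brother Relation
$$\forall p \in d,\ \mathrm{level}(p) = 1:$$
$$\forall t_i \in p,\ \forall t_j \in p,\ t_i \neq t_j \Rightarrow \mathrm{sim}(t_j, t_i) = \frac{\mathrm{weight}(t_i, p) + \mathrm{weight}(t_j, p)}{2}$$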
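The two relations can be illustrated with a small tree walk. The sketch assumes a hypothetical `Node` class holding a level, a term-to-weight dictionary and child nodes. The slides do not say how scores are combined when the same term pair occurs in several node pairs; summing them is an assumption made here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Node:
    level: int
    weights: dict  # term -> weight of the term in this node
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node reachable from `node` through child links."""
    for child in node.children:
        yield child
        yield from descendants(child)


def all_nodes(root):
    yield root
    yield from descendants(root)


def sup_relation(root):
    """Sup(ti, tj) for ti in an ancestor node and tj in a lower-level descendant."""
    sup = defaultdict(float)
    for ancestor in all_nodes(root):
        for node in descendants(ancestor):
            gap = ancestor.level - node.level
            if gap <= 0:
                continue
            for ti in ancestor.weights:
                for tj, weight in node.weights.items():
                    if ti != tj:
                        sup[(ti, tj)] += weight / gap  # summing is an assumption
    return dict(sup)


def brother_relation(root):
    """sim(ti, tj) for distinct terms sharing a level-1 node (a paragraph)."""
    sim = defaultdict(float)
    for node in all_nodes(root):
        if node.level != 1:
            continue
        for ti, tj in permutations(node.weights, 2):
            sim[(ti, tj)] += (node.weights[ti] + node.weights[tj]) / 2
    return dict(sim)


# Toy document: a level-3 title above one level-1 paragraph.
paragraph = Node(1, {"frog": 0.4, "toad": 0.2})
title = Node(3, {"amphibians": 0.9}, [paragraph])
sup = sup_relation(title)
sim = brother_relation(title)
```

With this toy input, "amphibians" dominates "frog" with a score of 0.4 / (3 - 1) = 0.2, and "frog" and "toad" are brothers with a similarity of 0.3.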
The proposed approach
4. Semantic Relations Inferring
– Taxonomy Construction
• Select the root:
$$\mathrm{RootWeight}(t) = \sum_i \mathrm{Sup}(t, t_i)$$
• Build the other levels by using the “Sup” relation
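Root selection can be sketched as follows, assuming the Sup scores are available as a dictionary keyed by `(ancestor_term, descendant_term)` pairs. The function names and the toy scores are hypothetical, and the construction of the lower taxonomy levels is not shown because the slides give no further detail.

```python
from collections import defaultdict


def root_weights(sup):
    """RootWeight(t): sum of Sup(t, ti) over every term ti that t dominates."""
    weights = defaultdict(float)
    for (term, _dominated), score in sup.items():
        weights[term] += score
    return dict(weights)


def select_root(sup):
    """Pick the term with the highest RootWeight as the taxonomy root."""
    weights = root_weights(sup)
    return max(weights, key=weights.get)


# Hypothetical Sup scores, for illustration only.
sup_scores = {
    ("animal", "amphibians"): 0.7,
    ("animal", "frog"): 0.5,
    ("amphibians", "frog"): 0.4,
}
ranking = root_weights(sup_scores)
root = select_root(sup_scores)
```

Sorting `ranking` by decreasing value gives a list of root candidates ordered by RootWeight.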

– Similarity Measuring and Points of View Study
The LLR score:
$$\mathrm{LLR}(u, v) = -2 \log \frac{L(O_{11}, C_1, r) \cdot L(O_{12}, C_2, r)}{L(O_{11}, C_1, r_1) \cdot L(O_{12}, C_2, r_2)}$$
where
$$L(k, n, r) = r^k (1 - r)^{n - k}$$
$$r = R_1 / N,\quad r_1 = O_{11} / C_1,\quad r_2 = O_{12} / C_2$$
The contingency table:
          t1 = v    t1 ≠ v
t2 = u    O11       O12
t2 ≠ u    O21       O22
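The LLR score can be computed directly from the four cells of the contingency table. The slide does not define C1, C2, R1 and N; the sketch assumes the usual marginals, i.e. C1 = O11 + O21, C2 = O12 + O22, R1 = O11 + O12 and N the sum of all cells. It works with logarithms of L to avoid numerical underflow.

```python
import math


def log_l(k, n, r):
    """Logarithm of L(k, n, r) = r**k * (1 - r)**(n - k); 0 * log(0) counts as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(r)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - r)
    return total


def llr(o11, o12, o21, o22):
    """Log-likelihood ratio of a 2x2 contingency table."""
    c1, c2 = o11 + o21, o12 + o22  # assumed column totals C1, C2
    row1 = o11 + o12               # assumed first-row total R1
    n = c1 + c2                    # assumed grand total N
    if n == 0:
        return 0.0
    r = row1 / n
    r1 = o11 / c1 if c1 else 0.0
    r2 = o12 / c2 if c2 else 0.0
    null = log_l(o11, c1, r) + log_l(o12, c2, r)
    alternative = log_l(o11, c1, r1) + log_l(o12, c2, r2)
    return -2.0 * (null - alternative)


independent = llr(10, 20, 30, 60)  # same proportion in both columns
associated = llr(50, 5, 5, 50)     # u and v mostly occur together
```

When the two columns have the same proportion of u the score is zero; the stronger the association between u and v, the larger the score.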
Experimentation and Results
• The corpus: 182 Web pages (11.61 MB)
• Preparing the corpus: lemmatisation, stop-word removal and multi-word term (MWT) handling, using an integrated linguistic toolbox (Bounhas and Slimani, 2009b)

• Evaluation metrics: precision, recall and F-measure in terms of the extracted relations.
• The propagation threshold
[Figure: recall, precision and F-measure (y-axis, 0 to 1,2) plotted against the propagation threshold (x-axis, 0 to 0,8).]
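As a reminder of how these metrics apply to relations, the sketch below compares a set of extracted relations with a reference set. The relation triples are hypothetical; the slides do not describe the reference set or how a match is decided, so exact equality of triples is assumed.

```python
def evaluate(extracted, reference):
    """Precision, recall and F-measure of extracted relations against a reference."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical relations written as (hypernym, relation, hyponym) triples.
extracted = {
    ("animal", "sup", "amphibians"),
    ("amphibians", "sup", "frog"),
    ("amphibians", "sup", "seal"),
    ("insect", "sup", "frog"),
}
reference = {
    ("animal", "sup", "amphibians"),
    ("amphibians", "sup", "frog"),
    ("amphibians", "sup", "toad"),
}
precision, recall, f_measure = evaluate(extracted, reference)
```

Running the evaluation once per propagation threshold would give one point per curve in a plot like the one on this slide.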
Experimentation and Results

Hypernyms of “أسد البحر” (lion of the sea):
Term | Score
برمائيات (amphibians) | 5,25
حيوانات مفترسة (carnivores) | 2,08
حيوان (animal) | 0,02

Hyponyms of “برمائيات” (amphibians):
Term | Score
سمندر (salamander) | 9,04
ضفدع (frog) | 9,04
علجوم (toad) | 9,04
علجوم أخضر (green toad) | 9,04
علجوم سورينام (Surinam toad) | 9,04
فيل البحر (elephant seal) | 9,04
أسد البحر (lion of the sea) | 5,26
فقمة (seal) | 5,26

Terms ranked by the RootWeight:
Term | RootWeight
حيوان (animal) | 293,30
حبليات (chordates) | 54,49
حيوانات مائية (marine animals) | 50,80
حشرة (insect) | 46,77
حيوانات مفترسة (carnivores) | 41
حيوانات برية (terrestrial animals) | 38
حيوانات عاشبة (herbivores) | 32
حافريات (ungulates) | 25,26
رأسيات الأرجل (cephalopods) | 25
فقاريات (vertebrates) | 24,37
ثانويات الفم (deuterostomes) | 22,69
بقريات (bovids) | 16,26
ذوات القوقعة النابية (tusk shells) | 15
…
Experimentation and Results
Similarity Measuring and Points of View Study

Term | marine animals | carnivores | terrestrial animals | herbivores | amphibians
حيوانات مائية (marine animals) | 166,96 | 10,18 | 0 | 14,45 | 0
حيوانات مفترسة (carnivores) | 10,18 | 141,45 | 0,54 | 0 | 0,7
حيوانات برية (terrestrial animals) | 0 | 0,54 | 62,66 | 3,94 | 0
حيوانات عاشبة (herbivores) | 14,45 | 0 | 3,94 | 44,44 | 0,2
برمائيات (amphibians) | 0 | 0,7 | 0 | 0,2 | 51,11

• Two possible classifications:
– حيوانات مائية (marine animals), حيوانات برية (terrestrial animals), برمائيات (amphibians)
– حيوانات مفترسة (carnivores), حيوانات عاشبة (herbivores)
Conclusion and future work
• Results:
– A model for semi-structured document analysis and indexing
– An approach for structure-based terminology extraction and organisation
  → A toolbox for document analysis and knowledge extraction

• Future work
– Export knowledge as an ontology
– How to model queries?
– What to retrieve: a document, a section, a paragraph, etc.?
– How to evaluate results?

More Related Content

What's hot

`deep' semantics in the geosciences: semantic building blocks for a complete ...
`deep' semantics in the geosciences: semantic building blocks for a complete ...`deep' semantics in the geosciences: semantic building blocks for a complete ...
`deep' semantics in the geosciences: semantic building blocks for a complete ...
the university of auckland
 
C Omega
C OmegaC Omega
C Omega
iradarji
 
The Maze of Deletion in Ontology Stream Reasoning
The Maze of Deletion in Ontology Stream Reasoning The Maze of Deletion in Ontology Stream Reasoning
The Maze of Deletion in Ontology Stream Reasoning
Jeff Z. Pan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
Dmitry Kan
 
Data Structures and Algorithm - Week 4 - Trees, Binary Trees
Data Structures and Algorithm - Week 4 - Trees, Binary TreesData Structures and Algorithm - Week 4 - Trees, Binary Trees
Data Structures and Algorithm - Week 4 - Trees, Binary Trees
Ferdin Joe John Joseph PhD
 
Complex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesComplex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype Properties
Besnik Fetahu
 
Week 1 - Data Structures and Algorithms
Week 1 - Data Structures and AlgorithmsWeek 1 - Data Structures and Algorithms
Week 1 - Data Structures and Algorithms
Ferdin Joe John Joseph PhD
 
score based ranking of documents
score based ranking of documentsscore based ranking of documents
score based ranking of documents
Kriti Khanna
 
Hierarchical Dirichlet Process
Hierarchical Dirichlet ProcessHierarchical Dirichlet Process
Hierarchical Dirichlet Process
Sangwoo Mo
 
Week 2 - Data Structures and Algorithms
Week 2 - Data Structures and AlgorithmsWeek 2 - Data Structures and Algorithms
Week 2 - Data Structures and Algorithms
Ferdin Joe John Joseph PhD
 
Csc307
Csc307Csc307

What's hot (12)

`deep' semantics in the geosciences: semantic building blocks for a complete ...
`deep' semantics in the geosciences: semantic building blocks for a complete ...`deep' semantics in the geosciences: semantic building blocks for a complete ...
`deep' semantics in the geosciences: semantic building blocks for a complete ...
 
C Omega
C OmegaC Omega
C Omega
 
Quiz
QuizQuiz
Quiz
 
The Maze of Deletion in Ontology Stream Reasoning
The Maze of Deletion in Ontology Stream Reasoning The Maze of Deletion in Ontology Stream Reasoning
The Maze of Deletion in Ontology Stream Reasoning
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Data Structures and Algorithm - Week 4 - Trees, Binary Trees
Data Structures and Algorithm - Week 4 - Trees, Binary TreesData Structures and Algorithm - Week 4 - Trees, Binary Trees
Data Structures and Algorithm - Week 4 - Trees, Binary Trees
 
Complex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesComplex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype Properties
 
Week 1 - Data Structures and Algorithms
Week 1 - Data Structures and AlgorithmsWeek 1 - Data Structures and Algorithms
Week 1 - Data Structures and Algorithms
 
score based ranking of documents
score based ranking of documentsscore based ranking of documents
score based ranking of documents
 
Hierarchical Dirichlet Process
Hierarchical Dirichlet ProcessHierarchical Dirichlet Process
Hierarchical Dirichlet Process
 
Week 2 - Data Structures and Algorithms
Week 2 - Data Structures and AlgorithmsWeek 2 - Data Structures and Algorithms
Week 2 - Data Structures and Algorithms
 
Csc307
Csc307Csc307
Csc307
 

Similar to A hierarchical approach for semi structured document indexing and

Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
Kira
 
Discovery Tools for Open Access Repositories: A Literature Mapping
Discovery Tools for Open Access Repositories: A Literature MappingDiscovery Tools for Open Access Repositories: A Literature Mapping
Discovery Tools for Open Access Repositories: A Literature Mapping
Grial - University of Salamanca
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Khirulnizam Abd Rahman
 
A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries. A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries.
Technological Ecosystems for Enhancing Multiculturality
 
An Automatic Question Paper Generation : Using Bloom's Taxonomy
An Automatic Question Paper Generation : Using Bloom's   TaxonomyAn Automatic Question Paper Generation : Using Bloom's   Taxonomy
An Automatic Question Paper Generation : Using Bloom's Taxonomy
IRJET Journal
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Francesco Osborne
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
Rinke Hoekstra
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
Jeff Z. Pan
 
Ontology Mapping
Ontology MappingOntology Mapping
Ontology Mapping
samhati27
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
Keerti Bhogaraju
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
Francesco Osborne
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
National Institute of Informatics
 
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
Università degli Studi di Modena e Reggio Emilia/Tallinn University
 
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
Jennifer D'Souza
 
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
cscpconf
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
csandit
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Guestion paper
Guestion paperGuestion paper
Guestion paper
Ezhilarasan Elumalai
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
Lisa Graves
 
Part 1 Research workshop
Part 1 Research workshopPart 1 Research workshop
Part 1 Research workshop
Researchworkshop
 

Similar to A hierarchical approach for semi structured document indexing and (20)

Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Discovery Tools for Open Access Repositories: A Literature Mapping
Discovery Tools for Open Access Repositories: A Literature MappingDiscovery Tools for Open Access Repositories: A Literature Mapping
Discovery Tools for Open Access Repositories: A Literature Mapping
 
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
 
A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries. A spatio-temporal visual analysis tool for historical dictionaries.
A spatio-temporal visual analysis tool for historical dictionaries.
 
An Automatic Question Paper Generation : Using Bloom's Taxonomy
An Automatic Question Paper Generation : Using Bloom's   TaxonomyAn Automatic Question Paper Generation : Using Bloom's   Taxonomy
An Automatic Question Paper Generation : Using Bloom's Taxonomy
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
The Rise of Approximate Ontology Reasoning: Is It Mainstream Yet? --- Revisit...
 
Ontology Mapping
Ontology MappingOntology Mapping
Ontology Mapping
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
 
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
A SVM Applied Text Categorization of Academia-Industry Collaborative Research...
 
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
Learning Interactions Across Spaces: a Framework for Contextualised Multimoda...
 
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
Pattern-based Acquisition of Scientific Entities from Scholarly Article Title...
 
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
NOVELTY DETECTION VIA TOPIC MODELING IN RESEARCH ARTICLES
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Guestion paper
Guestion paperGuestion paper
Guestion paper
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
 
Part 1 Research workshop
Part 1 Research workshopPart 1 Research workshop
Part 1 Research workshop
 

Recently uploaded

AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 

Recently uploaded (20)

AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 

A hierarchical approach for semi structured document indexing and

  • 1. University of Tunis El Manar, Faculty of sciences of Tunis A hierarchical Approach for Semi-Structured Document Indexing and Terminology Extraction By Ibrahim Bounhas & Yahya Slimani
  • 2. Outline • • • – – – – • Introduction Related work The proposed approach Logical Structure Extraction Hierarchical Indexing Logical Structure Mining Semantic Relations Inferring Experimentation and Results
  • 3. Introduction • Structured information retrieval: Documents have a structure (e.g. papers, books, encyclopaedia): 1. How to index documents and fragments? 2. How to exploit the structure for terminology extraction and organisation? 3. How to model queries? 4. What to retrieve? a document, a section, a paragraph,…,etc. 5. How to evaluate results?
  • 4. Related work • Method of indexing – Disjoint units: ignore the structure of documents  Unable to characterize and retrieve internal nodes. – Path indexing: Path indexing consists in identifying terms that discriminate each path  Unable to characterize and retrieve internal nodes  Unable to discriminate nodes having the same path and different content. – Bottom-up index propagation: terms are propagated from child nodes to their parents using statistical measures (e.g. tf-idf, entropy)  Unable to distinguish terms which index composite nodes having the same level.  Terms weighting does not take into account levels of text nodes
  • 5. Type of Structural Information • Type of Structural Information – Physical structure: physical attributes constitute the context of text nodes: e.g. HTML or XHTML tags.  tags are very ambiguous: the same tag may have different usages for different authors. – Content tags: XML tags describe the logical role of each node  logical tags does not characterize semantically the content. – Logical structure: the hierarchy of titles of the document is explicitly represented.
  • 6. Structure Based Knowledge Extraction • Existing approaches: – Extract instances of the concepts of a given ontology (Kuo et al, 2006) → Need an existing ontology + highly structured documents. – Concept extraction based on HTML tags (Kruschwitz, 2005; Karoui, 2008) → No relations between concepts. – Co-hyponyms detection from XHTML pages (Brunzel and Spiliopoulou, 2006) → No vertical relations
  • 7. The proposed approach 1. Logical Structure Extraction (Bounhas and Slimani, 2009a): – Physical analyser: identifies physical blocks and font styles. – Macro-logical analyser: assigns a level to each text block → the title of the document gets the level M and paragraphs the level 1.
  • 8. The proposed approach 2. Hierarchical Indexing with propagation: a. The number of occurrences (top-down): $LogOcc(t, d) = \sum_i occ(t, nd_i) \cdot level(nd_i)$ b. The frequency: $LogFreq(t, d) = LogOcc(t, d) / \sum_i LogOcc(t_i, d)$ c. Apply TF-IDF for documents: $LogWeight(t, d) = LogFreq(t, d) \cdot \log(N / N_t)$ d. Apply the same formulae iteratively to index nodes: define a threshold for index propagation from a node to its children.
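The document-level formulas of this slide (steps a to c) can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the representation of a document as a list of `(level, tokens)` pairs, the function names, the natural logarithm and the toy corpus are all assumptions made for the example. Step d (threshold-based propagation to child nodes) is not covered.

```python
import math
from collections import Counter

def log_occ(nodes):
    """LogOcc(t, d): occurrences of each term, weighted by the level of the node.

    `nodes` is a list of (level, tokens) pairs (assumed representation).
    """
    totals = Counter()
    for level, tokens in nodes:
        for term, occ in Counter(tokens).items():
            totals[term] += occ * level
    return totals

def log_freq(occ):
    """LogFreq(t, d): LogOcc normalised by the sum of LogOcc over all terms of d."""
    total = sum(occ.values())
    if total == 0:
        return {}
    return {term: value / total for term, value in occ.items()}

def log_weight(corpus):
    """LogWeight(t, d) = LogFreq(t, d) * log(N / N_t) for every document of the corpus."""
    freqs = {doc: log_freq(log_occ(nodes)) for doc, nodes in corpus.items()}
    n_docs = len(corpus)                                   # N
    docs_with_term = Counter(t for f in freqs.values() for t in f)  # N_t
    return {
        doc: {t: fr * math.log(n_docs / docs_with_term[t]) for t, fr in f.items()}
        for doc, f in freqs.items()
    }

# Toy corpus (hypothetical): each document has a title node and a paragraph node.
corpus = {
    "d1": [(3, ["animal"]), (1, ["frog", "animal", "frog"])],
    "d2": [(2, ["plant"]), (1, ["plant", "leaf"])],
}
weights = log_weight(corpus)
```

In this toy corpus, "animal" occurs once in a level-3 title and once in a level-1 paragraph of `d1`, so its LogOcc is 3 + 1 = 4, twice that of "frog" even though "frog" occurs more often in the text.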
  • 9. The proposed approach 3. Logical Structure Mining → a contextual network of terms – The Ancestor-child Relation: $\forall nd_i \in d, \forall nd_j \in d$ such that $path(nd_i, nd_j)$ and $level(nd_i) > level(nd_j)$: $\forall t_i \in nd_i, \forall t_j \in nd_j, t_i \neq t_j \Rightarrow Sup(t_i, t_j) = LogWeight(t_j, nd_j) / (level(nd_i) - level(nd_j))$ – The Brother Relation: $\forall p \in d$ with $level(p) = 1$: $\forall t_i \in p, \forall t_j \in p, t_i \neq t_j \Rightarrow sim(t_j, t_i) = (weight(t_i, p) + weight(t_j, p)) / 2$
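The two relations of this slide can be illustrated with a small Python sketch. Several points are assumptions that the slide does not state: the `Node` class and its fields are hypothetical, `path(nd_i, nd_j)` is read as "nd_i is an ancestor of nd_j", and scores for a term pair found in several node pairs are summed.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Node:
    level: int                    # 1 for paragraphs, higher for titles
    weights: dict                 # term -> weight of the term in this node
    children: list = field(default_factory=list)

def descendants(node):
    """Yield every node below `node` in the logical structure."""
    for child in node.children:
        yield child
        yield from descendants(child)

def all_nodes(root):
    yield root
    yield from descendants(root)

def sup_relations(root):
    """Sup(t_i, t_j) = weight(t_j, nd_j) / (level(nd_i) - level(nd_j))."""
    sup = defaultdict(float)
    for ancestor in all_nodes(root):
        for desc in descendants(ancestor):
            gap = ancestor.level - desc.level
            if gap <= 0:
                continue
            for ti in ancestor.weights:
                for tj, w in desc.weights.items():
                    if ti != tj:
                        sup[(ti, tj)] += w / gap   # summing is an assumption
    return dict(sup)

def brother_relations(root):
    """sim(t_i, t_j): mean weight of two distinct terms sharing a level-1 node."""
    sim = defaultdict(float)
    for node in all_nodes(root):
        if node.level != 1:
            continue
        for ti, tj in combinations(sorted(node.weights), 2):
            sim[(ti, tj)] += (node.weights[ti] + node.weights[tj]) / 2
    return dict(sim)

# Toy document (hypothetical weights): title > section title > paragraph.
paragraph = Node(1, {"frog": 0.4, "toad": 0.2})
section = Node(2, {"amphibians": 0.5}, [paragraph])
title = Node(3, {"animal": 0.9}, [section])

sup = sup_relations(title)
sim = brother_relations(title)
```

Because the weight is divided by the level gap, "animal" is a weaker ancestor of "frog" (0.4 / 2) than "amphibians" is (0.4 / 1), which matches the intuition that closer titles give more specific hypernyms.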
  • 10. The proposed approach 4. Semantic Relations Inferring – Taxonomy Construction • Select the root: $RootWeight(t) = \sum_i Sup(t, t_i)$ • Build the other levels by using the “Sup” relation – Similarity Measuring and Points of View Study • The LLR score: $LLR(u, v) = -2 \log \frac{L(O_{11}, C_1, r) \cdot L(O_{12}, C_2, r)}{L(O_{11}, C_1, r_1) \cdot L(O_{12}, C_2, r_2)}$, where $L(k, n, r) = r^k (1 - r)^{n-k}$, $r = R_1 / N$, $r_1 = O_{11} / C_1$, $r_2 = O_{12} / C_2$ • The contingency table: row $t_2 = u$: $O_{11}$ ($t_1 = v$), $O_{12}$ ($t_1 \neq v$); row $t_2 \neq u$: $O_{21}$ ($t_1 = v$), $O_{22}$ ($t_1 \neq v$)
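A possible Python reading of the RootWeight and LLR formulas of this slide is sketched below. The slide does not define $C_1$, $C_2$, $R_1$ and $N$; the sketch assumes they are the column totals, the first row total and the grand total of the contingency table, as in the usual log-likelihood ratio test. Choosing the root as the term with the highest RootWeight is also an assumption, and all function names are hypothetical.

```python
import math

def root_weight(sup):
    """RootWeight(t) = sum over t_i of Sup(t, t_i); `sup` maps (t, t_i) -> score."""
    totals = {}
    for (t, _), score in sup.items():
        totals[t] = totals.get(t, 0.0) + score
    return totals

def log_l(k, n, r):
    """log of L(k, n, r) = r**k * (1 - r)**(n - k), with 0 * log(0) taken as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, r) + xlogy(n - k, 1 - r)

def llr(o11, o12, o21, o22):
    """LLR(u, v) from the 2x2 contingency table (non-empty columns assumed)."""
    c1, c2 = o11 + o21, o12 + o22      # column totals (assumed meaning of C1, C2)
    n = c1 + c2                        # grand total (assumed meaning of N)
    r = (o11 + o12) / n                # R1 / N, with R1 the first row total
    r1, r2 = o11 / c1, o12 / c2
    return -2 * (log_l(o11, c1, r) + log_l(o12, c2, r)
                 - log_l(o11, c1, r1) - log_l(o12, c2, r2))

# Toy "Sup" scores (hypothetical values).
sup = {("animal", "amphibians"): 0.5, ("animal", "frog"): 0.2,
       ("amphibians", "frog"): 0.4}
ranking = root_weight(sup)
root = max(ranking, key=ranking.get)   # assumed selection rule
```

Working with logarithms of $L$ rather than with the products themselves avoids numerical underflow when the counts are large. The score is 0 when the two terms are distributed independently (for example `llr(10, 10, 10, 10)`) and grows as their co-occurrence departs from independence.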
  • 11. Experimentation and Results • The corpus: 182 Web pages = 11.61 MB • Preparing the corpus: lemmatisation, stop words and MWT handling: integrate a linguistic toolbox (Bounhas and Slimani, 2009b) • Evaluation metrics: precision, recall and F-measure in terms of the extracted relations. • The propagation threshold [Figure: recall, precision and F-measure plotted for threshold values from 0 to 0,8]
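As an illustration of how precision, recall and F-measure can be computed over extracted relations, here is a minimal Python sketch. The relation pairs and the reference set are invented for the example, and the balanced F-measure (F1) is an assumption, since the slide does not say which variant was used.

```python
def evaluate(extracted, reference):
    """Return (precision, recall, f_measure) for two collections of relations."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical (hypernym, hyponym) pairs.
extracted = {("animal", "amphibians"), ("amphibians", "frog"), ("frog", "animal")}
reference = {("animal", "amphibians"), ("amphibians", "frog"),
             ("amphibians", "toad"), ("animal", "insect")}

precision, recall, f_measure = evaluate(extracted, reference)
```

Here two of the three extracted relations are correct and two of the four reference relations are found, so precision is 2/3 and recall is 1/2.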
  • 12. Experimentation and Results
      – Hypernyms of «أسد البحر» (sea lion), term: score
        برمائيات (amphibians): 5,25; حيوانات مفترسة (carnivores): 2,08; حيوان (animal): 0,02
      – Hyponyms of «برمائيات» (amphibians), term: score
        سمندر (salamander): 9,04; ضفدع (frog): 9,04; علجوم (toad): 9,04; علجوم أخضر (green toad): 9,04; علجوم سورينام (Surinam toad): 9,04; فيل البحر (elephant seal): 9,04; أسد البحر (sea lion): 5,26; فقمة (seal): 5,26
      – Terms ranked by the RootWeight score, term: RootWeight
        حيوان (animal): 293,30; حبليات (chordates): 54,49; حيوانات مائية (marine animals): 50,80; حشرة (insect): 46,77; حيوانات مفترسة (carnivores): 41; حيوانات برية (terrestrial animals): 38; حيوانات عاشبة (herbivores): 32; حافريات (ungulates): 25,26; رأسيات الرجل (cephalopods): 25; فقاريات (vertebrates): 24,37; ثانويات الفم (deuterostomes): 22,69; بقريات (bovids): 16,26; ذوات القوقعة النابية (tusk shells): 15; …
  • 13. Experimentation and Results: Similarity Measuring and Points of View Study
      Similarity matrix (values in each row follow the same order as the rows: marine animals, carnivores, terrestrial animals, herbivores, amphibians):
      – حيوانات مائية (marine animals): 166,96; 10,18; 0; 14,45; 0
      – حيوانات مفترسة (carnivores): 10,18; 141,45; 0,54; 0; 0,7
      – حيوانات برية (terrestrial animals): 0; 0,54; 62,66; 3,94; 0
      – حيوانات عاشبة (herbivores): 14,45; 0; 3,94; 44,44; 0,2
      – برمائيات (amphibians): 0; 0,7; 0; 0,2; 51,11
      • Two possible classifications:
      – حيوانات مائية (marine animals), حيوانات برية (terrestrial animals), برمائيات (amphibians)
      – حيوانات مفترسة (carnivores), حيوانات عاشبة (herbivores)
  • 14. Conclusion and future work • Results: – A model for semi-structured document analysis and indexing – An approach for structure-based terminology extraction and organisation → a toolbox for document analysis and knowledge extraction • Future work – Export knowledge as an ontology – How to model queries? – What to retrieve: a document, a section, a paragraph, etc.? – How to evaluate results?