Using Semantic and Domain-based
Information in CLIR Systems
Alessio Bosca (2), Matteo Casu (2), Chiara Di Francescomarino (1),
Mauro Dragoni (1)
(1) Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)
(2) CELI s.r.l.
https://shell.fbk.eu/index.php/Mauro_Dragoni - dragoni@fbk.eu
11th Extended Semantic Web Conference 2014 – May 27th, 2014
Background – CLIR: 3 Scenarios…
• The document collection is monolingual, but users can formulate
  queries in more than one language.
• The document collection contains documents in multiple
  languages and users can query the entire collection in one or
  more languages.
• The document collection contains documents with mixed-language
  content and users can query the entire collection in one
  or more languages.
Background – … and 2 strategies
• Model dependent
  • Translation and retrieval are integrated in a uniform framework
• Model independent
  • Translation and retrieval are treated as separate processes
Background - Challenges
• Out-of-Vocabulary issue
  • Improve the corpora used for training the machine translation model.
  • Use domain information to increase the coverage of the
    dictionaries.
• Use of semantic artifacts for structuring the representation of
  (multilingual) documents.
GOAL
To integrate domain-specific semantic knowledge within a
CLIR system and evaluate its effectiveness.
Our Scenario
• Use case: the agricultural domain
• Knowledge resources: Agrovoc and Organic.Lingua ontologies
• 3 components used in the proposed approach:
  • Annotator
  • Indexer
  • Retriever
Annotation Process – Step 1
[Figure: per-language label indexes (en, es, it, de, fr, …)]
Annotation Process – Step 2
[Figure: per-language label indexes (en, es, it, de, fr, …)]
• Document content is used as a query.
• Among the candidate results, only “exact matches” are
  considered.
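The two annotation steps can be illustrated with a minimal sketch. It assumes a toy label table (the URIs and labels below are hypothetical, not taken from Agrovoc or Organic.Lingua) and it skips the retrieval over the label indexes, checking each label directly against the text; only the "exact match" filter is shown.

```python
import re
from collections import defaultdict

# Hypothetical multilingual labels: concept URI -> {language code: label}.
LABELS = {
    "http://example.org/concept/soil": {"en": "soil", "it": "suolo"},
    "http://example.org/concept/crop_rotation": {
        "en": "crop rotation",
        "it": "rotazione delle colture",
    },
}


def build_label_indexes(labels):
    """Step 1: index the ontology labels separately for each language."""
    indexes = defaultdict(dict)
    for uri, by_lang in labels.items():
        for lang, label in by_lang.items():
            indexes[lang][label.lower()] = uri
    return indexes


def annotate(text, lang, indexes):
    """Step 2: keep only the labels that occur verbatim, as whole words, in the text."""
    lowered = text.lower()
    found = set()
    for label, uri in indexes.get(lang, {}).items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.add(uri)
    return sorted(found)


indexes = build_label_indexes(LABELS)
annotations = annotate("Crop rotation improves soil health.", "en", indexes)
```

Whole-word matching is a design choice of this sketch: "soils" does not match the label "soil", which is one plausible reading of "exact match".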
Approach – Annotation Stats
Domain Ontology     | Number of Concepts | Manual Annotations                 | Automatic Annotations
Agrovoc (AV)        | 32061              | 0                                  | 133596 (5834 distinct concepts used)
Organic.Lingua (OL) | 291                | 27871 (264 distinct concepts used) | 16434 (208 distinct concepts used)
Approach - Index
• Given a document:
  • Text and annotations are extracted.
  • The context of each concept is retrieved from the ontologies.
  • Each contextual concept is indexed with a weight that depends on
    its semantic distance from the annotated concept.
• Structure of each index record:
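The record diagram from the slide is not reproduced in this transcript, and the code below is not that structure. It is only a hypothetical sketch of the context-weighting idea: the ontology fragment, the field names `text` and `concepts`, and the halving of the weight at each step away from the annotated concept are all assumptions made for illustration.

```python
# Hypothetical ontology fragment: concept -> parents / children.
PARENTS = {"wheat": ["cereal"], "cereal": ["crop"]}
CHILDREN = {"wheat": ["durum wheat"], "cereal": ["wheat"], "crop": ["cereal"]}


def context_weights(concept, max_distance=2, decay=0.5):
    """Return {concept: weight} for an annotation and its context.

    The annotated concept gets weight 1.0 and a concept at distance d
    gets decay ** d. This decay schedule is an assumption of the sketch,
    not the authors' formula.
    """
    weights = {concept: 1.0}
    frontier = [concept]
    for distance in range(1, max_distance + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in PARENTS.get(node, []) + CHILDREN.get(node, []):
                if neighbour not in weights:
                    weights[neighbour] = decay ** distance
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return weights


# A hypothetical index record: document text plus weighted concepts.
record = {
    "text": "Wheat yields under organic management",
    "concepts": context_weights("wheat"),
}
```

Setting `max_distance=1` restricts the context to direct parents and children only.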
Approach - Retriever
• Three retrieval configurations available:
  • Only translations: query terms are translated by using machine
    translation services.
  • Semantic expansion by exploiting the domain ontology: query terms
    are matched with ontology concepts; if an exact match exists, the query
    is expanded by using the URI of the concept and the URIs of the
    contextual ones.
  • Ontology matching only: terms not having an exact match with
    ontology concepts are discarded.
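The three configurations can be contrasted in a short sketch. The translation table, the label-to-URI map, and the function names are hypothetical; the sketch also assumes that semantic expansion is applied on top of the translated terms, which the slide does not state explicitly.

```python
# Hypothetical resources: a toy translation table, a label -> URI map for
# labels in the query's original language, and a concept context map.
TRANSLATIONS = {"suolo": "soil", "fertile": "fertile"}
LABEL_TO_URI = {"suolo": "http://example.org/concept/soil"}
CONTEXT = {"http://example.org/concept/soil": ["http://example.org/concept/land"]}


def translate(terms):
    """Stand-in for an external machine translation service."""
    return [TRANSLATIONS.get(term, term) for term in terms]


def concept_uris(terms):
    """URIs of exactly matched concepts followed by their contextual concepts."""
    uris = []
    for term in terms:
        uri = LABEL_TO_URI.get(term)
        if uri is not None:
            uris.append(uri)
            uris.extend(CONTEXT.get(uri, []))
    return uris


def build_query(terms, mode):
    """Build the query for one of the three retrieval configurations."""
    if mode == "translation":
        return translate(terms)
    if mode == "expansion":
        return translate(terms) + concept_uris(terms)
    if mode == "ontology":
        # Terms without an exact ontology match contribute nothing.
        return concept_uris(terms)
    raise ValueError(f"unknown mode: {mode}")


query_terms = ["suolo", "fertile"]
```

Calling `build_query(query_terms, "ontology")` drops "fertile" entirely, which is what makes that configuration useful for isolating the contribution of the ontology.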
Evaluation - Setup
• Collection of 13,000 multilingual documents.
• 48 queries originally provided in English and manually translated
  into 12 languages under the supervision of both domain and
  language experts.
• Gold standard manually built by the domain experts.
• Metrics used: MAP, Prec@5, Prec@10, Prec@20, and Recall.
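For reference, the metrics listed above can be computed as in the following sketch, which uses the standard textbook definitions. The document identifiers and the ranked list are made up for illustration and are unrelated to the evaluation collection.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Made-up example: five retrieved documents, three relevant ones overall.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
```

Here Prec@5 is 2/5, recall is 2/3, and average precision is (1/1 + 2/3) / 3, because the unretrieved relevant document `d6` still counts in the denominator.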
Results - 1
Configuration     | Avg. MAP | Prec@5 | Prec@10 | Prec@20 | Avg. Rec.
BASELINE          | 0.554    | 0.617  | 0.545   | 0.465   | 0.920
Auto: AV          | 3.24%    | 3.11%  | 5.04%   | 3.81%   | 2.52%
Auto: OL          | 2.31%    | 1.91%  | 2.88%   | 2.98%   | 0.77%
Auto: AV+OL       | 3.13%    | 2.95%  | 4.63%   | 3.86%   | 2.53%
Auto+Man: OL      | 1.65%    | 3.40%  | 3.95%   | 4.48%   | 1.37%
Auto+Man: AV+OL   | 4.38%    | 5.96%  | 7.18%   | 6.07%   | 2.97%
Auto+Man*2: OL    | 1.00%    | 3.30%  | 4.02%   | 3.27%   | 1.36%
Auto+Man*2: AV+OL | 3.29%    | 4.86%  | 6.73%   | 6.03%   | 2.97%
Results - 2
Ontology | Query Cov.      | Avg. MAP | Prec@5 | Prec@10 | Prec@20 | Avg. Rec.
AV       | 39.3 (9 langs)  | 0.137    | 0.189  | 0.191   | 0.179   | 0.552
OL       | 15.7 (10 langs) | 0.260    | 0.359  | 0.319   | 0.322   | 0.635
AV + OL  | 33.3 (12 langs) | 0.173    | 0.247  | 0.226   | 0.221   | 0.586
Conclusions
• The use of domain-specific ontologies leads to an improvement in the
  effectiveness of CLIR systems.
• Find the right trade-off between the effort of manually annotating
  documents and the system effectiveness.
Future work:
• Improve the automatic annotation component.
• Move to a more complex semantic representation of information in order
  to answer more complex queries.
References:
• www.organic-edunet.eu: the portal
• www.organic-lingua.eu/en/outcomes/deliverables: the data


Editor's Notes

  1. Model-independent approaches treat translation and retrieval as two separate processes. The queries or the documents are first translated into the language of the documents or the queries, respectively. Monolingual IR models are then applied directly. A typical and widely used approach of this type is the machine translation (MT) approach, which employs MT systems to translate the queries or documents before the monolingual retrieval process. Model-dependent methods integrate the translation and retrieval processes in a uniform framework. These methods, developed in the context of language models, have the advantage of accounting better for the uncertainty of translation during retrieval. The main difference is that in model-independent approaches the result of the translation process is taken as it is, while in model-dependent approaches the uncertainty associated with each translation is considered during the retrieval phase, so the final ranking reflects it.
  2. In this work we do not focus on developing a new statistical machine translation component to be integrated into CLIR systems, but on how to exploit the multilingual information available in domain-specific semantic knowledge to enrich document representation and improve retrieval effectiveness. We also evaluate that effectiveness.
  3. In the preliminary step, all labels are extracted from the ontologies and indexed separately by language.
  4. In the second step, the content of each document is used as a query over the different indexes, which produces a ranked list for each language present in the document (not for every language for which labels exist). The languages contained in the document may be obtained in two different ways: (1) the textual content is tagged with the language code, or (2) a Language Identifier component is used. Each label among the results is then searched for within the document content, and only exact matches are accepted as annotations. The URIs of the accepted concepts are added to the document representation and indexed together with the document content.
  5. Here are some statistics about the automatic and manual annotations. Note the different sizes of the ontologies, as well as the number of manual annotations.
  6. For each annotation, the “context” of the concept, identified by its parents and children, is extracted from the ontology and indexed with a weight proportional to its semantic distance from the concept identified by the annotation. In our case, the weight decreases …
  7. The system implements three different configurations:
      - Queries are translated by using available MT services (Google Translate and Microsoft Bing). The rationale is the high coverage of their dictionaries (information about the coverage statistics).
      - Queries are expanded with URIs coming from the ontology: for each term, the system looks for an exact match in the ontology (in the same way as in the indexing process) and, if one is found, the query is expanded with the concept label. The match with the ontology is done by considering the query terms in their original language.
      - Queries are transformed into their semantic representation by considering only the terms that have a match in the ontology. The goal of this configuration is to verify the impact of the ontology alone on the effectiveness of the system.
  8. Queries were created starting from an analysis of query logs, and were selected so as to avoid similar queries and to cover as many topics as possible. Each query was manually translated into each available language, and each translation was validated by language experts. (The paper says 8 languages: is that an error, or is there another reason?) http://googletranslate.blogspot.it/2012/04/breaking-down-language-barriersix-years.html http://research.microsoft.com/en-us/projects/mt/
  9. Observations on the first set of results:
      - The baseline has a high average recall, which means that using the freely available translation services does not negatively affect the retrieval of relevant documents (OOV terms challenge).
      - Agrovoc covers more of the domain, which is why it yields a larger improvement in system effectiveness.
      - The use of the manual annotations significantly boosts system effectiveness.
      - Giving the manual annotations a double weight does not lead to further improvements.
  10. Goal: verify the impact of the ontology on document retrieval.
      - The recall shows that using the ontologies alone does not allow a significant number of relevant documents to be retrieved.
      - Not all queries can be transformed into their semantic representation.
      - In general the performance is quite poor, which means that the overfitting between the document collection, the queries, and the ontologies is limited.
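
The per-language label indexes, the exact-match annotation step, and the two ontology-based query configurations described in the notes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the toy ontology, the example.org URIs, and the function names (normalize, build_label_indexes, annotate, expand_query, semantic_query) are all hypothetical, and a plain in-memory dictionary stands in for a real search index. The sketch appends concept URIs when expanding a query; the notes also mention expanding with the concept label, so treat that choice as an assumption too.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology: concept URI -> {language code: label}.
ONTOLOGY = {
    "http://example.org/concept/soil": {
        "en": "soil", "it": "suolo", "de": "Boden",
    },
    "http://example.org/concept/organic-farming": {
        "en": "organic farming", "it": "agricoltura biologica",
    },
}


def normalize(text):
    """Lowercase the text and join its word tokens with single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def build_label_indexes(ontology):
    """Preliminary step: build one label -> URIs mapping per language."""
    indexes = defaultdict(lambda: defaultdict(set))
    for uri, labels in ontology.items():
        for lang, label in labels.items():
            indexes[lang][normalize(label)].add(uri)
    return indexes


def annotate(text, lang, indexes):
    """Return URIs of concepts whose label occurs as whole words in the text."""
    padded = f" {normalize(text)} "
    found = set()
    for label, uris in indexes.get(lang, {}).items():
        # Only exact (whole-word) occurrences of the label are accepted.
        if f" {label} " in padded:
            found |= uris
    return found


def expand_query(query, lang, indexes):
    """Expansion configuration: original terms plus the matched concept URIs."""
    return normalize(query).split() + sorted(annotate(query, lang, indexes))


def semantic_query(query, lang, indexes):
    """Semantic-only configuration: keep just the matched concept URIs."""
    return sorted(annotate(query, lang, indexes))
```

For instance, annotate("Organic farming improves soil quality.", "en", build_label_indexes(ONTOLOGY)) returns both toy URIs, while a text containing only "soils" returns nothing, because only exact label matches count. A query with no matching term yields an empty semantic_query result, which mirrors the remark that not all queries can be transformed into their semantic representation.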