Cross-Language Information Retrieval (CLIR) systems extend classic information retrieval mechanisms to allow users to query across languages, i.e., to retrieve documents written in languages different from the language used for query formulation.
In this paper, we present a CLIR system that exploits multilingual ontologies to enrich document representations with multilingual semantic information during the indexing phase and to map query fragments to concepts during the retrieval phase.
This system has been applied to a domain-specific document collection, and the contribution of the ontologies to the CLIR system has been evaluated in conjunction with both the Microsoft Bing and Google Translate translation services.
Results demonstrate that the use of domain-specific resources leads to a significant improvement of CLIR system performance.
Using Semantic and Domain-based Information in CLIR Systems
1. Using Semantic and Domain-based Information in CLIR Systems
Alessio Bosca (2), Matteo Casu (2), Chiara Di Francescomarino (1), Mauro Dragoni (1)
(1) Fondazione Bruno Kessler (FBK), Shape and Evolve Living Knowledge Unit (SHELL)
(2) CELI s.r.l.
https://shell.fbk.eu/index.php/Mauro_Dragoni - dragoni@fbk.eu
11th Extended Semantic Web Conference 2014 – May 27th, 2014
2. Background – CLIR: 3 Scenarios…
The document collection is monolingual, but users can formulate
queries in more than one language.
The document collection contains documents in multiple
languages and users can query the entire collection in one or
more languages.
The document collection contains documents with mixed-
language content and users can query the entire collection in one
or more languages.
3. Background – … and 2 strategies
Model dependent
Translation and retrieval are integrated in a uniform framework
Model independent
Translation and retrieval are treated as separate processes
4. Background - Challenges
Out-of-Vocabulary issue
improve the corpora used for training the machine translation model.
usage of domain information for increasing the coverage of the
dictionaries.
Usage of semantic artifacts for structuring the representation of
(multilingual) documents.
GOAL
to integrate domain-specific semantic knowledge within a
CLIR system and evaluate its effectiveness
5. Our Scenario
Use case: the agricultural domain
Knowledge resources: Agrovoc and Organic.Lingua ontologies
3 components used in the proposed approach:
Annotator
Indexer
Retriever
7. [Figure: language codes en, es, it, de, fr, ….]
Document content is used as query.
Among the candidate results, only “exact matches” are
considered.
Annotation Process – Step 2
9. Approach - Index
Given a document:
Text and annotations are extracted.
The context of each concept is retrieved from the ontologies.
Each contextual concept is indexed with a weight proportional
to its semantic distance from the semantic annotation.
Structure of each index record:
10. Approach - Retriever
Three retrieval configurations available:
Only translations: query terms are translated by using machine
translation services.
Semantic expansion by exploiting the domain ontology: query terms
are matched with ontology concepts; if an exact match exists, query
is expanded by using the URI of the concept and the URIs of the
contextual ones.
Ontology matching only: terms not having an exact match with
ontology concepts are discarded.
11. Evaluation - Setup
Collection of 13,000 multilingual documents.
48 queries originally provided in English and manually translated
into 12 languages under the supervision of both domain and
language experts.
Gold standard manually built by the domain experts.
Evaluation measures: MAP, Prec@5, Prec@10, Prec@20, and Recall.
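
The measures listed above can be sketched as follows. This is a minimal illustration using the standard textbook definitions; the slides do not state the exact formulas or cut-offs used in the evaluation, so the normalization choices (and all function names) are assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / k


def recall(ranked, relevant):
    """Fraction of the relevant documents that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Assumption: normalized by the total number of relevant documents,
    so relevant documents that are never retrieved count as zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data, invented for illustration only.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p5 = precision_at_k(ranked, relevant, 5)
rec = recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

With a gold standard built per query, `mean_average_precision` would be called once with one `(ranked, relevant)` pair for each of the queries.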
14. Conclusions
The use of domain-specific ontologies leads to an improvement of CLIR
system effectiveness.
Find the right trade-off between the effort of manually annotating
documents and the system effectiveness
Future work:
Improve the automatic annotation component
Move to a more complex semantic representation of information in order
to answer more complex queries.
References:
www.organic-edunet.eu: the portal
www.organic-lingua.eu/en/outcomes/deliverables: the data
Model-independent approaches treat translation and retrieval as two separate processes. Either the queries are first translated into the language of the documents, or the documents are translated into the language of the queries. Monolingual IR models are then applied directly. A typical and widely used approach of this type is the machine translation (MT) approach, which employs MT systems to translate the queries or documents before the monolingual retrieval process.
Model-dependent methods integrate the translation and retrieval processes in a uniform framework. These methods, developed in the context of language models, have the advantage of better accounting for the uncertainty of translation during retrieval.
The main difference is that in the model-independent approaches the result of the translation process is taken as it is, while in the model-dependent ones the uncertainty associated with each translation is considered during the retrieval phase and therefore affects the final ranking.
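
The contrast between the two strategies can be illustrated with a deliberately simplified sketch. It is not taken from the talk: the translation table, its probabilities, and the term-frequency scoring function are all invented for illustration, and real model-dependent methods are formulated within language models rather than with this toy score.

```python
# Hypothetical translation candidates with probabilities (English -> Italian).
TRANSLATIONS = {
    "field": [("campo", 0.7), ("settore", 0.3)],
}


def tf_score(term, doc_tokens):
    """Toy relevance score: relative frequency of the term in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def score_model_independent(query_terms, doc_tokens):
    """Translate first (keep only the best candidate), then retrieve."""
    score = 0.0
    for q in query_terms:
        best_term, _ = max(TRANSLATIONS[q], key=lambda cand: cand[1])
        score += tf_score(best_term, doc_tokens)
    return score


def score_model_dependent(query_terms, doc_tokens):
    """Weight every candidate by its translation probability while scoring."""
    score = 0.0
    for q in query_terms:
        score += sum(p * tf_score(t, doc_tokens) for t, p in TRANSLATIONS[q])
    return score


doc = ["il", "settore", "agricolo", "cresce"]
independent = score_model_independent(["field"], doc)
dependent = score_model_dependent(["field"], doc)
```

In this toy case the document contains only the less likely translation, so the translate-then-retrieve score is zero, while the probability-weighted score still gives the document some credit.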
In this work we don’t focus on studying a new statistical machine translation component to be integrated into CLIR systems, but on how to exploit the multilingual information available in domain-specific semantic knowledge to enrich the document representation and improve retrieval effectiveness.
We also evaluate this effectiveness.
In the preliminary step, all labels are extracted from the ontologies and indexed separately by language.
In the second step, the content of each document is used as a query over the different indexes, which produces a ranking for each language present in the document (not for all the languages for which labels exist).
The language of a document may be obtained in two different ways: (1) the textual content is tagged with the language code, or (2) a Language Identifier component is used.
Among the results, each label is searched for within the document content, and only exact matches are accepted as annotations.
The URIs of the accepted concepts are added to the document representation and indexed together with the document content.
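
The annotation steps above can be sketched roughly as follows. This is a simplified illustration, not the system's implementation: the ontology entries and URIs are invented, plain dictionaries stand in for the per-language indexes, and the candidate-retrieval step (using the document as a query) is skipped, so every label of the document's language is checked directly for an exact match.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology: concept URI -> {language code: label}.
ONTOLOGY_LABELS = {
    "http://example.org/concept/compost": {
        "en": "compost", "it": "compost", "de": "Kompost",
    },
    "http://example.org/concept/crop_rotation": {
        "en": "crop rotation", "it": "rotazione delle colture",
    },
}


def build_label_indexes(ontology_labels):
    """Preliminary step: group the labels by language, one index per language."""
    indexes = defaultdict(dict)
    for uri, labels in ontology_labels.items():
        for lang, label in labels.items():
            indexes[lang][label.lower()] = uri
    return indexes


def annotate(text, lang, indexes):
    """Return the URIs whose label occurs verbatim (as whole words) in the text."""
    lowered = text.lower()
    found = []
    for label, uri in indexes.get(lang, {}).items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            found.append(uri)
    return sorted(found)


indexes = build_label_indexes(ONTOLOGY_LABELS)
en_annotations = annotate("Crop rotation improves soil; compost helps too.", "en", indexes)
de_annotations = annotate("Der Kompost ist reif.", "de", indexes)
```

The `lang` argument corresponds to the language code obtained from the document tag or from a language identifier; only the index for that language is consulted.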
Here are some statistics about the automatic and manual annotations.
Note the different sizes of the ontologies, as well as the number of manual annotations.
For each annotation, the “context” of the concept, identified by its parents and children, is extracted from the ontology and indexed with a weight proportional to its semantic distance from the concept identified by the annotation.
In our case, the weight decreases …
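
A minimal sketch of this context weighting is given below. The notes do not specify how the weight decreases, so the exponential decay (`decay ** distance`), the `max_distance` parameter, and the toy hierarchy are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy hierarchy: concept -> list of its parents.
PARENTS = {
    "compost": ["organic_fertilizer"],
    "organic_fertilizer": ["fertilizer"],
    "fertilizer": [],
    "vermicompost": ["compost"],
}


def neighbours(concept):
    """Parents and children of a concept in the toy hierarchy."""
    children = [c for c, parents in PARENTS.items() if concept in parents]
    return PARENTS.get(concept, []) + children


def context_weights(annotation, max_distance=1, decay=0.5):
    """Breadth-first walk from the annotated concept.

    Assumption: the weight of a contextual concept is decay ** distance,
    so it shrinks as the concept gets farther from the annotation.
    """
    weights = {annotation: 1.0}
    queue = deque([(annotation, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_distance:
            continue
        for nb in neighbours(node):
            if nb not in weights:
                weights[nb] = decay ** (dist + 1)
                queue.append((nb, dist + 1))
    return weights


direct_context = context_weights("compost")
wider_context = context_weights("compost", max_distance=2)
```

With the default `max_distance=1`, only the parents and children of the annotated concept are indexed, which matches the notion of context described above; larger values are shown only to make the decay visible.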
The system implements three different configurations:
Queries are translated by using available MT services (Google Translate and Microsoft Bing). The rationale is the high coverage of their dictionaries (information about the coverage statistics).
Queries are expanded with URIs coming from the ontology: for each term, the system looks for an exact match in the ontology (in the same way as in the indexing process) and, if found, the query is expanded with the concept label. The match with the ontology is done by considering the query terms in their original language.
Queries are transformed into their semantic representation by considering only terms that have a match with the ontology. The goal of this configuration is to verify the sole impact of the ontology on the effectiveness of the system.
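
The three configurations can be sketched as a single query-building function. Everything here is illustrative: the lookup table stands in for the MT services, the concept and context tables use invented URIs, the configuration names are made up, and the assumption that semantic expansion appends URIs to the translated terms is an interpretation of the description above.

```python
# Hypothetical stand-ins for the MT service and the ontology lookups.
TOY_MT = {("compost", "it"): "compost", ("benefits", "it"): "benefici"}
CONCEPTS = {"compost": "uri:compost"}                      # label -> concept URI
CONTEXT = {"uri:compost": ["uri:organic_fertilizer"]}      # URI -> contextual URIs


def build_query(terms, config, target_lang="it"):
    """Build the list of query tokens for one retrieval configuration."""
    # Unknown terms are kept untranslated, mimicking an out-of-vocabulary case.
    translated = [TOY_MT.get((t, target_lang), t) for t in terms]

    # Ontology matching uses the terms in their original language.
    uris = []
    for t in terms:
        uri = CONCEPTS.get(t)
        if uri is not None:
            uris.append(uri)
            uris.extend(CONTEXT.get(uri, []))

    if config == "translation_only":
        return translated
    if config == "semantic_expansion":
        return translated + uris
    if config == "ontology_only":
        return uris  # terms without an exact match are discarded
    raise ValueError(f"unknown configuration: {config}")


terms = ["compost", "benefits"]
q_translation = build_query(terms, "translation_only")
q_expanded = build_query(terms, "semantic_expansion")
q_ontology = build_query(terms, "ontology_only")
```

A query whose terms match no concept yields an empty list under `ontology_only`, which illustrates why not every query can be transformed into a semantic representation.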
Queries were created starting from an analysis of query logs and were selected so as to avoid similar queries and to cover as many topics as possible.
Each query has been manually translated into each available language, and each translation has been validated by language experts.
(The paper says 8 languages; is that an error, or is there another reason?)
http://googletranslate.blogspot.it/2012/04/breaking-down-language-barriersix-years.html
http://research.microsoft.com/en-us/projects/mt/
- The baseline shows a high average recall, which means that the use of freely available translation services doesn’t negatively affect the retrieval of relevant documents (OOV terms challenge)
- Agrovoc has broader coverage of the domain, which is why it yields a larger improvement in system effectiveness
- The use of the manual annotations significantly boosts the system effectiveness
- The use of a double weight doesn’t lead to further improvements of the system
Goal: verify the impact of the ontology on document retrieval
Looking at the recall, note that the use of ontologies by itself doesn’t allow a significant number of relevant documents to be retrieved
Not all queries can be transformed into their semantic representation
In general the performance is quite poor, which means that the overfitting between the document collection, the queries, and the ontologies is limited