1/37
Cross-lingual information retrieval
Shadi Saleh
Institute of Formal and Applied Linguistics
Charles University
saleh@ufal.mff.cuni.cz
27 Nov. 2017
2/37
Information Retrieval Task
Definition
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
source: trec.nist.gov
3/37
Information Retrieval Task
4/37
Information Retrieval Task
The heat-map study (the "golden triangle") was conducted by Enquiro, Eyetools, and Didit with search engine users.
5/37
Monolingual IR system structure
6/37
IR evaluation
The IR system returns a ranked list of documents (scored by degree of relevance)
Users are interested in the top k documents
Development:
Set of documents
Set of training/test queries
Metric: P@10, the percentage of relevant documents among the top 10 retrieved
How are documents judged relevant/irrelevant? The assessment process
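To make the metric concrete, here is a minimal sketch (not from the slides) of computing P@10; the document ids are invented for illustration, and note that the slides report P@10 multiplied by 100.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_doc_ids) / k

# Example: 5 of the top 10 retrieved documents are relevant.
ranked = ["d3", "d7", "d1", "d9", "d2", "d8", "d5", "d4", "d6", "d0"]
relevant = {"d3", "d1", "d2", "d5", "d0"}
print(precision_at_k(ranked, relevant))  # 0.5, i.e. 50.00 in the tables below
```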
7/37
Data & tools
CLEF eHealth 2015 IR task document collection (corpus)
For searching: queries from the CLEF eHealth IR tasks 2013–2015, 166 queries in total
Queries in 2013 and 2014 were created by medical experts
In 2015, queries were created to simulate the way laypeople write queries
Randomly split into 100 queries for training and 66 for testing
Relevance assessment is done by medical experts
8/37
Sample query: CLEF 2013
<topic>
<id>qtest4</id>
<title>nausea and vomiting and hematemesis</title>
<desc>What are nausea, vomiting and hematemesis</desc>
<narr>What is the connection with nausea, vomiting and hematemesis</narr>
<profile>A 64-year old emigrant who is not sure what nausea, vomiting and hematemesis mean in his discharge summary</profile>
</topic>
9/37
Sample queries: CLEF 2015
<topic>
<id>clef2015.test.9</id>
<title>red itchy eyes</title>
</topic>
<topic>
<id>clef2015.test.16</id>
<title>red patchy bruising over legs</title>
</topic>
<topic>
<id>clef2015.test.44</id>
<title>nail getting dark</title>
</topic>
10/37
Assessment process
11/37
Monolingual experiment
Indexing and searching are done using Terrier (an open-source IR system)1
A set of tuning experiments
P@10: 47.10 (training set) and 50.30 (test set)
1 http://terrier.org
12/37
Cross-lingual IR problem
Definition
Cross-lingual information retrieval (CLIR) allows a user to pose a query in their native language and retrieve documents in a different language.
Czech query
Query: nevolnost a zvracení a hematemeze? ("nausea and vomiting and hematemesis")
13/37
Cross-lingual IR approaches: query translation
[Diagram: the user poses a query in Czech (CS); the MT system translates it into an EN query; the retrieval system runs the EN query against the index of English documents and returns a top-K ranked list.]
Query translation reduces the CLIR task to a monolingual task.
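A minimal sketch of the query-translation pipeline above; translate() and search() are hypothetical stand-ins for the MT system and the retrieval engine (Terrier in our experiments).

```python
def clir_query_translation(query_cs, translate, search, k=10):
    """Query translation CLIR: translate the source-language query into
    the collection language, then run ordinary monolingual retrieval."""
    query_en = translate(query_cs, src="cs", tgt="en")  # MT system
    return search(query_en)[:k]                         # top-K ranked list
```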
14/37
Cross-lingual data
The 166 English queries were translated by native medical experts into Czech, French, German, Hungarian, Polish, Spanish, and Swedish
The task thus reduces to monolingual IR: the same relevance data applies
15/37
Query translation experiment
Translate the queries in all languages into the collection language (English) using public online MT systems:
Google Translate
Bing Translator
Sys Czech French German Hungarian Polish Spanish Swedish (values are P@10)
Mono 50.30 50.30 50.30 50.30 50.30 50.30 50.30
Google 51.06 49.85 49.55 42.42 43.33 50.61 38.48
Bing 47.88 48.79 46.67 38.79 40.91 50.61 44.70
16/37
Baseline CLIR system
Translate queries into English using SMT systems developed by colleagues at UFAL
Trained to translate search queries (medical domain)
Returns a list of alternative translations (an N-best-list)
Sys Czech French German Hungarian Polish Spanish Swedish (values are P@10)
Mono 50.30 50.30 50.30 50.30 50.30 50.30 50.30
Baseline 45.76 47.88 42.58 40.76 36.82 44.09 36.67
Google 51.06 49.85 49.55 42.42 43.33 50.61 38.48
Bing 47.88 48.79 46.67 38.79 40.91 50.61 44.70
17/37
Reranking approach
Motivation
The single best translation returned by the SMT system is not necessarily the best one with respect to CLIR performance.
[Figure: histograms of the ranks (1-20) of the translation hypotheses with the highest P@10 for each training query, for Czech, French, and German.]
18/37
Reranking approach
The reranker is trained to select the best translation for CLIR performance
P@10 is the objective function (predict the translation that gives the highest P@10)
[Diagram: the Czech query "nevolnost a zvracení a hematemeze" is translated by the MT system into an N-best-list of EN hypotheses; the reranker selects one EN query, which the retrieval system runs against the index of English documents, returning a top-K ranked list.]
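A sketch of the reranking step, assuming a trained regression model with a scikit-learn-style predict() and a hypothetical extract_features() implementing the feature set described on the next slide.

```python
def rerank(nbest_en, source_query, model, extract_features):
    """Return the translation hypothesis with the highest predicted P@10."""
    best_hyp, best_score = None, float("-inf")
    for hyp in nbest_en:
        score = model.predict([extract_features(hyp, source_query)])[0]
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```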
19/37
Feature set
SMT scores: translation model, language model, and reordering model scores
Rank features: the SMT rank and a Boolean feature (1 for the best rank, 0 otherwise)
Features based on blind relevance feedback
IDF (inverse document frequency) from the collection
Translation pool
Retrieval status value (RSV)
Features based on external resources (UMLS1, Wikipedia)
1 The Unified Medical Language System: a large, multi-purpose, multilingual thesaurus containing millions of biomedical and health-related concepts
20/37
Training
100 queries for training, with 15-best-list hypotheses for each query.
Two approaches to training:
Language-specific: one model per language
Language-independent: one model for all languages
Leave-one-out cross-validation
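A sketch of the leave-one-out scheme, grouping the 15 hypotheses of each query so that a whole query is held out at a time; scikit-learn's LeaveOneGroupOut does exactly this, and LinearRegression is an assumption, since the slides only say "regression model".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut

def loo_by_query(X, y, query_ids):
    """X: feature vectors (one row per hypothesis), y: P@10 targets,
    query_ids: the query each hypothesis belongs to (15 rows per query)."""
    preds = np.empty(len(y), dtype=float)
    for train, test in LeaveOneGroupOut().split(X, y, groups=query_ids):
        model = LinearRegression().fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    return preds  # out-of-fold predictions for evaluating the reranker
```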
21/37
Reranker testing
Generate vectors of feature values for each query
The trained regression model predicts the hypothesis that gives the highest P@10
Run retrieval with that hypothesis as the query string
22/37
Results - test set
Results of the final evaluation on the test set queries
system Czech French German (values are P@10)
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
Reranker 50.15 51.06 45.30
Google 50.91 49.70 49.39
Bing 47.88 48.64 46.52
Improvements over the baseline: 9 queries in Czech, 15 in German, and 8 in French
Degradations: 2 in Czech, 4 in German, and 3 in French
23/37
System comparisons
Examples of translations of training queries, including the reference (ref), oracle (ora), baseline (base), and reranker-selected (best) translations. The scores in parentheses are per-query P@10 scores.
24/37
Adapting reranker to new languages
25/37
Queries in new languages
New SMT systems for Spanish, Hungarian, Polish, and Swedish were recently developed, also within the Khresmoi project.
Human experts translated the original English queries into these languages under the KConnect project.
We want to develop CLIR systems for these languages.
26/37
Adapting reranker
To adapt the reranker, two sources of data are used to create the training set:
Merged data from the existing languages (Czech, French, and German)
Data from each new language (Spanish, Hungarian, Polish, and Swedish)
These data are used to train language-independent models
27/37
Language-independent model performance
Final evaluation results of language-independent models on the test set
system Spanish Hungarian Polish Swedish (values are P@10)
Mono 50.30 50.30 50.30 47.10
Baseline 44.09 40.76 36.82 36.67
Reranker 46.36 43.18 36.67 38.79
28/37
Document translation
In recent years, SMT systems have improved significantly
Existing research on document translation (DT) is quite old!
We reinvestigate the research question of whether query translation (QT) is really better than DT
29/37
Document translation
Queries are posed by users in their own language
Translate the English collection into Czech, French, and German
Create a separate index for each language
Perform retrieval using the original query and the corresponding index
[Diagram: the MT system translates the English documents into Czech; the indexer builds a Czech index; the user's Czech query is run against that index by the retrieval system, which returns a ranked list of documents.]
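The document-translation side, sketched under the same hypothetical interfaces (translate_doc() and build_index() stand in for the MT system and the indexer): the collection is translated once per query language and indexed separately, so queries run unchanged.

```python
def build_translated_indexes(docs_en, languages, translate_doc, build_index):
    """One index per target language over the machine-translated collection."""
    return {lang: build_index([translate_doc(d, src="en", tgt=lang)
                               for d in docs_en])
            for lang in languages}

# At query time the Czech query is searched directly in the Czech index:
# indexes["cs"].search("nevolnost a zvracení a hematemeze")
```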
30/37
Morphological processing
Both queries and documents are processed as follows:
Translation into Czech, French, and German
Stemming using the Snowball stemmer1
Lemmatization using TreeTagger for French and German2 and MorphoDiTa for Czech3
1 http://snowball.tartarus.org/
2 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
3 http://ufal.mff.cuni.cz/morphodita
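For French and German, the stemming step can be reproduced with NLTK's wrappers around the Snowball stemmers (an assumption; the slides cite the Snowball project itself, and NLTK's standard Snowball set does not cover Czech). The example words are illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

stem_fr = SnowballStemmer("french")
stem_de = SnowballStemmer("german")

print([stem_fr.stem(w) for w in ["rouges", "vomissements"]])
print([stem_de.stem(w) for w in ["Erbrechen", "Blutungen"]])
```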
31/37
Results - Document Translation
Results of the final evaluation on the test set queries
system Czech French German (values are P@10)
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
DT 37.42 41.67 36.21
DT Stem 41.67 42.73 36.67
DT Lem 39.39 41.06 33.18
32/37
Query expansion
Users sometimes fail to create a query that represents their information need
Query expansion is the process of adding terms to the query (also called query reformulation)
Our approach is based on a machine learning model
33/37
Query expansion
Algorithm
Get the 20-best-list of translations for each query
Create a translation pool as a bag of words from these translations
Use the best translation as the original query
The model predicts the term that gives the highest P@10 when added to the original query
Features: IDF, TF (in the pool), similarity between the term and the query (word embeddings)
Expand the query with one term from the translation pool
Run retrieval with the expanded queries using our baseline setting.
The translation pool was limited for some queries, so it is expanded with terms from Wikipedia articles; a sketch of the expansion step follows below.
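A sketch of the expansion step, with a hypothetical term_features() implementing the features above (IDF, pool TF, embedding similarity) and a trained scoring model.

```python
def expand_query(base_query, translation_pool, model, term_features):
    """Add the single pool term predicted to yield the highest P@10."""
    candidates = [t for t in set(translation_pool)
                  if t not in base_query.split()]
    if not candidates:
        return base_query  # nothing usable in the pool
    best = max(candidates,
               key=lambda t: model.predict([term_features(t, base_query)])[0])
    return base_query + " " + best
```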
34/37
Results - test set
Results of the final evaluation on the test set queries
system Czech French German (values are P@10)
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
QE 42.12 46.21 37.88
35/37
Query expansion (QE) improved on average 10 queries over the baseline system; relevance assessments cover only 60% of the results so far, so we are waiting for the assessment to be completed.
35/37
Query expansion examples
Mono: white patchiness in mouth (P@10: 10.00)
Base: white coating mouth (P@10: 10.00)
Expanded: white coating mouth oral cavity (P@10: 70.00)
Mono: SOB (P@10: 50.00)
Base: dyspnoea (P@10: 60.00)
Expanded: dyspnoea rash breathing dyspnea (P@10: 70.00)
36/37
Conclusion and future work
Monolingual IR system evaluation and assessment
Cross-lingual IR approaches:
Query translation
Document translation and morphological analysis
Query expansion based on translation pool and Wikipedia
Reranking model to predict, for each query, which translation
hypothesis gives the highest P@10
Contribution to the CLIR community: releasing a dataset with high coverage of document/query pairs
37/37
Our publications
Shadi Saleh and Pavel Pecina. CUNI at the ShARe/CLEF eHealth Evaluation Lab 2014. In Working Notes of CLEF 2014 - Conference and Labs of the Evaluation Forum, Sheffield, UK, 2014.
Shadi Saleh, Feraena Bibyna, and Pavel Pecina. CUNI at the CLEF eHealth 2015 Task 2. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, CEUR-WS, Toulouse, France, 2015.
Shadi Saleh and Pavel Pecina. Adapting SMT Query Translation Reranker to New Languages in Cross-Lingual Information Retrieval. In Medical Information Retrieval (MedIR) Workshop, Association for Computational Linguistics, Pisa, Italy, 2016.
Shadi Saleh and Pavel Pecina. Reranking Hypotheses of Machine-Translated Queries for Cross-Lingual Information Retrieval. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 7th International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, 2016.
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In CLEF 2016 Working Notes, CEUR-WS, Evora, Portugal, 2016.
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In CLEF 2017 Working Notes, CEUR-WS, Dublin, Ireland, 2017.
