April 7, 2006
Natural Language Processing / Language Technology for the Web
Cross-Language Information Retrieval (CLIR)
Gouranga Charan Jena
Computer Science & Engg., KIIT University
Guide: Dr. Siddharth Swarup Rautaray
Cross Language Information Retrieval (CLIR)
Definition:
“A subfield of information retrieval dealing with retrieving
information written in a language different from the
language of the user's query.”
E.g., using Odia/Hindi queries to retrieve English documents.
Also called multi-lingual, cross-lingual, or trans-lingual IR.
Why CLIR?
E.g., on the web, we have:
 Documents in different languages
 Multilingual documents
 Images with captions in different languages
A single query should retrieve all such resources.
Approaches to CLIR

Approaches differ along two axes: what gets translated (the query, the documents, or both into an intermediate representation) and what resource drives the translation (knowledge-based vs. corpus-based):

 Query Translation (most efficient; most commonly used)
   Knowledge-based: dictionary/thesaurus-based translation
   Corpus-based: Pseudo-Relevance Feedback (PRF)
 Document Translation (infeasible for large collections)
   Knowledge-based: MT (rule-based)
   Corpus-based: MT (EBMT/StatMT)
 Intermediate Representation
   Knowledge-based: UNL (AgroExplorer)
   Corpus-based: Latent Semantic Indexing

The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.
Dictionary-based Query Translation

A source-language query (e.g., a Hindi query meaning "Ireland peace talks") is translated term-by-term using Hindi-English dictionaries, and the resulting English query is run against the collection. Two issues must be handled during lookup:
• phrase identification (multi-word expressions should be translated as units)
• words to be transliterated (names and other out-of-dictionary terms)
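A minimal sketch of the lookup step, assuming a toy in-memory dictionary (the romanized Hindi keys below are illustrative placeholders, not entries from a real lexicon); phrase identification and transliteration are omitted:

```python
# Dictionary-based query translation: replace each source term with
# ALL of its dictionary translations (the "all-translations" baseline
# evaluated later). Toy dictionary; keys are illustrative only.
HI_EN_DICT = {
    "shanti": ["peace", "calm", "tranquility", "silence", "quietude"],
    "vaarta": ["conversation", "talk", "negotiation", "tale"],
}

def translate_query(source_terms):
    translated = []
    for term in source_terms:
        # unknown terms pass through, as candidates for transliteration
        translated.extend(HI_EN_DICT.get(term, [term]))
    return translated

print(translate_query(["ireland", "shanti", "vaarta"]))
# ['ireland', 'peace', 'calm', 'tranquility', 'silence', 'quietude',
#  'conversation', 'talk', 'negotiation', 'tale']
```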
The problem with dictionary-based CLIR: ambiguity

Each source term expands into a whole set of candidate translations, one set per query term:

 {cosmic, outer-space}
 {incident, event, occurrence}
 {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}
 {lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve}
 {money, riches, wealth, appositive, property}
 {Ireland}
 {peace, calm, tranquility, silence, quietude}
 {conversation, talk, negotiation, tale}

… so filtering/disambiguation is required after query translation.
Disambiguation using
co-occurrence statistics
Hypothesis: correct translations of query terms will
co-occur and incorrect translations will tend not
to co-occur
Problem with counting co-occurrences: data sparsity

Full n-gram counts are useless for longer queries:
freq(Marathi Shallow Parsing CRFs),
freq(Marathi Shallow Structuring CRFs), and
freq(Marathi Shallow Analyzing CRFs)
are all zero. How, then, do we choose between parsing, structuring, and analyzing?
Pair-wise co-occurrence

Count candidate pairs instead. For the sets {cosmic, outer-space}, {incident, event, occurrence}, and {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}:
freq(cosmic incident) = 70800
freq(cosmic event) = 269000
freq(cosmic lessen) = 7130
freq(cosmic subside) = 3120
freq(outer-space incident) = 26100
freq(outer-space event) = 104000
freq(outer-space lessen) = 2600
freq(outer-space subside) = 980
Shallow Parsing, Structuring or Analyzing?

shallow parsing = 166000
shallow structuring = 180000
shallow analyzing = 1230000
CRFs parsing = 540
CRFs structuring = 125
CRFs analyzing = 765
Marathi parsing = 17100
Marathi structuring = 511
Marathi analyzing = 12200

The exact-phrase counts point to a collocation:
"shallow parsing" = 40700
"shallow structuring" = 11
"shallow analyzing" = 2

But raw pair counts are skewed by unigram frequency:
analyzing = 74100000
parsing = 40400000
structuring = 17400000
shallow = 33300000
Ranking senses using co-occurrence
statistics
 Use co-occurrence scores to calculate
similarity between two words: sim(x, y)
 Point-wise mutual information (PMI)
 Dice coefficient
 PMI-IR
$$\text{PMI-IR}(x, y) = \log \frac{hits(x \text{ AND } y)}{hits(x) \times hits(y)}$$
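As a quick sketch, PMI-IR can be computed directly from hit counts; in the example calls the joint counts come from the pair-wise slide above, but the unigram counts are invented for illustration:

```python
import math

def pmi_ir(hits_xy, hits_x, hits_y):
    """PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) ).
    Pairs that never co-occur get the worst possible score."""
    if hits_xy == 0:
        return float("-inf")
    return math.log(hits_xy / (hits_x * hits_y))

# Joint counts from the slide; unigram counts are made up.
print(pmi_ir(269_000, 500_000, 2_000_000))  # (cosmic, event)
print(pmi_ir(7_130, 500_000, 800_000))      # (cosmic, lessen)
# Only the relative ordering of these scores matters for ranking.
```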
Disambiguation algorithm
User's query: $q^s = \{q_1^s, q_2^s, \ldots, q_m^s\}$

For each source term $q_i^s$, let $S_i = \{w_{i,j}^t\}$ be its set of candidate translations.

1. Similarity of a candidate to another term's candidate set:
$$\mathrm{sim}(w_{i,j}^t, S_{i'}) = \sum_{\forall w_{i',l}^t \in S_{i'}} \mathrm{sim}(w_{i,j}^t, w_{i',l}^t)$$

2. Score of a candidate against all the other terms' sets:
$$\mathrm{score}(w_{i,j}^t) = \sum_{\forall i' \neq i} \mathrm{sim}(w_{i,j}^t, S_{i'})$$

3. Pick the best candidate for each term:
$$q_i^t = \arg\max_{w_{i,j}^t} \mathrm{score}(w_{i,j}^t)$$

Translated query: $q^t = \{q_1^t, q_2^t, \ldots, q_m^t\}$
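A compact sketch of these three steps, using the raw pair counts from the earlier slide as a stand-in for sim(x, y) (a real system would plug in PMI-IR or the Dice coefficient):

```python
def disambiguate(candidate_sets, sim):
    """Pick one translation per source term, scoring each candidate
    by its total similarity to every OTHER term's candidates
    (steps 1-3 of the algorithm above)."""
    chosen = []
    for i, candidates in enumerate(candidate_sets):
        def score(w):
            return sum(sim(w, v)
                       for j, other in enumerate(candidate_sets) if j != i
                       for v in other)
        chosen.append(max(candidates, key=score))
    return chosen

# Toy run with raw co-occurrence counts standing in for sim(x, y):
counts = {("cosmic", "incident"): 70800, ("cosmic", "event"): 269000,
          ("cosmic", "lessen"): 7130, ("cosmic", "subside"): 3120,
          ("outer-space", "incident"): 26100, ("outer-space", "event"): 104000,
          ("outer-space", "lessen"): 2600, ("outer-space", "subside"): 980}
sim = lambda x, y: counts.get((x, y), 0) + counts.get((y, x), 0)
print(disambiguate([["cosmic", "outer-space"],
                    ["incident", "event"],
                    ["lessen", "subside"]], sim))
# -> ['cosmic', 'event', 'lessen']
```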
Example

Candidate sets: {cosmic, outer-space}, {incident, event}, {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}

score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
Disambiguation algorithm: sample outputs
Ireland peace talks
cosmic events
net money (?)
Results on TREC8 (disks 4 and 5)
 English topics (401-450) manually translated to Hindi
 Assumption: relevance judgments for the English topics hold for the translated queries
 Results (all TF-IDF):

Technique                   MAP
Monolingual                 23
All-translations            16
PMI-based disambiguation    20.5
Manual filtering            21.5
Pseudo-Relevance Feedback for CLIR
(User) Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents (IDF is one option for scoring terms)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
Pseudo-Relevance Feedback (PRF)
(mono-lingual)
1. Retrieve documents using the user’s query
2. Assume that the top M documents retrieved
are relevant
3. Choose the top N terms from these M
documents
4. Add these N terms to the user’s query to
form a new query
5. Use this new query to retrieve a new set of
documents
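A minimal sketch of steps 2-4, assuming two helper functions that are not defined here: retrieve(query), returning a ranked document list, and term_scores(docs), returning a {term: score} dict (e.g., IDF-weighted counts):

```python
def prf_expand(query, retrieve, term_scores, m=10, n=5):
    """Expand `query` (a list of terms) via pseudo-relevance feedback:
    treat the top M retrieved documents as relevant, take the N
    best-scoring terms from them, and append them to the query."""
    top_docs = retrieve(query)[:m]                 # steps 1-2
    scored = term_scores(top_docs)                 # step 3
    best = sorted(scored, key=scored.get, reverse=True)[:n]
    return query + [t for t in best if t not in query]  # step 4
    # step 5: call retrieve() again with the returned query
```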
PRF for CLIR
Corpus-based Query Translation
 Uses a parallel corpus of aligned document pairs:

H1 ↔ E1
H2 ↔ E2
 …
Hm ↔ Em

(Hindi collection H, English collection E)
PRF for CLIR
1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are
relevant
3. Select the M documents in E that are aligned to
the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection
(which is in the same language as E)
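The same idea in code, again under assumed helpers: retrieve_h(q) searches the Hindi side of the parallel corpus, aligned_e(doc) returns the English document aligned with doc, and term_scores is as in the mono-lingual sketch:

```python
def clir_prf_translate(hindi_query, retrieve_h, aligned_e,
                       term_scores, m=10, n=5):
    """'Translate' a Hindi query by crossing the parallel corpus:
    the top N terms of the aligned English documents become the
    translated query (steps 1-5 above)."""
    top_h = retrieve_h(hindi_query)[:m]        # steps 1-2
    top_e = [aligned_e(d) for d in top_h]      # step 3
    scored = term_scores(top_e)                # step 4
    return sorted(scored, key=scored.get, reverse=True)[:n]  # step 5
    # step 6: run this query against the English target collection
```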
Cross-Lingual Relevance Models
- Estimate relevance models using a parallel corpus
Ranking with Relevance Models
 Relevance model (or query model) $\Theta_R$: a distribution that encodes the information need
 $P(w \mid \Theta_R)$: probability of word occurrence in a relevant document
 $P(w \mid D)$: probability of word occurrence in the candidate document
 Ranking function (relative entropy, i.e., KL divergence):

$$KL(D \,\|\, R) = \sum_w P(w \mid D) \, \log \frac{P(w \mid D)}{P(w \mid \Theta_R)}$$
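A direct sketch of this ranking function; both arguments are {word: probability} dicts, and any smoothing needed to avoid zero probabilities in the relevance model is assumed to happen upstream:

```python
import math

def kl_divergence(p_doc, p_rel):
    """KL(D || R) = sum_w P(w|D) * log( P(w|D) / P(w|Theta_R) ).
    Documents are ranked by ascending divergence from the
    relevance model (lower = better match)."""
    return sum(p * math.log(p / p_rel[w])
               for w, p in p_doc.items() if p > 0)
```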
Estimating Mono-Lingual Relevance
Models
$$P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)}$$

$$P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M) \, P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M)$$

(where $h_1 \ldots h_m$ are the query terms and $\mathcal{M}$ is the set of document models)
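A small sketch of this mixture, assuming each document model is a plain {word: P(word|M)} dict and taking a uniform prior P(M) unless one is supplied:

```python
def relevance_model(query_terms, models, priors=None):
    """Estimate P(w|Theta_R) as above: weight each document model M
    by P(M) times the likelihood it assigns the whole query, then
    mix the models' word distributions and normalize."""
    priors = priors or [1.0 / len(models)] * len(models)
    joint = {}
    for p_m, model in zip(priors, models):
        weight = p_m
        for h in query_terms:                 # prod_i P(h_i | M)
            weight *= model.get(h, 0.0)
        for w, p_w in model.items():          # accumulate P(M)*lik*P(w|M)
            joint[w] = joint.get(w, 0.0) + weight * p_w
    total = sum(joint.values())               # = P(h1 ... hm)
    return {w: v / total for w, v in joint.items()} if total else joint
```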
Estimating Cross-Lingual Relevance Models
$$P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\}) \, P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H)$$

with each document model linearly smoothed against the background distribution:

$$P(w \mid M_X) = \lambda \, \frac{freq_{w,X}}{\sum_v freq_{v,X}} + (1 - \lambda) \, P(w)$$
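A sketch of both pieces: the smoothing above, and the cross-lingual mixture, which mirrors the mono-lingual sketch except that query likelihoods come from the Hindi side of each aligned pair while word probabilities come from the English side. Lambda is a tuning constant; the 0.6 default here is an arbitrary illustrative choice:

```python
def smoothed_model(freqs, background, lam=0.6):
    """P(w|M_X) = lam * freq(w,X)/sum_v freq(v,X) + (1-lam) * P(w).
    `background` must cover the vocabulary of interest."""
    total = sum(freqs.values())
    return {w: lam * freqs.get(w, 0) / total + (1 - lam) * p_bg
            for w, p_bg in background.items()}

def cross_lingual_rm(query_terms, pairs, priors=None):
    """Cross-lingual relevance model: mix over aligned (M_H, M_E)
    document pairs, scoring the Hindi query on M_H but drawing
    English word probabilities from M_E."""
    priors = priors or [1.0 / len(pairs)] * len(pairs)
    joint = {}
    for p_pair, (m_h, m_e) in zip(priors, pairs):
        weight = p_pair
        for h in query_terms:              # prod_i P(h_i | M_H)
            weight *= m_h.get(h, 0.0)
        for w, p_w in m_e.items():         # P(w | M_E)
            joint[w] = joint.get(w, 0.0) + weight * p_w
    return joint                           # normalize to get P(w | R)
```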
CLIR Evaluation – TREC
(Text REtrieval Conference)
 TREC CLIR track (2001 and 2002)
 Retrieval of Arabic language newswire documents from
topics in English
 383,872 Arabic documents (896 MB) with SGML markup
 50 topics
 Use of provided resources (stemmers, bilingual
dictionaries, MT systems, parallel corpora) is
encouraged to minimize variability
http://trec.nist.gov/
CLIR Evaluation – CLEF
(Cross Language Evaluation Forum)
 Major CLIR evaluation forum
 Tracks include
 Multilingual retrieval on news collections
 topics are provided in many languages, including Hindi
 Multiple language Question Answering
 ImageCLEF
 Cross Language Speech Retrieval
 WebCLEF
http://www.clef-campaign.org/
Summary
 CLIR techniques
 Query Translation-based
 Document Translation-based
 Intermediate Representation-based
 Query translation using dictionaries, followed by
disambiguation, is a simple and effective technique
for CLIR
 PRF uses a parallel corpus for query translation
 Parallel corpora can also be used to estimate cross-lingual relevance models
 CLEF and TREC: important CLIR evaluation
conferences
References (1)
1. Phrasal Translation and Query Expansion Techniques for Cross-Language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1997.
2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998.
3. A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005.
4. A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.
References (2)
5. Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997.
6. A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005.
7. Relevance-Based Language Models, Victor Lavrenko and W. Bruce Croft, Research and Development in Information Retrieval, 2001.
8. Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette, and W. Croft, ACM SIGIR, 2002.
Thank You