Cross-Language IR
Cross-Language Information Retrieval
Multilingual Collections
There are 6,703 languages listed in the Ethnologue
Digital libraries
OCLC Online Computer Library Center serves more
than 17,000 libraries in 52 countries and contains over 30
million bibliographic records, with over 500 million
ownership records attached, in more than 370 languages
World Wide Web
Around 40% of Internet users do not speak English;
however, 80% of Web sites are still in English
Hsin-Hsi Chen 10-3
The General Problem
 Find documents written in any language
 Using queries expressed in a single language
 Traditional IR identifies relevant documents in the same
language as the query (monolingual IR)
 Cross-language information retrieval (CLIR) tries to
identify relevant documents in a language different from
that of the query
 This problem is increasingly acute for IR on the Web
because the Web is a truly multilingual environment
Why is CLIR important?
Global Internet User Population
[Figure: pie charts of the global Internet user population by language (Spanish, Japanese, German, French, Chinese, Scandinavian, Italian, Dutch, Korean, Portuguese, Other, English), with panels labeled 2000 and 2005 and callouts for English and Chinese. Source: Global Reach.]
Importance of CLIR
 CLIR research is becoming more and more important
for global information exchange and knowledge
sharing.
 National Security
 Foreign Patent Information Access
 Medical Information Access for Patients
CLIR is Multidisciplinary
CLIR involves researchers from the following
fields:
information retrieval, natural language
processing, machine translation and
summarization, speech processing,
document image understanding,
human-computer interaction
User Needs
 Search a monolingual collection in a language that the
user cannot read.
 Retrieve information from a multilingual collection
using a query in a single language.
 Select images from a collection indexed with free text
captions in an unfamiliar language.
 Locate documents in a multilingual collection of
scanned page images.
Why Do Cross-Language IR?
 When users can read several languages
 Eliminates multiple queries
 Query in most fluent language
 Monolingual users can also benefit
 If translations can be provided
 If it suffices to know that a document exists
 If text captions are used to search for images
CLIR Experimental System
 2 systems:
 SMART Information retrieval system modified to work with 11
European languages (Danish, Dutch, English, Finnish, French,
German, Italian, Norwegian, Portuguese, Spanish, Swedish)
 Generation of restricted bigrams
 Pseudo-Relevance feedback
 TAPIR is a language model IR system written by M. Srikanth. It
has been adapted to work with 12 different European languages
(Danish, Dutch, English, Finnish, French, German, Italian,
Norwegian, Portuguese, Russian, Spanish, Swedish)
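The slide names pseudo-relevance feedback without describing it. The following is a minimal sketch of the general idea only, not the SMART or TAPIR implementation: the top-ranked documents from a first retrieval pass are assumed relevant, and their most frequent new terms are appended to the query. The function name `expand_query`, its parameters, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=2, extra_terms=2):
    """Append the most frequent new terms from the top-ranked documents.

    ranked_docs is a list of tokenized documents, best match first.
    All names here are hypothetical; real systems also weight terms.
    """
    seen = set(query_terms)
    counts = Counter()
    # Treat the first `top_docs` results as if they were relevant.
    for doc in ranked_docs[:top_docs]:
        counts.update(term for term in doc if term not in seen)
    # Sort by frequency, then alphabetically so the result is deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:extra_terms]
    return list(query_terms) + [term for term, _ in best]


if __name__ == "__main__":
    docs = [
        ["arabic", "retrieval", "stemming", "query"],
        ["arabic", "stemming", "dictionary"],
        ["football", "score"],
    ]
    print(expand_query(["arabic", "retrieval"], docs))
```

The third document is never consulted because only the top two results feed the expansion; that cutoff is the main tuning parameter in this kind of feedback.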
Approaches to CLIR
Design Decisions
 What to index?
 Free text or controlled vocabulary
 What to translate?
 Queries or documents
 Where to get translation knowledge?
 Dictionary, ontology, training corpus
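One combination of these choices (free-text indexing, query translation, dictionary as the knowledge source) can be sketched as follows. This is an illustrative toy under stated assumptions, not a method from the slides: the function `translate_query` and the small English-to-French lexicon are hypothetical.

```python
def translate_query(query_terms, lexicon):
    """Replace each query term by all of its dictionary translations.

    Terms missing from the lexicon (often proper names) are passed
    through unchanged rather than dropped.
    """
    translated = []
    for term in query_terms:
        translated.extend(lexicon.get(term, [term]))
    return translated


# Hypothetical toy lexicon; a real bilingual dictionary is far larger.
EN_FR = {
    "bank": ["banque", "rive"],  # ambiguous term: every sense is kept
    "book": ["livre"],
}

if __name__ == "__main__":
    print(translate_query(["bank", "book", "clinton"], EN_FR))
```

Keeping every translation of an ambiguous term is the simplest policy; it trades precision for recall, which is one reason the choice of translation knowledge matters.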
[Figure: taxonomy of cross-language text retrieval approaches. Node labels: Cross-Language Text Retrieval; Controlled Vocabulary, Free Text; Query Translation, Document Translation; Text Translation, Vector Translation; Knowledge-based, Corpus-based; Ontology-based, Dictionary-based, Thesaurus-based; Term-aligned, Sentence-aligned, Document-aligned, Unaligned; Parallel, Comparable.]
Problems with CLIR
 Morphological processing difficult for some languages
(e.g. Arabic)
 Many different encodings for Arabic
 Windows Arabic (e.g. dictionaries)
 Unicode (UTF-8) (e.g. corpus)
 Macintosh Arabic (e.g. queries)
 Normalization
 Remove diacritics
 العَرَبِيَّة to العربية Arabic (language)
 Standardize spellings for foreign names
 كلينتون vs كلنتون “Kleentoon” vs “Klntoon” for Clinton
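The encoding and normalization steps above can be sketched with the Python standard library. This is a minimal illustration under assumptions: `cp1256` is Python's codec name for Windows Arabic, Macintosh Arabic is left out, and the function names and the one-entry name table are hypothetical. Note that the filter removes every Unicode nonspacing mark, which is broader than the Arabic short-vowel diacritics alone.

```python
import unicodedata


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes from a legacy encoding so all sources share one form."""
    return raw.decode(encoding)


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Hypothetical table mapping a variant spelling of a foreign name to one
# canonical form ("Klntoon" -> "Kleentoon").
NAME_VARIANTS = {
    "\u0643\u0644\u0646\u062a\u0648\u0646": "\u0643\u0644\u064a\u0646\u062a\u0648\u0646",
}


def normalize(text: str) -> str:
    bare = strip_diacritics(text)
    return " ".join(NAME_VARIANTS.get(tok, tok) for tok in bare.split())


# al-'arabiyya written with diacritics, and the same word without them.
VOCALIZED = "\u0627\u0644\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651\u064e\u0629"
BARE = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

if __name__ == "__main__":
    from_dictionary = to_unicode(BARE.encode("cp1256"), "cp1256")
    from_corpus = to_unicode(VOCALIZED.encode("utf-8"), "utf-8")
    print(normalize(from_dictionary) == normalize(from_corpus))
```

Decoding first and normalizing second keeps the two concerns separate, so a new source encoding only needs a new codec name.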
Problems with CLIR (contd)
 Morphological processing (contd.)
 Arabic stemming
root + pattern + suffixes + prefixes = word
ktb + CiCaC = kitab
All verbs and nouns derived from fewer than 2000 roots
Roots too abstract for information retrieval
ktb →
kitab      a book
alkitab    the book
kataba     to write
maktab     office
maktaba    library, bookstore
kitabi     my book
kitabuki   your book (f)
kitabuka   your book (m)
kitabuhu   his book
...
Want stem = root + pattern + derivational affixes?
No standard stemmers available,
only morphological (root) analyzers
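The root-and-pattern arithmetic above can be sketched on the transliterated examples from the slide. This is a toy illustration, not a real Arabic stemmer: the function names and the short affix lists are assumptions, chosen only to cover the words listed above.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot in the pattern with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Hypothetical affix lists in transliteration, longest suffixes first.
PREFIXES = ["al"]
SUFFIXES = ["uhu", "uki", "uka", "i"]


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one known prefix and one known suffix, keeping the pattern.

    The result stays at the 'kitab' level instead of collapsing to the
    abstract root 'ktb'.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(apply_pattern("ktb", "CiCaC"))
    for w in ["alkitab", "kitabi", "kitabuhu", "maktaba"]:
        print(w, "->", light_stem(w))
```

Under this sketch the inflected forms of "book" conflate to one stem while "maktaba" (library) stays distinct, which is the behavior the slide asks for when it says roots are too abstract for retrieval.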
Problems with CLIR (contd)
 Availability of resources
 Names and phrases are very important, but most lexicons do
not cover them well
 Difficult to get hold of bilingual dictionaries
 can sometimes be found on the Web
 e.g. for a recent Arabic cross-lingual evaluation we used 3 online
Arabic-English dictionaries (including harvesting) and a
small lexicon of country and city names
 Parallel corpora are more difficult to obtain and require more
formal arrangements
CLIR better than IR?
 How can cross-language beat within-language?
 We know there are translation errors
 Surely those errors should hurt performance
 Hypothesis is that translation process may disambiguate
some query terms
 Words that are ambiguous in Arabic may not be ambiguous in
English
 Expansion during translation from English to Arabic prevents
the ambiguity from re-appearing
 Has been proposed that CLIR is a model for IR
 Translate query into one language and then back to original
 Given hypothesis, should have an improved query
 Should be reasonable to do this across many different languages
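The round-trip proposal can be sketched mechanically: translate the query into another language and back, collecting what returns. This is only an illustration of the mechanics, assuming two hypothetical toy lexicons (English/French is used here purely for readability); whether the resulting query actually retrieves better is the hypothesis, not something the sketch shows.

```python
def translate(terms, lexicon):
    """Map each term to its translations, keeping first-seen order."""
    out = []
    for term in terms:
        for candidate in lexicon.get(term, [term]):
            if candidate not in out:
                out.append(candidate)
    return out


def round_trip(terms, forward, backward):
    """Translate into the other language and back to the original."""
    return translate(translate(terms, forward), backward)


# Hypothetical toy lexicons for illustration only.
EN_FR = {"car": ["voiture", "automobile"]}
FR_EN = {"voiture": ["car"], "automobile": ["car", "automobile"]}

if __name__ == "__main__":
    print(round_trip(["car"], EN_FR, FR_EN))
```

In this toy case the query comes back with a synonym added, which is the expansion effect the hypothesis relies on; with noisier lexicons the same loop can also bring back unrelated senses.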
