Published on

IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Cross Language Text Retrieval: A Review P.ISWARYA*, Dr.V.RADHA** *(Department of Computer Science, Avinashilingam Institute for Home science and Higher Education for Women, Coimbatore-43) ** (Department of Computer science, Avinashilingam Institute for Home science and Higher Education for Women, Coimbatore-43)ABSTRACT The World Wide Web has evolved into a information should be easily accessible andtremendous source of information and it digestible. With 100 million internet users, India is atcontinues to grow at exponent rate. Now a day’s 3 rd place globally in usage of internet. Though theweb servers are storing different types of contents network shrank the globe, the languagein different languages and their usage is diversification is a great barrier to attain full benefitincreasing rapidly. According to Online of the digital life. Hence there is a need to develop aComputer Library Center, English is still a technique like Cross Language Text Retrieval whichdominant language in the Web. Only small is used to retrieve text documents in a language otherpercentage of population is familiar with English than the user used to specify the query. Thereforelanguage and they can express their queries in Internet is no longer monolingual and non EnglishEnglish to access the content in a right way. Due contents are accessed rapidly. There are threeto globalization, content storage and retrieval different types of AdHoc CLTR which are asmust be possible in all languages, which is follows:essential for developing nations like India.  Monolingual Information Retrieval System –Diversity of languages is becoming great barrier refers to Information Retrieval System thatto understand and enjoy the benefits in digital can find relevant documents in the sameworld. Cross Language Information Retrieval language as the query was expressed.(CLIR) is a subfield of Information retrieval;  Bilingual Information Retrieval System thatuser can retrieve the objects in a language allows you to querying in one language anddifferent from the language of query expressed. finding documents in another language.These objects may be text documents, passages,  Multilingual Information Retrieval System-audio or video and images. Cross Language Text allows you to make query in one languageRetrieval (CLTR) is used to return text and able to find documents in multipledocuments in a language other than query languages.language. CLTR technique allows crossing the This paper continues to focus on following sections:language barrier and accessing the web content Section 2 explains about different translation types inin an efficient way. This paper reviews some types CLTR and Section 3 about how these translationsof translations carried out in CLTR, ranking can be carried out using different approaches.methods used in retrieval of documents and some Section 4 states about various methods used forof their related works. It also discusses about ranking the results while retrieving the documents.various approaches and their evaluation Section 5 deals with previous work in CLTR. Finallymeasures used in various applications. conclusion is presented in Section 6.Keywords- Ambiguity, Bilingual, 2. TYPE OF TRANSLATION IN CLTRDisambiguation, Monolingual, Multilingual, The most important issue in CrossTransliteration. Language Text Retrieval is that queries and documents are in different languages. When the user1. INTRODUCTION pose a query in one language, either query or The number of Web users accessing the document or both translation takes place. TheseInternet become increasing day to day because translations are done using free text and controlledpeople can access any kind of required information vocabulary approaches. The following Fig1 showsat anytime. Information Retrieval (IR) mainly refers the overview of CLTR system.to a process that the user can find requiredinformation or knowledge from corpus includingdifferent kinds of information [1]. InformationRetrieval is the fact that there is vast amount ofgarbage that surrounds any useful information; such 1036 | P a g e
  2. 2. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Cross Language Text Retrieval using Machine Translation, Bilingual dictionary and corpus resource. Query translation is simpler than translation of whole documents and it is cost Query Translation Document translation efficient too. The performance of the system heavily depends on how the query is translated effectively. But there are several complexities in achieving good Controlled Vocabulary Free Text query translation they are translation ambiguity, missing terminology, idioms and compound words and untranslatable terms. The overall process Machine involved in Query translation Process is shown in Dictionary Ontology Corpus-based Fig 2. Translation Based Based 2.2 Document Translation Translation can be done in other way by translating Comparable Monolingual Parallel the documents into the language of query and this corpora corpora Corpora document translation achieved manually or through various machine translation systems. Translating 400 Figure 1: Overview of CLTR system million non-English web pages of World Wide Web would take 100,000 days (300 years) in one fast PC 2.1 Query Translation or 1 month in 3600 PC [3]. However, it is an Usually the query is translated into the expensive job to be done once for each query language of target collection of documents. The first language and most importantly the quality of the step involves parsing the natural language query translation will be much better because documents specified by the user in their native language. The provide much more context for a machine translation given query sentence is segmented and indexed using program to work with. In particular, when it comes Morphological analyzer, Part Of Speech tagging, to minority languages, the cost becomes almost Stemming and Stop word removal. unbearable. 3. VARIOUS APPROACHES FOR QUERY WordNet/S ynonymn AND DOCUMENT TRANSLATIONS User query Query Expansion As mentioned above the query and document can be translated using the following Ontology different techniques such as Controlled vocabulary,Input processing free text based or a combination of multiple system Translation/ Machine techniques. The controlled vocabulary is the first and translation traditional technique widely used in libraries andPOS tagging Transliteration documentation centres. Documents are indexed Corpus manually using fixed terms which are also used for Bilingual queries. However, this approach remains limited to application whose vocabulary is still manageable. Information dictionary The efficiency and effectiveness degrade radically Retrieval when size of vocabulary increases. The alternate approach/way to controlled vocabulary is to use the words which appear in the Figure2: Major Steps in Query Translation documents themselves as the vocabulary, such Approach systems are referred as free text retrieval systems. Normally user query is short hence ambiguity 3.1 Machine Translation approach problem arises; to overcome this drawback the query Machine Translation (MT) systems that can be expanded using Word Net/Ontology. Fatiha investigates the use of software to translate text or sadat [2] states that a combination of query speech from one natural language to another. The expansions before and after query translation will main idea of MT system improve the precision of Information Retrieval as well as recall. Translation of query can be done 1037 | P a g e
  3. 3. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Table1: Overview of machine Translation system Systems Year Organization/Institute Domain Anusaaraka 1995 IIT kanpur Children stories Mantra 1999 C-DAC,Bangalore Rajya sabha Secretariat (official circulars) Matra 2004 C-DAC,Mumbai News stories Angla Bharti 1991 IIT kanpur Customizationis to carry out a translation without aid of human 3.3 Corpus based approachassistance. Every Machine translation system A Corpus is a repository of a collection ofrequires programs for translation, automated natural language material, it analyzes largedictionaries and grammars to support translation. collections of existing texts (corpora) andThere are different types of MT exists they are automatically extracts the information needed onbased on direct MT method, Interlingua, transfer which the translation will be based. The alignedmethod and Empirical machine translation corpus contains text samples in one language andapproach. In India machine Translation systems their translations into other language are alignedhave been developed for translation from English to sentence by sentence, word by word, document byIndian languages [4] is shown in table 1. document or even character by character. CorpusDocument translation using MT system is may contain same language contents from sameexpensive and it is not suitable for large collections domain called Monolingual corpora or unalignedand possibly many query languages. Query corpora.translation using MT system does not context for A parallel corpus is a collection whichaccurate translation, it is inadequate for good may contain documents and their translations inCLTR. more than one language; they are translation-3.2 Dictionary-based approach equivalent pairs. Actually a source language query Dictionary approach is used to translate is not translated; it can be matched against thethe query, the basic idea in dictionary-based cross source language component of bilingual parallellanguage text retrieval is to replace each term in the corpus. Then target language component aligned toquery with an appropriate term or set of terms in it can be easily retrieved. Parallel corpora can bethe desired language. CLTR depends on the quality created using human translation, websites in moreand coverage of dictionary, in manually created than one language or using MT methods. “Spider”bilingual dictionary has good quality but poor systems have been developed to collect documentscoverage. Dictionaries are electronic versions of that have translation equivalents over the internet toprinted dictionaries and may be general dictionaries produce corpora [5]. The alignment process can beor specific domain dictionaries or a combination of done by comparing documents by the presence ofboth. The major problems of dictionary based indicators and to construct a parallel corpus is veryapproach are translation ambiguity, out-of- difficult because they require more formalvocabulary terms, word inflection and phrase arrangements. The indicator could be an authoridentification. The below Fig 3 shows the overall name, document date, source, special names in theprocess in dictionary based translation. document, numbers or acronyms, in fact anything which clearly corresponds in both the source and target language texts. Erbug celebi [6] used Bilingual bilingual parallel corpus consists of 1056 Turkish User Translation Query dictionary and 1056 English parallel documents for their experiment sample English to Turkish document shown below table 2. A comparable corpus is a document Translated query in target collection in which documents are aligned based on language the similarity between the topics which they address; they are content-equivalent documentFigure 3: Dictionary based query translation pairs. Corpora is hard to maintain, it tend to be domain/application dependent to achieve effective performance. 1038 | P a g e
  4. 4. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Table2: Example Document English to where f(qi,D) is the term frequency of qi in D, |D| isTurkish parallel corpus length of document D, k1 & b are free parameters to be set, avgdl is the average length of document in corpus, N is the total no. of documents in collection, n(qi) is the number of documents containing qi. 4.2 Language modeling algorithm The likelihood that a given document d will generate a given query q is used to score candidate documents, and is given in equation (3) (3)3.4 Ontology-based approachOntology defines the basic terms and relationscomprising the vocabulary of a topic area as well asthe rules for combining terms and relations todefine extensions to the vocabulary. Ontology is anexplicit specification of a conceptualizationOntology‟s can be implemented in translationsystems to extract conceptual relations formonolingual and cross language IR. 4.3 TFIDF model Sarawathi [7] used Ontological tree for The tfidf weight (term frequency–inversetheir analysis and keyword retrieval, any number document frequency) is a numerical statistic whichlanguages can be used without restriction. It reflects how important a word is to a document in arequires only single mapping from any language to collection or corpus. The term count in the givenany other language. It can also used for other document is simply the number of times apurposes such as keywords language identification given term appears in that document. This count isand sub keyword extraction. usually normalized to prevent a bias towards longer documents (which may have a higher term count4. RANKING METHODS IN regardless of the actual importance of that term inINFORMATION RETRIEVAL the document) to give a measure of the importance An efficient ranking algorithm is essential of the term t, within the particular document d.in any information retrieval system. The role of Thus we have the term frequency tf(t,d).ranking algorithms is to select the documents that The inverse document frequency is a measure ofare most likely be able to satisfy the user needs and whether the term is common or rare across allbring them in top positions. documents. It is obtained by dividing the totalSome of the common used ranking methods in number of documents by the number of documentscross language information retrieval are discussed containing the term, and then takingbelow. Michael Sperriosu [8] compared the Okapi the logarithm of that quotient is given by equationBM25 model and Language Modelling (LM) (4).algorithm says that on a very shallow level Okapioutperforms LM. By reviewing simple TFIDF (4)based ranking algorithm may not result in effectiveCLTR systems for Indian language Queries [9]. with4.1 Okapi BM25 model |D|: Cardinality of D, or the total number ofConsider a user query Q = {q1, q2,....qn} and documents in the corpusdocument D, the BM25 score of the document D is : Number of documentsas given in (1) and (2): where the term appears (i.e.,tf(t,d)≠0 ). If the term is not in the corpus, this will lead to a (1) division-by-zero. It is therefore common to adjust the formula to . (2) Then the tf*idf is calculated using equation (5). 1039 | P a g e
  5. 5. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 NTCIR Asian Language Retrieval, and Text (5) Retrieval Evaluation Conference (TREC) etc. In4.4 Page rank algorithm this paper, different translations and their various Page Rank is a probability distribution used to approaches with merits and demerits are statedrepresent the likelihood that a person randomly clearly. The most widely used ranking methods inclicking on links will arrive at any particular page. Information Retrieval are discussed and CrossPageRank can be calculated for collections of Language Text retrieval in different languages indocuments of any size. A probability is expressed different domain is compared. Throughas a numeric value between 0 and 1. Hence, a investigation of previous research work most of thePageRank of 0.5 means there is a 50% chance that paper carried out query translation and somea person clicking on a random link will be directed researchers have used hybrid approach to achieveto the document with the 0.5 PageRank. In the query translation and attain acceptable results. Ingeneral case, the PageRank value for any most of research works, document translation is notpage „u‟ can be expressed in equation (6): feasible because of the size of translation. Based on (6) the review, the Okapi BM25 ranking model slightly , outperformed LM algorithm and simple TFIDFi.e. the Page Rank value for a page u is dependent model is not suitable for Indian language queries.on the Page Rank values for each page v contained Looking at the statistics of languages on thein the set Bu (the set containing all pages linking to Internet it seems that there is a huge market forpage u), divided by the number L(v) of links from cross-language information retrieval products.page v. REFERENCES5. EVALUATION MEASURES [1] Tao zhang and Yue-Jie zhang, “ResearchPrecision and Recall are used to evaluate the on Chinese-English CLIR” 2008effectiveness of the CLIR system. International conference on machine learning and cybernetics, volume-5.5.1 Precision [2] Fatiha Sadat et al, “Cross language It is defined as the number of relevant Information Retrieval Via Dictionary-documents retrieved by a search divided by the based and Stastistics-based Methods”total number of documents retrieved by that search ,Conference on Communication,is given in equation (7). Precision takes all Computer and Signal Processing ,2001.retrieved documents into account, but it can also be [3] Peter schauble, “CLEF 2000 State of Art:evaluated at a given cut-off rank, considering only Multilingual Information Access”,the topmost results. Eurospider Information Technology AG (7) [4] Sanjay Kumar Dwivedi et al,” Machine Translation System in Indian Perspectives”, Journal of Computer5.2 Recall Science, 2010. It is defined as the number of relevant [5] Mustafa Abusalah.., “Literature Review ofdocuments retrieved by a search divided by the Cross language Information Retrieval”,total number of existing relevant documents is World Academy of science, engineeringgiven in equation (8). Recall in IR is the fraction of and technology,2005.the documents that are relevant to the query that are [6] Erbug Celebi, Baturman Sen, Buraksuccessfully retrieved. It can be looked as the Gunel, “Turkish – English Cross Languageprobability that a relevant document is retrieved by Information Retrieval using LSI”the query. International Symposium on Computer (8) and Information Sciences, 2009. [7] Saraswathi et al, “BiLingual Information Retrieval System for English and Tamil”, Journal of Computing, April 2010.The comparison of various translations and [8] Michael Speriosu et al, “Comparision oftechniques adopted for different languages in Okapi BM25 ansd Language Modelingdifferent domain are given in table 3. Algorithm for NTCIR-6”,Manuscript, Just Systems Corporation, September 2006.6. CONCLUSION In the past five years, research in Cross [9] Sivaji Bandyopadhyay, Tapabrata Mondal,Lingual Information Access has been vigorously Sudip Kumar Naskar, Asif Ekbal,pursued through several international forums, such Rejwanul Haque, Srinivasa Raoas, the Cross-Language Evaluation Forum (CLEF), 1040 | P a g e
  6. 6. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Godavarthy, “Bengali, Hindi and Telugu 2007 “Advances in Multilingual and to English Ad-hoc bilingual task at CLEF Multimodal Information Retrieval. Table 3: Comparison of different translations and their approaches in CLTR.Authors Languag Docume Domain Indexing Translatio Trans- Query Ranking Results for CLIR e query nt Unit n literation expansio /Retrieval language nTao Chinese English AP Word Bilingual No synonym MIT‟s MAPzhang and Newswi based dictionary dictionar method, E-EIR: 0.3187and English re 88-90 segmentatio y probabilisti C-ECLIR: 0.2833Yue-Jie n and n- c methodzhang gram based and approach automatic relevance feedbackAnurag English Hindi Newspa No Shabdanjal No No Monolingu AverageSeetha and hindi pers i bilingual al retrieval Precision:Monolinguaet al 2003- dictionary system l:0.5318 2004 CLIR:0.3446Vivek English English Allahab Stemming Dictionary Hash No Vector Change in the valuesPemawa and hindi and hindi ad and stop database datastruct based of precision andt et al museum word ure model and recall as number of removal mapping Google documents increases. API(docum ent translation)Sivaji Bengali, English Los Porter Bilingual modified yes TFIDF The system performsBandyo Hindi Angeles Stemmer,n- dictionary joint model best for the Telugupadhyay and Times gram source- followed by Hindiet al telugu of 2002 indexing channel and Bengali. and zonal model indexingZhang English Chinese Random No Bilingual No No Vector The proposed methodXiao-fei chinese parallel space outperforms purelyet al web corpus model dictionary based pages baseline by 14.8%Fatiha French English TREC Porter Bilingual No Interactiv NAMAZU Different combinationSadat et and volume stemmer dictionary, e retrieval of query expansional English 1 and 2 and stop statistic relevance system before and after collecti words based feedback, translation with an on method a Domain OR operator shown a and Feedback best average precision Parallel and with 99.13% of Corpora. similarity monolingual thesaurus performance. .Jagadee Hindi, English Los Stop word Bilingual yes no Language CLIR performance:sh tamil,telu Angeles removal statistical Modeling 73% of monolingualJagarla gu,Benga Times and porter dictionary systemmudi et li and stemmer and wordal Marathi by word translationPrasad Hindi English Los Lucene TFIDF yes no Vector Hybrid BooleanPingali and Angeles Framework algorithm based formulation improveset al telugu Times in ranking ranking of documents 2002 combinatio using a n with variant of Bilingual TFIDF 1041 | P a g e
  7. 7. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 dictionary ranking algorithmErbug Turkish Turkish Skylife Porter Parallel No No Latent It increases theCelebi and and Magazi stemmer corpus Semantic retrieval performanceet al English English ne and indexing 3 times when the Longest- direct matching is match considered. stemming algorithmMoham Arabic English Standar Porter ALKAFI No No AIRE In this methodmed d TREC stemmer Machine search description field isAljlayl -7 and 9 and K-stem Translatio engine more effective thanet al collecti algorithm n system title and narrative. onsAri Finnish English TREC K-stem and Medical No No INQUERY This method able toPrikola and medicin TWOL dictionary retrieval achieve performance English e and morphologi and system level of monolingual health cal analyser general system, if the queries related dictionary are structured. topicsAntony English Kannada( Indian Segmentati Alligned Sequence no no Comparison withP.J, et al target place on,Romani parallel labelling Google Indic – word) names zation corpus method,S Transliteration VM proposed model gives kernel better resultsPattabhi Tamil English The Tamil Bilingual Statistical WordNet Okapi MAP:0.3980 RecallR.K Rao and telegrap Morphologi dictionary method and BM25 precision:0.3742T and English h cal DescriptiSobha. analyzer,Lu on FieldL cene Indexer and porter stemmerManoj Hindi , English Los Stemmer Bilingual Devanaga No Okapi MAP Monolingualkumar Marathi Angles and dictionary ri to BM25 IR:0.4402Chinnak and Times Morphologi English Hindi tootla et al English 2002 cal Transliter English:0.2952 & Analyzer ation Marathi to English:0.2163Saravan Tamil,En English The Porter Probabilist Machine No Language MAP Monolingualan et al glish, telegrap Stemmer ic lexicon Transliter Modeling IR:0.5133 Hindi h and and ation, Hindi to Eng:0.4977 Alignment Parallel Transliter & Tamil to Model Corpora & ation Eng:0.4145 Bilingual Similarity dictionary ModelB.Manik Tamil English Random Stopword Bilingual yes No Summarizat It finds the efficientandan webpag removal dictionary ion strategy to implementand es and technique query translation. InDr.R.Sh Stemmer future the algorithmriram will be tested for more parameters.R.Shrira Tamil English On sales Stop word Lexicon no WordNet Data The proposedm and system list and and mining approach performsVijayan Porter Ontology methods(ca slightly higher thanSuguma Stemmer tegorization traditional approach.ran and aggregation 1042 | P a g e
  8. 8. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 )S.Then Tamil English Agricult Morphologi Machine Named WordNet No MAP:95% ofmozhi and ure cal Translatio Entity monolingual systemand English Analyzer, n, RecognizC.Aravi POS tagger Bilingual erndan dictionary and Local word reorderingSaraswa Tamil Tamil Festival Tamil Machine No Ontology Page rank Tamil Increased bythi et al and and morphologi Translatio 60%. English English English cal n increased by 40% analyzer, POS tagger and porter stemmerChaware Hindi, English Shoppin Text to No Char by No - Efficiency dependsand Gujarathi g mall phonetic char, char on minimum numberSrikanth and algorithm to ASCII of keys to be mapped.a Rao Marathi mappingNikolao Greek Greek Hellenic Parallel no no LSI-SOM It performs very wells and and history corpus on experimentsAmpazis English English on theet al Internet websiteMichel French English Hansard No Machine no no LSI It performs quite wellL.Littma and and collecti translation- in CL-LSI.n et al English English on LSI, 1986- Crosslangu 1989(C age-LSI anadian parliam ent proceed ings) 1043 | P a g e