SlideShare a Scribd company logo
1 of 8
Download to read offline
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                   (IJERA)             ISSN: 2248-9622       www.ijera.com
                     Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                    Cross Language Text Retrieval: A Review
                                P.ISWARYA*, Dr.V.RADHA**
   *(Department of Computer Science, Avinashilingam Institute for Home science and Higher Education for
                                        Women, Coimbatore-43)
  ** (Department of Computer science, Avinashilingam Institute for Home science and Higher Education for
                                        Women, Coimbatore-43)

ABSTRACT
         The World Wide Web has evolved into a          information should be easily accessible and
tremendous source of information and it                 digestible. With 100 million internet users, India is at
continues to grow at exponent rate. Now a day’s         3 rd place globally in usage of internet. Though the
web servers are storing different types of contents     network shrank the globe, the language
in different languages and their usage is               diversification is a great barrier to attain full benefit
increasing rapidly. According to Online                 of the digital life. Hence there is a need to develop a
Computer Library Center, English is still a             technique like Cross Language Text Retrieval which
dominant language in the Web. Only small                is used to retrieve text documents in a language other
percentage of population is familiar with English       than the user used to specify the query. Therefore
language and they can express their queries in          Internet is no longer monolingual and non English
English to access the content in a right way. Due       contents are accessed rapidly. There are three
to globalization, content storage and retrieval         different types of AdHoc CLTR which are as
must be possible in all languages, which is             follows:
essential for developing nations like India.                 Monolingual Information Retrieval System –
Diversity of languages is becoming great barrier                 refers to Information Retrieval System that
to understand and enjoy the benefits in digital                  can find relevant documents in the same
world. Cross Language Information Retrieval                      language as the query was expressed.
(CLIR) is a subfield of Information retrieval;               Bilingual Information Retrieval System that
user can retrieve the objects in a language                      allows you to querying in one language and
different from the language of query expressed.                  finding documents in another language.
These objects may be text documents, passages,               Multilingual Information Retrieval System-
audio or video and images. Cross Language Text                   allows you to make query in one language
Retrieval (CLTR) is used to return text                          and able to find documents in multiple
documents in a language other than query                         languages.
language. CLTR technique allows crossing the            This paper continues to focus on following sections:
language barrier and accessing the web content          Section 2 explains about different translation types in
in an efficient way. This paper reviews some types      CLTR and Section 3 about how these translations
of translations carried out in CLTR, ranking            can be carried out using different approaches.
methods used in retrieval of documents and some         Section 4 states about various methods used for
of their related works. It also discusses about         ranking the results while retrieving the documents.
various approaches and their evaluation                 Section 5 deals with previous work in CLTR. Finally
measures used in various applications.                  conclusion is presented in Section 6.

Keywords- Ambiguity,      Bilingual,                    2. TYPE OF TRANSLATION IN CLTR
Disambiguation,     Monolingual,    Multilingual,                 The most important issue in Cross
Transliteration.                                        Language Text Retrieval is that queries and
                                                        documents are in different languages. When the user
1. INTRODUCTION                                         pose a query in one language, either query or
         The number of Web users accessing the          document or both translation takes place. These
Internet become increasing day to day because           translations are done using free text and controlled
people can access any kind of required information      vocabulary approaches. The following Fig1 shows
at anytime. Information Retrieval (IR) mainly refers    the overview of CLTR system.
to a process that the user can find required
information or knowledge from corpus including
different kinds of information [1]. Information
Retrieval is the fact that there is vast amount of
garbage that surrounds any useful information; such




                                                                                               1036 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                                (IJERA)             ISSN: 2248-9622       www.ijera.com
                                  Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                   Cross Language Text Retrieval                       using Machine Translation, Bilingual dictionary and
                                                                       corpus resource. Query translation is simpler than
                                                                       translation of whole documents and it is cost
     Query Translation           Document translation
                                                                       efficient too. The performance of the system heavily
                                                                       depends on how the query is translated effectively.
                                                                       But there are several complexities in achieving good
   Controlled Vocabulary                  Free Text
                                                                       query translation they are translation ambiguity,
                                                                       missing terminology, idioms and compound words
                                                                       and untranslatable terms.     The overall process
  Machine                                                              involved in Query translation Process is shown in
                   Dictionary        Ontology         Corpus-based     Fig 2.
  Translation      Based             Based
                                                                       2.2 Document Translation
                                                                       Translation can be done in other way by translating
                Comparable            Monolingual          Parallel    the documents into the language of query and this
                corpora               corpora              Corpora     document translation achieved manually or through
                                                                       various machine translation systems. Translating 400
            Figure 1: Overview of CLTR system                          million non-English web pages of World Wide Web
                                                                       would take 100,000 days (300 years) in one fast PC
            2.1 Query Translation                                      or 1 month in 3600 PC [3]. However, it is an
                     Usually the query is translated into the          expensive job to be done once for each query
            language of target collection of documents. The first      language and most importantly the quality of the
            step involves parsing the natural language query           translation will be much better because documents
            specified by the user in their native language. The        provide much more context for a machine translation
            given query sentence is segmented and indexed using        program to work with. In particular, when it comes
            Morphological analyzer, Part Of Speech tagging,            to minority languages, the cost becomes almost
            Stemming and Stop word removal.                            unbearable.

                                                                       3. VARIOUS APPROACHES FOR QUERY
                                                           WordNet/S
                                                           ynonymn
                                                                       AND DOCUMENT TRANSLATIONS
 User query
                             Query Expansion                                    As mentioned above the query and
                                                                       document can be translated using the following
                                                           Ontology
                                                                       different techniques such as Controlled vocabulary,
Input processing                                                       free text based or a combination of multiple
    system                 Translation/                  Machine       techniques. The controlled vocabulary is the first and
                                                         translation   traditional technique widely used in libraries and
POS tagging                Transliteration
                                                                       documentation centres. Documents are indexed
                                                         Corpus        manually using fixed terms which are also used for
                                                         Bilingual
                                                                       queries. However, this approach remains limited to
                                                                       application whose vocabulary is still manageable.
                                 Information             dictionary    The efficiency and effectiveness degrade radically
                                  Retrieval                            when size of vocabulary increases.
                                                                                The alternate approach/way to controlled
                                                                       vocabulary is to use the words which appear in the
            Figure2: Major Steps in Query Translation
                                                                       documents themselves as the vocabulary, such
            Approach
                                                                       systems are referred as free text retrieval systems.
            Normally user query is short hence ambiguity
                                                                       3.1 Machine Translation approach
            problem arises; to overcome this drawback the query
                                                                                Machine Translation (MT) systems that
            can be expanded using Word Net/Ontology. Fatiha
                                                                       investigates the use of software to translate text or
            sadat [2] states that a combination of query
                                                                       speech from one natural language to another. The
            expansions before and after query translation will
                                                                       main idea of MT system
            improve the precision of Information Retrieval as
            well as recall. Translation of query can be done




                                                                                                            1037 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                   (IJERA)             ISSN: 2248-9622       www.ijera.com
                     Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                               Table1: Overview of machine Translation system
          Systems               Year            Organization/Institute Domain


          Anusaaraka                1995           IIT kanpur                   Children stories

          Mantra                    1999           C-DAC,Bangalore              Rajya sabha Secretariat
                                                                                (official circulars)

          Matra                     2004           C-DAC,Mumbai                 News stories

          Angla Bharti              1991           IIT kanpur                   Customization


is to carry out a translation without aid of human              3.3 Corpus based approach
assistance. Every Machine translation system                             A Corpus is a repository of a collection of
requires programs for translation, automated                    natural language material, it analyzes large
dictionaries and grammars to support translation.               collections of existing texts (corpora) and
There are different types of MT exists they are                 automatically extracts the information needed on
based on direct MT method, Interlingua, transfer                which the translation will be based. The aligned
method and Empirical machine translation                        corpus contains text samples in one language and
approach. In India machine Translation systems                  their translations into other language are aligned
have been developed for translation from English to             sentence by sentence, word by word, document by
Indian languages [4] is shown in table 1.                       document or even character by character. Corpus
Document translation using MT system is                         may contain same language contents from same
expensive and it is not suitable for large collections          domain called Monolingual corpora or unaligned
and possibly many query languages. Query                        corpora.
translation using MT system does not context for                          A parallel corpus is a collection which
accurate translation, it is inadequate for good                 may contain documents and their translations in
CLTR.                                                           more than one language; they are translation-
3.2 Dictionary-based approach                                   equivalent pairs. Actually a source language query
          Dictionary approach is used to translate              is not translated; it can be matched against the
the query, the basic idea in dictionary-based cross             source language component of bilingual parallel
language text retrieval is to replace each term in the          corpus. Then target language component aligned to
query with an appropriate term or set of terms in               it can be easily retrieved. Parallel corpora can be
the desired language. CLTR depends on the quality               created using human translation, websites in more
and coverage of dictionary, in manually created                 than one language or using MT methods. “Spider”
bilingual dictionary has good quality but poor                  systems have been developed to collect documents
coverage. Dictionaries are electronic versions of               that have translation equivalents over the internet to
printed dictionaries and may be general dictionaries            produce corpora [5]. The alignment process can be
or specific domain dictionaries or a combination of             done by comparing documents by the presence of
both. The major problems of dictionary based                    indicators and to construct a parallel corpus is very
approach are translation ambiguity, out-of-                     difficult because they require more formal
vocabulary terms, word inflection and phrase                    arrangements. The indicator could be an author
identification. The below Fig 3 shows the overall               name, document date, source, special names in the
process in dictionary based translation.                        document, numbers or acronyms, in fact anything
                                                                which clearly corresponds in both the source and
                                                                target language texts. Erbug celebi [6] used
                                           Bilingual            bilingual parallel corpus consists of 1056 Turkish
    User              Translation
    Query
                                           dictionary           and 1056 English parallel documents for their
                                                                experiment sample English to Turkish document
                                                                shown below table 2.
                                                                         A comparable corpus is a document
                    Translated query
                        in target                               collection in which documents are aligned based on
                       language                                 the similarity between the topics which they
                                                                address; they are content-equivalent document
Figure 3: Dictionary based query translation                    pairs. Corpora is hard to maintain, it tend to be
                                                                domain/application dependent to achieve effective
                                                                performance.



                                                                                                     1038 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                     (IJERA)             ISSN: 2248-9622       www.ijera.com
                       Vol. 2, Issue 5, September- October 2012, pp.1036-1043
        Table2: Example Document English to                   where f(qi,D) is the term frequency of qi in D, |D| is
Turkish parallel corpus                                       length of document D, k1 & b are free parameters
                                                              to be set, avgdl is the average length of document in
                                                              corpus, N is the total no. of documents in
                                                              collection, n(qi) is the number of documents
                                                              containing qi.
                                                              4.2 Language modeling algorithm
                                                              The likelihood that a given document d will
                                                              generate a given query q is used to score candidate
                                                              documents, and is given in equation (3)


                                                                                                              (3)



3.4 Ontology-based approach
Ontology defines the basic terms and relations
comprising the vocabulary of a topic area as well as
the rules for combining terms and relations to
define extensions to the vocabulary. Ontology is an
explicit specification of a conceptualization
Ontology‟s can be implemented in translation
systems to extract conceptual relations for
monolingual and cross language IR.
                                                              4.3 TFIDF model
         Sarawathi [7] used Ontological tree for
                                                                        The tfidf weight (term frequency–inverse
their analysis and keyword retrieval, any number
                                                              document frequency) is a numerical statistic which
languages can be used without restriction. It
                                                              reflects how important a word is to a document in a
requires only single mapping from any language to
                                                              collection or corpus. The term count in the given
any other language. It can also used for other
                                                              document is simply the number of times a
purposes such as keywords language identification
                                                              given term appears in that document. This count is
and sub keyword extraction.
                                                              usually normalized to prevent a bias towards longer
                                                              documents (which may have a higher term count
4.   RANKING    METHODS                          IN           regardless of the actual importance of that term in
INFORMATION RETRIEVAL                                         the document) to give a measure of the importance
         An efficient ranking algorithm is essential          of the term t, within the particular document d.
in any information retrieval system. The role of              Thus we have the term frequency tf(t,d).
ranking algorithms is to select the documents that            The inverse document frequency is a measure of
are most likely be able to satisfy the user needs and         whether the term is common or rare across all
bring them in top positions.                                  documents. It is obtained by dividing the total
Some of the common used ranking methods in                    number of documents by the number of documents
cross language information retrieval are discussed            containing     the    term,    and    then   taking
below. Michael Sperriosu [8] compared the Okapi               the logarithm of that quotient is given by equation
BM25 model and Language Modelling (LM)                        (4).
algorithm says that on a very shallow level Okapi
outperforms LM. By reviewing simple TFIDF                                                                   (4)
based ranking algorithm may not result in effective
CLTR systems for Indian language Queries [9].                      with
4.1 Okapi BM25 model                                              |D|: Cardinality of D, or the total number of
Consider a user query Q = {q1, q2,....qn} and                     documents in the corpus
document D, the BM25 score of the document D is                                       : Number of documents
as given in (1) and (2):                                          where the term appears (i.e.,tf(t,d)≠0 ). If the
                                                                  term is not in the corpus, this will lead to a
                                                        (1)       division-by-zero. It is therefore common to
                                                                  adjust the formula to
                                                                  .
                                                        (2)       Then the tf*idf is calculated using equation
                                                                  (5).



                                                                                                   1039 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                   (IJERA)             ISSN: 2248-9622       www.ijera.com
                     Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                                                         NTCIR Asian Language Retrieval, and Text
                                                 (5)
                                                         Retrieval Evaluation Conference (TREC) etc. In
4.4 Page rank algorithm
                                                         this paper, different translations and their various
 Page Rank is a probability distribution used to
                                                         approaches with merits and demerits are stated
represent the likelihood that a person randomly
                                                         clearly. The most widely used ranking methods in
clicking on links will arrive at any particular page.
                                                         Information Retrieval are discussed and Cross
PageRank can be calculated for collections of
                                                         Language Text retrieval in different languages in
documents of any size. A probability is expressed
                                                         different    domain     is    compared.      Through
as a numeric value between 0 and 1. Hence, a
                                                         investigation of previous research work most of the
PageRank of 0.5 means there is a 50% chance that
                                                         paper carried out query translation and some
a person clicking on a random link will be directed
                                                         researchers have used hybrid approach to achieve
to the document with the 0.5 PageRank. In the
                                                         query translation and attain acceptable results. In
general case, the PageRank value for any
                                                         most of research works, document translation is not
page „u‟ can be expressed in equation (6):
                                                         feasible because of the size of translation. Based on
                                              (6)        the review, the Okapi BM25 ranking model slightly
                                      ,                  outperformed LM algorithm and simple TFIDF
i.e. the Page Rank value for a page u is dependent       model is not suitable for Indian language queries.
on the Page Rank values for each page v contained        Looking at the statistics of languages on the
in the set Bu (the set containing all pages linking to   Internet it seems that there is a huge market for
page u), divided by the number L(v) of links from        cross-language information retrieval products.
page v.
                                                         REFERENCES
5. EVALUATION MEASURES                                     [1]    Tao zhang and Yue-Jie zhang, “Research
Precision and Recall are used to evaluate the                     on     Chinese-English     CLIR”     2008
effectiveness of the CLIR system.                                 International conference on machine
                                                                  learning and cybernetics, volume-5.
5.1 Precision                                              [2]    Fatiha Sadat et al, “Cross language
         It is defined as the number of relevant                  Information Retrieval Via Dictionary-
documents retrieved by a search divided by the                    based and Stastistics-based Methods”
total number of documents retrieved by that search                ,Conference       on      Communication,
is given in equation (7). Precision takes all                     Computer and Signal Processing ,2001.
retrieved documents into account, but it can also be       [3]    Peter schauble, “CLEF 2000 State of Art:
evaluated at a given cut-off rank, considering only               Multilingual     Information     Access”,
the topmost results.                                              Eurospider Information Technology AG
                                               (7)
                                                           [4]    Sanjay Kumar Dwivedi et al,” Machine
                                                                  Translation     System       in     Indian
                                                                  Perspectives”, Journal of Computer
5.2 Recall
                                                                  Science, 2010.
         It is defined as the number of relevant
                                                           [5]    Mustafa Abusalah.., “Literature Review of
documents retrieved by a search divided by the
                                                                  Cross language Information Retrieval”,
total number of existing relevant documents is
                                                                  World Academy of science, engineering
given in equation (8). Recall in IR is the fraction of
                                                                  and technology,2005.
the documents that are relevant to the query that are
                                                           [6]    Erbug Celebi, Baturman Sen, Burak
successfully retrieved. It can be looked as the
                                                                  Gunel, “Turkish – English Cross Language
probability that a relevant document is retrieved by
                                                                  Information    Retrieval    using     LSI”
the query.
                                                                  International Symposium on Computer
                                                (8)               and Information Sciences, 2009.
                                                           [7]    Saraswathi et al, “BiLingual Information
                                                                  Retrieval System for English and Tamil”,
                                                                  Journal of Computing, April 2010.
The comparison of various translations and
                                                           [8]    Michael Speriosu et al, “Comparision of
techniques adopted for different languages in
                                                                  Okapi BM25 ansd Language Modeling
different domain are given in table 3.
                                                                  Algorithm for NTCIR-6”,Manuscript, Just
                                                                  Systems Corporation, September 2006.
6. CONCLUSION
         In the past five years, research in Cross         [9]    Sivaji Bandyopadhyay, Tapabrata Mondal,
Lingual Information Access has been vigorously                    Sudip Kumar Naskar, Asif Ekbal,
pursued through several international forums, such                Rejwanul     Haque,    Srinivasa   Rao
as, the Cross-Language Evaluation Forum (CLEF),


                                                                                             1040 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                               (IJERA)             ISSN: 2248-9622       www.ijera.com
                                 Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                        Godavarthy, “Bengali, Hindi and Telugu                     2007 “Advances in Multilingual and
                        to English Ad-hoc bilingual task at CLEF                   Multimodal Information Retrieval.

             Table 3: Comparison of different translations and their approaches in CLTR.
Authors    Languag Docume        Domain Indexing           Translatio Trans-        Query              Ranking        Results for CLIR
           e query     nt                   Unit           n            literation expansio            /Retrieval
                       language                                                     n
Tao        Chinese     English   AP         Word           Bilingual    No          synonym            MIT‟s          MAP
zhang      and                   Newswi based              dictionary               dictionar          method,        E-EIR: 0.3187
and        English               re 88-90 segmentatio                               y                  probabilisti   C-ECLIR: 0.2833
Yue-Jie                                     n and n-                                                   c method
zhang                                       gram based                                                 and
                                            approach                                                   automatic
                                                                                                       relevance
                                                                                                       feedback
Anurag     English       Hindi       Newspa     No            Shabdanjal     No           No           Monolingu      Average
Seetha     and hindi                 pers                     i bilingual                              al retrieval   Precision:Monolingua
et al                                2003-                    dictionary                               system         l:0.5318
                                     2004                                                                             CLIR:0.3446
Vivek      English       English     Allahab    Stemming      Dictionary     Hash         No           Vector         Change in the values
Pemawa     and hindi     and hindi   ad         and    stop   database       datastruct                based          of precision and
t et al                              museum     word                         ure                       model and      recall as number of
                                                removal                      mapping                   Google         documents increases.
                                                                                                       API(docum
                                                                                                       ent
                                                                                                       translation)
Sivaji     Bengali,      English     Los        Porter        Bilingual      modified     yes          TFIDF          The system performs
Bandyo     Hindi                     Angeles    Stemmer,n-    dictionary     joint                     model          best for the Telugu
padhyay    and                       Times      gram                         source-                                  followed by Hindi
et al      telugu                    of 2002    indexing                     channel                                  and Bengali.
                                                and zonal                    model
                                                indexing
Zhang      English       Chinese     Random     No            Bilingual      No           No           Vector         The proposed method
Xiao-fei                             chinese                  parallel                                 space          outperforms purely
et al                                web                      corpus                                   model          dictionary      based
                                     pages                                                                            baseline by 14.8%
Fatiha     French        English     TREC       Porter        Bilingual      No           Interactiv   NAMAZU         Different combination
Sadat et   and                       volume     stemmer       dictionary,                 e            retrieval      of query expansion
al         English                   1 and 2    and    stop   statistic                   relevance    system         before     and   after
                                     collecti   words         based                       feedback,                   translation with an
                                     on                       method                      a Domain                    OR operator shown a
                                                              and                         Feedback                    best average precision
                                                              Parallel                    and                         with     99.13%     of
                                                              Corpora.                    similarity                  monolingual
                                                                                          thesaurus                   performance.
                                                                                          .
Jagadee    Hindi,        English     Los        Stop word     Bilingual      yes          no           Language       CLIR performance:
sh         tamil,telu                Angeles    removal       statistical                              Modeling       73% of monolingual
Jagarla    gu,Benga                  Times      and porter    dictionary                                              system
mudi et    li    and                            stemmer       and word
al         Marathi                                            by word
                                                              translation
Prasad     Hindi         English     Los        Lucene        TFIDF          yes          no           Vector         Hybrid       Boolean
Pingali    and                       Angeles    Framework     algorithm                                based          formulation improves
et al      telugu                    Times                    in                                       ranking        ranking of documents
                                     2002                     combinatio                               using    a
                                                              n       with                             variant of
                                                              Bilingual                                TFIDF



                                                                                                                1041 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                                 (IJERA)             ISSN: 2248-9622       www.ijera.com
                                   Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                                                             dictionary                              ranking
                                                                                                     algorithm
Erbug        Turkish    Turkish    Skylife    Porter         Parallel      No            No          Latent         It    increases   the
Celebi       and        and        Magazi     stemmer        corpus                                  Semantic       retrieval performance
et al        English    English    ne         and                                                    indexing       3 times when the
                                              Longest-                                                              direct matching is
                                              match                                                                 considered.
                                              stemming
                                              algorithm
Moham        Arabic     English    Standar    Porter         ALKAFI        No            No          AIRE           In      this    method
med                                d TREC     stemmer        Machine                                 search         description field is
Aljlayl                            -7 and 9   and K-stem     Translatio                              engine         more effective than
et al                              collecti   algorithm      n system                                               title and narrative.
                                   ons
Ari          Finnish    English    TREC       K-stem and     Medical       No            No          INQUERY        This method able to
Prikola      and                   medicin    TWOL           dictionary                              retrieval      achieve performance
             English               e    and   morphologi     and                                     system         level of monolingual
                                   health     cal analyser   general                                                system, if the queries
                                   related                   dictionary                                             are structured.
                                   topics
Antony       English    Kannada(   Indian     Segmentati     Alligned      Sequence      no          no             Comparison         with
P.J, et al              target     place      on,Romani      parallel      labelling                                Google       Indic    –
                        word)      names      zation         corpus        method,S                                 Transliteration
                                                                           VM                                       proposed model gives
                                                                           kernel                                   better results
Pattabhi     Tamil      English    The        Tamil          Bilingual     Statistical   WordNet     Okapi          MAP:0.3980 Recall
R.K Rao      and                   telegrap   Morphologi     dictionary    method        and         BM25           precision:0.3742
T and        English               h          cal                                        Descripti
Sobha.                                        analyzer,Lu                                on Field
L                                             cene
                                              Indexer and
                                              porter
                                              stemmer
Manoj        Hindi ,    English    Los        Stemmer        Bilingual     Devanaga      No          Okapi          MAP      Monolingual
kumar        Marathi               Angles     and            dictionary    ri       to               BM25           IR:0.4402
Chinnak      and                   Times      Morphologi                   English                                  Hindi             to
otla et al   English               2002       cal                          Transliter                               English:0.2952    &
                                              Analyzer                     ation                                    Marathi           to
                                                                                                                    English:0.2163
Saravan      Tamil,En   English    The        Porter         Probabilist   Machine       No          Language       MAP      Monolingual
an et al     glish,                telegrap   Stemmer        ic lexicon    Transliter                Modeling       IR:0.5133
             Hindi                 h          and            and           ation,                                   Hindi to Eng:0.4977
                                              Alignment      Parallel      Transliter                               &       Tamil     to
                                              Model          Corpora &     ation                                    Eng:0.4145
                                                             Bilingual     Similarity
                                                             dictionary    Model
B.Manik      Tamil      English    Random     Stopword       Bilingual     yes           No          Summarizat     It finds the efficient
andan                              webpag     removal        dictionary                              ion            strategy to implement
and                                es         and                                                    technique      query translation. In
Dr.R.Sh                                       Stemmer                                                               future the algorithm
riram                                                                                                               will be tested for
                                                                                                                    more parameters.
R.Shrira     Tamil      English    On sales   Stop word      Lexicon       no            WordNet     Data           The           proposed
m and                              system     list   and     and                                     mining         approach      performs
Vijayan                                       Porter         Ontology                                methods(ca     slightly higher than
Suguma                                        Stemmer                                                tegorization   traditional approach.
ran                                                                                                  and
                                                                                                     aggregation



                                                                                                              1042 | P a g e
P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications
                                (IJERA)             ISSN: 2248-9622       www.ijera.com
                                  Vol. 2, Issue 5, September- October 2012, pp.1036-1043
                                                                                                   )




S.Then      Tamil       English   Agricult   Morphologi     Machine        Named        WordNet    No          MAP:95%            of
mozhi       and                   ure        cal            Translatio     Entity                              monolingual system
and         English                          Analyzer,      n,             Recogniz
C.Aravi                                      POS tagger     Bilingual      er
ndan                                                        dictionary
                                                            and Local
                                                            word
                                                            reordering
Saraswa     Tamil       Tamil     Festival   Tamil          Machine        No           Ontology   Page rank   Tamil Increased by
thi et al   and         and                  morphologi     Translatio                                         60%.          English
            English     English              cal            n                                                  increased by 40%
                                             analyzer,
                                             POS tagger
                                             and porter
                                             stemmer
Chaware     Hindi,      English   Shoppin    Text      to   No             Char by      No         -           Efficiency depends
and         Gujarathi             g mall     phonetic                      char, char                          on minimum number
Srikanth    and                              algorithm                     to ASCII                            of keys to be mapped.
a Rao       Marathi                                                        mapping
Nikolao     Greek       Greek     Hellenic                  Parallel       no           no         LSI-SOM     It performs very well
s           and         and       history                   corpus                                             on experiments
Ampazis     English     English   on the
et al                             Internet
                                  website
Michel      French      English   Hansard    No             Machine        no           no         LSI         It performs quite well
L.Littma    and         and       collecti                  translation-                                       in CL-LSI.
n et al     English     English   on                        LSI,
                                  1986-                     Crosslangu
                                  1989(C                    age-LSI
                                  anadian
                                  parliam
                                  ent
                                  proceed
                                  ings)




                                                                                                         1043 | P a g e

More Related Content

What's hot

WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...ijnlc
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
 
NL Context Understanding 23(6)
NL Context Understanding 23(6)NL Context Understanding 23(6)
NL Context Understanding 23(6)IT Industry
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...IJECEIAES
 
Role of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesRole of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesDr. Amit Kumar Jha
 
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACH
K AMBA  P ART  O F  S PEECH  T AGGER  U SING  M EMORY  B ASED  A PPROACHK AMBA  P ART  O F  S PEECH  T AGGER  U SING  M EMORY  B ASED  A PPROACH
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACHijnlc
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...IJCI JOURNAL
 
Implementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large DictionaryImplementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large Dictionaryiosrjce
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...ijnlc
 
An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urduijaia
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
 
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...DR.P.S.JAGADEESH KUMAR
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
 
Software feasibility study to
Software feasibility study toSoftware feasibility study to
Software feasibility study toijdms
 
A survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document imagesA survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document imagesiosrjce
 
Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...journalBEEI
 
Review on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyReview on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyjournalBEEI
 
Traid-bit embedding process on Arabic text steganography method
Traid-bit embedding process on Arabic text steganography methodTraid-bit embedding process on Arabic text steganography method
Traid-bit embedding process on Arabic text steganography methodjournalBEEI
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATAijnlc
 

What's hot (20)

WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
 
NL Context Understanding 23(6)
NL Context Understanding 23(6)NL Context Understanding 23(6)
NL Context Understanding 23(6)
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...
 
Role of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesRole of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languages
 
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACH
K AMBA  P ART  O F  S PEECH  T AGGER  U SING  M EMORY  B ASED  A PPROACHK AMBA  P ART  O F  S PEECH  T AGGER  U SING  M EMORY  B ASED  A PPROACH
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACH
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
 
Implementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large DictionaryImplementation of Marathi Language Speech Databases for Large Dictionary
Implementation of Marathi Language Speech Databases for Large Dictionary
 
Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...Robust extended tokenization framework for romanian by semantic parallel text...
Robust extended tokenization framework for romanian by semantic parallel text...
 
An unsupervised approach to develop ir system the case of urdu
An unsupervised approach to develop ir system  the case of urduAn unsupervised approach to develop ir system  the case of urdu
An unsupervised approach to develop ir system the case of urdu
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
Speech Processing and Audio feedback for Tamil Alphabets using Artificial Neu...
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
 
Software feasibility study to
Software feasibility study toSoftware feasibility study to
Software feasibility study to
 
A survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document imagesA survey on Script and Language identification for Handwritten document images
A survey on Script and Language identification for Handwritten document images
 
Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...Analysis review on feature-based and word-rule based techniques in text stega...
Analysis review on feature-based and word-rule based techniques in text stega...
 
Review on feature-based method performance in text steganography
Review on feature-based method performance in text steganographyReview on feature-based method performance in text steganography
Review on feature-based method performance in text steganography
 
Traid-bit embedding process on Arabic text steganography method
Traid-bit embedding process on Arabic text steganography methodTraid-bit embedding process on Arabic text steganography method
Traid-bit embedding process on Arabic text steganography method
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
 

Viewers also liked

Viewers also liked (20)

Fn2510161020
Fn2510161020Fn2510161020
Fn2510161020
 
Ev25907913
Ev25907913Ev25907913
Ev25907913
 
Fa25939942
Fa25939942Fa25939942
Fa25939942
 
Energia txostena ondo
Energia txostena ondoEnergia txostena ondo
Energia txostena ondo
 
Escola sustentável
Escola sustentávelEscola sustentável
Escola sustentável
 
Procurador del estado
Procurador del estadoProcurador del estado
Procurador del estado
 
Apresentação reunião pública junho 2011
Apresentação reunião pública   junho 2011Apresentação reunião pública   junho 2011
Apresentação reunião pública junho 2011
 
Solucao anpad rq_set_2007
Solucao anpad rq_set_2007Solucao anpad rq_set_2007
Solucao anpad rq_set_2007
 
Linux.softwarelibres
Linux.softwarelibresLinux.softwarelibres
Linux.softwarelibres
 
Speech apresentação 3 t11
Speech   apresentação 3 t11Speech   apresentação 3 t11
Speech apresentação 3 t11
 
Rendicion de cuentas_periodo_2013_1
Rendicion de cuentas_periodo_2013_1Rendicion de cuentas_periodo_2013_1
Rendicion de cuentas_periodo_2013_1
 
Nos otros mi amor...
Nos otros mi amor...Nos otros mi amor...
Nos otros mi amor...
 
Task 5
Task 5Task 5
Task 5
 
Comunidades Virtuais
Comunidades VirtuaisComunidades Virtuais
Comunidades Virtuais
 
Unid1 ativ1 4-celsopf
Unid1 ativ1 4-celsopfUnid1 ativ1 4-celsopf
Unid1 ativ1 4-celsopf
 
Iii. psicología
Iii. psicologíaIii. psicología
Iii. psicología
 
закруткин
закруткинзакруткин
закруткин
 
Teorias do curriculo
Teorias do curriculoTeorias do curriculo
Teorias do curriculo
 
Clase 1
Clase 1Clase 1
Clase 1
 
Acuerdos académicos
Acuerdos académicosAcuerdos académicos
Acuerdos académicos
 

Similar to Cross Language Text Retrieval techniques and approaches review

A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALA SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALIJCI JOURNAL
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrievaldannyijwest
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESijcseit
 
A Survey Of Current Datasets For Code-Switching Research
A Survey Of Current Datasets For Code-Switching ResearchA Survey Of Current Datasets For Code-Switching Research
A Survey Of Current Datasets For Code-Switching ResearchJim Webb
 
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDMarathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDIJERA Editor
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...Scott Bou
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageIJERA Editor
 
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...IOSR Journals
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMcscpconf
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir systemcsandit
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingEfficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingIOSR Journals
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMINTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMijcsa
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phoneseSAT Journals
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phoneseSAT Publishing House
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...IJECEIAES
 

Similar to Cross Language Text Retrieval techniques and approaches review (20)

A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALA SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrieval
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
 
A SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUESA SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUES
 
C1803021622
C1803021622C1803021622
C1803021622
 
A Survey Of Current Datasets For Code-Switching Research
A Survey Of Current Datasets For Code-Switching ResearchA Survey Of Current Datasets For Code-Switching Research
A Survey Of Current Datasets For Code-Switching Research
 
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDMarathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu Language
 
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir system
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingEfficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And Recording
 
INTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAMINTELLIGENT QUERY PROCESSING IN MALAYALAM
INTELLIGENT QUERY PROCESSING IN MALAYALAM
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
An optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phonesAn optimized approach to voice translation on mobile phones
An optimized approach to voice translation on mobile phones
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...
 

Recently uploaded

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 

Cross Language Text Retrieval techniques and approaches review

  • 1. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Cross Language Text Retrieval: A Review P.ISWARYA*, Dr.V.RADHA** *(Department of Computer Science, Avinashilingam Institute for Home science and Higher Education for Women, Coimbatore-43) ** (Department of Computer science, Avinashilingam Institute for Home science and Higher Education for Women, Coimbatore-43) ABSTRACT The World Wide Web has evolved into a information should be easily accessible and tremendous source of information and it digestible. With 100 million internet users, India is at continues to grow at exponent rate. Now a day’s 3 rd place globally in usage of internet. Though the web servers are storing different types of contents network shrank the globe, the language in different languages and their usage is diversification is a great barrier to attain full benefit increasing rapidly. According to Online of the digital life. Hence there is a need to develop a Computer Library Center, English is still a technique like Cross Language Text Retrieval which dominant language in the Web. Only small is used to retrieve text documents in a language other percentage of population is familiar with English than the user used to specify the query. Therefore language and they can express their queries in Internet is no longer monolingual and non English English to access the content in a right way. Due contents are accessed rapidly. There are three to globalization, content storage and retrieval different types of AdHoc CLTR which are as must be possible in all languages, which is follows: essential for developing nations like India.  Monolingual Information Retrieval System – Diversity of languages is becoming great barrier refers to Information Retrieval System that to understand and enjoy the benefits in digital can find relevant documents in the same world. Cross Language Information Retrieval language as the query was expressed. (CLIR) is a subfield of Information retrieval;  Bilingual Information Retrieval System that user can retrieve the objects in a language allows you to querying in one language and different from the language of query expressed. finding documents in another language. These objects may be text documents, passages,  Multilingual Information Retrieval System- audio or video and images. Cross Language Text allows you to make query in one language Retrieval (CLTR) is used to return text and able to find documents in multiple documents in a language other than query languages. language. CLTR technique allows crossing the This paper continues to focus on following sections: language barrier and accessing the web content Section 2 explains about different translation types in in an efficient way. This paper reviews some types CLTR and Section 3 about how these translations of translations carried out in CLTR, ranking can be carried out using different approaches. methods used in retrieval of documents and some Section 4 states about various methods used for of their related works. It also discusses about ranking the results while retrieving the documents. various approaches and their evaluation Section 5 deals with previous work in CLTR. Finally measures used in various applications. conclusion is presented in Section 6. Keywords- Ambiguity, Bilingual, 2. TYPE OF TRANSLATION IN CLTR Disambiguation, Monolingual, Multilingual, The most important issue in Cross Transliteration. Language Text Retrieval is that queries and documents are in different languages. When the user 1. INTRODUCTION pose a query in one language, either query or The number of Web users accessing the document or both translation takes place. These Internet become increasing day to day because translations are done using free text and controlled people can access any kind of required information vocabulary approaches. The following Fig1 shows at anytime. Information Retrieval (IR) mainly refers the overview of CLTR system. to a process that the user can find required information or knowledge from corpus including different kinds of information [1]. Information Retrieval is the fact that there is vast amount of garbage that surrounds any useful information; such 1036 | P a g e
  • 2. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Cross Language Text Retrieval using Machine Translation, Bilingual dictionary and corpus resource. Query translation is simpler than translation of whole documents and it is cost Query Translation Document translation efficient too. The performance of the system heavily depends on how the query is translated effectively. But there are several complexities in achieving good Controlled Vocabulary Free Text query translation they are translation ambiguity, missing terminology, idioms and compound words and untranslatable terms. The overall process Machine involved in Query translation Process is shown in Dictionary Ontology Corpus-based Fig 2. Translation Based Based 2.2 Document Translation Translation can be done in other way by translating Comparable Monolingual Parallel the documents into the language of query and this corpora corpora Corpora document translation achieved manually or through various machine translation systems. Translating 400 Figure 1: Overview of CLTR system million non-English web pages of World Wide Web would take 100,000 days (300 years) in one fast PC 2.1 Query Translation or 1 month in 3600 PC [3]. However, it is an Usually the query is translated into the expensive job to be done once for each query language of target collection of documents. The first language and most importantly the quality of the step involves parsing the natural language query translation will be much better because documents specified by the user in their native language. The provide much more context for a machine translation given query sentence is segmented and indexed using program to work with. In particular, when it comes Morphological analyzer, Part Of Speech tagging, to minority languages, the cost becomes almost Stemming and Stop word removal. unbearable. 3. VARIOUS APPROACHES FOR QUERY WordNet/S ynonymn AND DOCUMENT TRANSLATIONS User query Query Expansion As mentioned above the query and document can be translated using the following Ontology different techniques such as Controlled vocabulary, Input processing free text based or a combination of multiple system Translation/ Machine techniques. The controlled vocabulary is the first and translation traditional technique widely used in libraries and POS tagging Transliteration documentation centres. Documents are indexed Corpus manually using fixed terms which are also used for Bilingual queries. However, this approach remains limited to application whose vocabulary is still manageable. Information dictionary The efficiency and effectiveness degrade radically Retrieval when size of vocabulary increases. The alternate approach/way to controlled vocabulary is to use the words which appear in the Figure2: Major Steps in Query Translation documents themselves as the vocabulary, such Approach systems are referred as free text retrieval systems. Normally user query is short hence ambiguity 3.1 Machine Translation approach problem arises; to overcome this drawback the query Machine Translation (MT) systems that can be expanded using Word Net/Ontology. Fatiha investigates the use of software to translate text or sadat [2] states that a combination of query speech from one natural language to another. The expansions before and after query translation will main idea of MT system improve the precision of Information Retrieval as well as recall. Translation of query can be done 1037 | P a g e
  • 3. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Table1: Overview of machine Translation system Systems Year Organization/Institute Domain Anusaaraka 1995 IIT kanpur Children stories Mantra 1999 C-DAC,Bangalore Rajya sabha Secretariat (official circulars) Matra 2004 C-DAC,Mumbai News stories Angla Bharti 1991 IIT kanpur Customization is to carry out a translation without aid of human 3.3 Corpus based approach assistance. Every Machine translation system A Corpus is a repository of a collection of requires programs for translation, automated natural language material, it analyzes large dictionaries and grammars to support translation. collections of existing texts (corpora) and There are different types of MT exists they are automatically extracts the information needed on based on direct MT method, Interlingua, transfer which the translation will be based. The aligned method and Empirical machine translation corpus contains text samples in one language and approach. In India machine Translation systems their translations into other language are aligned have been developed for translation from English to sentence by sentence, word by word, document by Indian languages [4] is shown in table 1. document or even character by character. Corpus Document translation using MT system is may contain same language contents from same expensive and it is not suitable for large collections domain called Monolingual corpora or unaligned and possibly many query languages. Query corpora. translation using MT system does not context for A parallel corpus is a collection which accurate translation, it is inadequate for good may contain documents and their translations in CLTR. more than one language; they are translation- 3.2 Dictionary-based approach equivalent pairs. Actually a source language query Dictionary approach is used to translate is not translated; it can be matched against the the query, the basic idea in dictionary-based cross source language component of bilingual parallel language text retrieval is to replace each term in the corpus. Then target language component aligned to query with an appropriate term or set of terms in it can be easily retrieved. Parallel corpora can be the desired language. CLTR depends on the quality created using human translation, websites in more and coverage of dictionary, in manually created than one language or using MT methods. “Spider” bilingual dictionary has good quality but poor systems have been developed to collect documents coverage. Dictionaries are electronic versions of that have translation equivalents over the internet to printed dictionaries and may be general dictionaries produce corpora [5]. The alignment process can be or specific domain dictionaries or a combination of done by comparing documents by the presence of both. The major problems of dictionary based indicators and to construct a parallel corpus is very approach are translation ambiguity, out-of- difficult because they require more formal vocabulary terms, word inflection and phrase arrangements. The indicator could be an author identification. The below Fig 3 shows the overall name, document date, source, special names in the process in dictionary based translation. document, numbers or acronyms, in fact anything which clearly corresponds in both the source and target language texts. Erbug celebi [6] used Bilingual bilingual parallel corpus consists of 1056 Turkish User Translation Query dictionary and 1056 English parallel documents for their experiment sample English to Turkish document shown below table 2. A comparable corpus is a document Translated query in target collection in which documents are aligned based on language the similarity between the topics which they address; they are content-equivalent document Figure 3: Dictionary based query translation pairs. Corpora is hard to maintain, it tend to be domain/application dependent to achieve effective performance. 1038 | P a g e
  • 4. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Table2: Example Document English to where f(qi,D) is the term frequency of qi in D, |D| is Turkish parallel corpus length of document D, k1 & b are free parameters to be set, avgdl is the average length of document in corpus, N is the total no. of documents in collection, n(qi) is the number of documents containing qi. 4.2 Language modeling algorithm The likelihood that a given document d will generate a given query q is used to score candidate documents, and is given in equation (3) (3) 3.4 Ontology-based approach Ontology defines the basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary. Ontology is an explicit specification of a conceptualization Ontology‟s can be implemented in translation systems to extract conceptual relations for monolingual and cross language IR. 4.3 TFIDF model Sarawathi [7] used Ontological tree for The tfidf weight (term frequency–inverse their analysis and keyword retrieval, any number document frequency) is a numerical statistic which languages can be used without restriction. It reflects how important a word is to a document in a requires only single mapping from any language to collection or corpus. The term count in the given any other language. It can also used for other document is simply the number of times a purposes such as keywords language identification given term appears in that document. This count is and sub keyword extraction. usually normalized to prevent a bias towards longer documents (which may have a higher term count 4. RANKING METHODS IN regardless of the actual importance of that term in INFORMATION RETRIEVAL the document) to give a measure of the importance An efficient ranking algorithm is essential of the term t, within the particular document d. in any information retrieval system. The role of Thus we have the term frequency tf(t,d). ranking algorithms is to select the documents that The inverse document frequency is a measure of are most likely be able to satisfy the user needs and whether the term is common or rare across all bring them in top positions. documents. It is obtained by dividing the total Some of the common used ranking methods in number of documents by the number of documents cross language information retrieval are discussed containing the term, and then taking below. Michael Sperriosu [8] compared the Okapi the logarithm of that quotient is given by equation BM25 model and Language Modelling (LM) (4). algorithm says that on a very shallow level Okapi outperforms LM. By reviewing simple TFIDF (4) based ranking algorithm may not result in effective CLTR systems for Indian language Queries [9]. with 4.1 Okapi BM25 model |D|: Cardinality of D, or the total number of Consider a user query Q = {q1, q2,....qn} and documents in the corpus document D, the BM25 score of the document D is : Number of documents as given in (1) and (2): where the term appears (i.e.,tf(t,d)≠0 ). If the term is not in the corpus, this will lead to a (1) division-by-zero. It is therefore common to adjust the formula to . (2) Then the tf*idf is calculated using equation (5). 1039 | P a g e
  • 5. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 NTCIR Asian Language Retrieval, and Text (5) Retrieval Evaluation Conference (TREC) etc. In 4.4 Page rank algorithm this paper, different translations and their various Page Rank is a probability distribution used to approaches with merits and demerits are stated represent the likelihood that a person randomly clearly. The most widely used ranking methods in clicking on links will arrive at any particular page. Information Retrieval are discussed and Cross PageRank can be calculated for collections of Language Text retrieval in different languages in documents of any size. A probability is expressed different domain is compared. Through as a numeric value between 0 and 1. Hence, a investigation of previous research work most of the PageRank of 0.5 means there is a 50% chance that paper carried out query translation and some a person clicking on a random link will be directed researchers have used hybrid approach to achieve to the document with the 0.5 PageRank. In the query translation and attain acceptable results. In general case, the PageRank value for any most of research works, document translation is not page „u‟ can be expressed in equation (6): feasible because of the size of translation. Based on (6) the review, the Okapi BM25 ranking model slightly , outperformed LM algorithm and simple TFIDF i.e. the Page Rank value for a page u is dependent model is not suitable for Indian language queries. on the Page Rank values for each page v contained Looking at the statistics of languages on the in the set Bu (the set containing all pages linking to Internet it seems that there is a huge market for page u), divided by the number L(v) of links from cross-language information retrieval products. page v. REFERENCES 5. EVALUATION MEASURES [1] Tao zhang and Yue-Jie zhang, “Research Precision and Recall are used to evaluate the on Chinese-English CLIR” 2008 effectiveness of the CLIR system. International conference on machine learning and cybernetics, volume-5. 5.1 Precision [2] Fatiha Sadat et al, “Cross language It is defined as the number of relevant Information Retrieval Via Dictionary- documents retrieved by a search divided by the based and Stastistics-based Methods” total number of documents retrieved by that search ,Conference on Communication, is given in equation (7). Precision takes all Computer and Signal Processing ,2001. retrieved documents into account, but it can also be [3] Peter schauble, “CLEF 2000 State of Art: evaluated at a given cut-off rank, considering only Multilingual Information Access”, the topmost results. Eurospider Information Technology AG (7) [4] Sanjay Kumar Dwivedi et al,” Machine Translation System in Indian Perspectives”, Journal of Computer 5.2 Recall Science, 2010. It is defined as the number of relevant [5] Mustafa Abusalah.., “Literature Review of documents retrieved by a search divided by the Cross language Information Retrieval”, total number of existing relevant documents is World Academy of science, engineering given in equation (8). Recall in IR is the fraction of and technology,2005. the documents that are relevant to the query that are [6] Erbug Celebi, Baturman Sen, Burak successfully retrieved. It can be looked as the Gunel, “Turkish – English Cross Language probability that a relevant document is retrieved by Information Retrieval using LSI” the query. International Symposium on Computer (8) and Information Sciences, 2009. [7] Saraswathi et al, “BiLingual Information Retrieval System for English and Tamil”, Journal of Computing, April 2010. The comparison of various translations and [8] Michael Speriosu et al, “Comparision of techniques adopted for different languages in Okapi BM25 ansd Language Modeling different domain are given in table 3. Algorithm for NTCIR-6”,Manuscript, Just Systems Corporation, September 2006. 6. CONCLUSION In the past five years, research in Cross [9] Sivaji Bandyopadhyay, Tapabrata Mondal, Lingual Information Access has been vigorously Sudip Kumar Naskar, Asif Ekbal, pursued through several international forums, such Rejwanul Haque, Srinivasa Rao as, the Cross-Language Evaluation Forum (CLEF), 1040 | P a g e
  • 6. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 Godavarthy, “Bengali, Hindi and Telugu 2007 “Advances in Multilingual and to English Ad-hoc bilingual task at CLEF Multimodal Information Retrieval. Table 3: Comparison of different translations and their approaches in CLTR. Authors Languag Docume Domain Indexing Translatio Trans- Query Ranking Results for CLIR e query nt Unit n literation expansio /Retrieval language n Tao Chinese English AP Word Bilingual No synonym MIT‟s MAP zhang and Newswi based dictionary dictionar method, E-EIR: 0.3187 and English re 88-90 segmentatio y probabilisti C-ECLIR: 0.2833 Yue-Jie n and n- c method zhang gram based and approach automatic relevance feedback Anurag English Hindi Newspa No Shabdanjal No No Monolingu Average Seetha and hindi pers i bilingual al retrieval Precision:Monolingua et al 2003- dictionary system l:0.5318 2004 CLIR:0.3446 Vivek English English Allahab Stemming Dictionary Hash No Vector Change in the values Pemawa and hindi and hindi ad and stop database datastruct based of precision and t et al museum word ure model and recall as number of removal mapping Google documents increases. API(docum ent translation) Sivaji Bengali, English Los Porter Bilingual modified yes TFIDF The system performs Bandyo Hindi Angeles Stemmer,n- dictionary joint model best for the Telugu padhyay and Times gram source- followed by Hindi et al telugu of 2002 indexing channel and Bengali. and zonal model indexing Zhang English Chinese Random No Bilingual No No Vector The proposed method Xiao-fei chinese parallel space outperforms purely et al web corpus model dictionary based pages baseline by 14.8% Fatiha French English TREC Porter Bilingual No Interactiv NAMAZU Different combination Sadat et and volume stemmer dictionary, e retrieval of query expansion al English 1 and 2 and stop statistic relevance system before and after collecti words based feedback, translation with an on method a Domain OR operator shown a and Feedback best average precision Parallel and with 99.13% of Corpora. similarity monolingual thesaurus performance. . Jagadee Hindi, English Los Stop word Bilingual yes no Language CLIR performance: sh tamil,telu Angeles removal statistical Modeling 73% of monolingual Jagarla gu,Benga Times and porter dictionary system mudi et li and stemmer and word al Marathi by word translation Prasad Hindi English Los Lucene TFIDF yes no Vector Hybrid Boolean Pingali and Angeles Framework algorithm based formulation improves et al telugu Times in ranking ranking of documents 2002 combinatio using a n with variant of Bilingual TFIDF 1041 | P a g e
  • 7. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 dictionary ranking algorithm Erbug Turkish Turkish Skylife Porter Parallel No No Latent It increases the Celebi and and Magazi stemmer corpus Semantic retrieval performance et al English English ne and indexing 3 times when the Longest- direct matching is match considered. stemming algorithm Moham Arabic English Standar Porter ALKAFI No No AIRE In this method med d TREC stemmer Machine search description field is Aljlayl -7 and 9 and K-stem Translatio engine more effective than et al collecti algorithm n system title and narrative. ons Ari Finnish English TREC K-stem and Medical No No INQUERY This method able to Prikola and medicin TWOL dictionary retrieval achieve performance English e and morphologi and system level of monolingual health cal analyser general system, if the queries related dictionary are structured. topics Antony English Kannada( Indian Segmentati Alligned Sequence no no Comparison with P.J, et al target place on,Romani parallel labelling Google Indic – word) names zation corpus method,S Transliteration VM proposed model gives kernel better results Pattabhi Tamil English The Tamil Bilingual Statistical WordNet Okapi MAP:0.3980 Recall R.K Rao and telegrap Morphologi dictionary method and BM25 precision:0.3742 T and English h cal Descripti Sobha. analyzer,Lu on Field L cene Indexer and porter stemmer Manoj Hindi , English Los Stemmer Bilingual Devanaga No Okapi MAP Monolingual kumar Marathi Angles and dictionary ri to BM25 IR:0.4402 Chinnak and Times Morphologi English Hindi to otla et al English 2002 cal Transliter English:0.2952 & Analyzer ation Marathi to English:0.2163 Saravan Tamil,En English The Porter Probabilist Machine No Language MAP Monolingual an et al glish, telegrap Stemmer ic lexicon Transliter Modeling IR:0.5133 Hindi h and and ation, Hindi to Eng:0.4977 Alignment Parallel Transliter & Tamil to Model Corpora & ation Eng:0.4145 Bilingual Similarity dictionary Model B.Manik Tamil English Random Stopword Bilingual yes No Summarizat It finds the efficient andan webpag removal dictionary ion strategy to implement and es and technique query translation. In Dr.R.Sh Stemmer future the algorithm riram will be tested for more parameters. R.Shrira Tamil English On sales Stop word Lexicon no WordNet Data The proposed m and system list and and mining approach performs Vijayan Porter Ontology methods(ca slightly higher than Suguma Stemmer tegorization traditional approach. ran and aggregation 1042 | P a g e
  • 8. P.Iswarya, Dr.V.Radha / International Journal Of Engineering Research And Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 5, September- October 2012, pp.1036-1043 ) S.Then Tamil English Agricult Morphologi Machine Named WordNet No MAP:95% of mozhi and ure cal Translatio Entity monolingual system and English Analyzer, n, Recogniz C.Aravi POS tagger Bilingual er ndan dictionary and Local word reordering Saraswa Tamil Tamil Festival Tamil Machine No Ontology Page rank Tamil Increased by thi et al and and morphologi Translatio 60%. English English English cal n increased by 40% analyzer, POS tagger and porter stemmer Chaware Hindi, English Shoppin Text to No Char by No - Efficiency depends and Gujarathi g mall phonetic char, char on minimum number Srikanth and algorithm to ASCII of keys to be mapped. a Rao Marathi mapping Nikolao Greek Greek Hellenic Parallel no no LSI-SOM It performs very well s and and history corpus on experiments Ampazis English English on the et al Internet website Michel French English Hansard No Machine no no LSI It performs quite well L.Littma and and collecti translation- in CL-LSI. n et al English English on LSI, 1986- Crosslangu 1989(C age-LSI anadian parliam ent proceed ings) 1043 | P a g e