Savita C. Mayanale Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 5, Issue 6, ( Part -3) June 2015, pp.86-91
www.ijera.com 86 | P a g e
Marathi-English CLIR using detailed user query and
unsupervised corpus-based WSD
Savita C. Mayanale*, Ms. S. S. Pawar**
* (M.E. Student, Dept. of Computer Engg., D. Y. Patil College of Engg., Akurdi, Savitribai Phule Pune
University, India.)
** (Asst. Prof., Dept. of Computer Engg., D. Y. Patil College of Engg., Akurdi, Savitribai Phule Pune
University, India.)
ABSTRACT
With the rapid growth of multilingual information on the Internet, Cross Language Information Retrieval (CLIR) is
becoming a necessity. It lets users query in their native language and retrieve information in other
languages. However, CLIR performs poorly compared with monolingual retrieval because of lexical ambiguity,
mismatched query terms and out-of-vocabulary words. In this paper, we propose an algorithm for
improving the performance of a Marathi-English CLIR system. The system first finds possible translations of the input
query in the target language, disambiguates them and then submits the English queries to a search engine for relevant
document retrieval. Disambiguation is based on an unsupervised corpus-based method which uses an English
dictionary as an additional resource. The experiment is performed on the FIRE 2011 (Forum of Information Retrieval
Evaluation) dataset using the “Title” and “Description” fields as inputs. The experimental results show that the proposed
approach improves the recall and precision of the Marathi-English CLIR system.
Keywords - Morphological Analyser, CLIR, Dictionary-based Approach, word sense disambiguation.
I. INTRODUCTION
Information Retrieval (IR) is the process by which
users find the information or knowledge they need
in a corpus containing different kinds of
information. A traditional IR system is monolingual.
However, as the amount of available information
grows rapidly, situations in which a user must
query a multilingual document collection are
becoming increasingly common. This makes
information acquisition difficult, and language
barriers become a serious problem. As a result,
CLIR has received more research attention and is
increasingly being used to retrieve information on
the Internet.
India has many languages, which are written
in different scripts. Most people in India speak
their regional language, whereas a vast amount of
information on the Internet is available in English.
The number of users whose first language is not
English is significantly high. Such users find it
difficult to express a query in appropriate
English. Proficiency in English thus becomes a
barrier to the rich sources of information
available on the World Wide Web. CLIR helps bridge
this gap: CLIR systems allow users to query in
their native language and get relevant documents
in other languages.
Translation in CLIR can be achieved in two
ways: query translation or document translation
[16]. In query translation, the given query is
translated from the native language to the target
language and the search is then performed to get
the relevant documents. In document translation,
all the documents in the corpus are translated into
the native language. The user then queries in the
native language and the search returns relevant
documents in the native language. Of the two,
query translation is easier than document
translation because far less text has to be
translated [8]. The drawback of query translation
is that the query is normally short, so ambiguity
may arise.
These translation methods can be further
classified into three classes according to the
resources used to cross the language boundary:
Machine Translation (MT), dictionary-based and
parallel-corpora-based. In [1], CLIR experiments
were conducted using Google Translator for
translating queries to study the effectiveness of
CLIR using MT for query translation. In [5] and
[13], CLIR systems were implemented using
bilingual dictionaries. Parallel or comparable
corpora are useful resources for extracting
information for CLIR [14].
The performance of a CLIR system depends mainly
on accurate translation of the user's input query.
However, a user's query often contains
ambiguous terms, which results in the retrieval of
irrelevant documents along with relevant
documents. This reduces the overall precision. To
improve retrieval effectiveness, Word Sense
Disambiguation (WSD) is used. WSD approaches are
classified into two major categories depending on
the corpora they use. Supervised approaches need
sense-tagged corpora for training the model,
whereas unsupervised approaches work on untagged
corpora. There are also semi-supervised approaches
which work with a small amount of annotated seed
data. Another way to classify WSD approaches is by
their use of knowledge resources. Approaches using
knowledge resources such as WordNet, ontologies
and thesauri are termed knowledge-based
approaches, while approaches which work without
any of these resources are termed knowledge-lean
or corpus-based approaches.
II. RELATED WORK
Kazuaki Kishida [2] in 2004 reviewed
techniques and methods for enhancing the
effectiveness of cross-language information
retrieval. The review covered research issues such
as (i) matching strategies and translation
techniques, (ii) methods for solving the problem of
translation ambiguity, (iii) formal models for CLIR
such as application of the language model, (iv) the
pivot language approach, (v) methods for searching
multilingual document collections, (vi) techniques
for combining multiple language resources, etc.
Zhang Tao and Yue-Jie Zhang in 2007 and 2008
implemented two systems based on the work in the
CLIR evaluation task at the 9th Text Retrieval
Conference (TREC-9): English-Chinese [3] and
Chinese-English [4]. Machine-readable dictionaries
are used as the main knowledge source to acquire
correct translations in both directions.
Chinnakotla Manoj Kumar et al. [5] in 2008
developed a Marathi-English CLIR system which uses
bilingual dictionaries and the query translation
approach. A simple rule-based approach is used to
transliterate the query words which are not found in
the dictionary. The resulting multiple
translation/transliteration choices for each query
word are disambiguated using an iterative page-rank
style algorithm to produce the final translated query.
Almeida Ashish and Bhattacharyya P. [6] in 2008
studied the effects of lexical analysis on Marathi
monolingual search over the news domain corpus of
FIRE-2008 and observed the effect of processes
such as lemmatization, inclusion of suffixes in
indexing and stop word elimination on retrieval
performance.
Sethuramalingam S et al. [7] in 2009 conducted
CLIR experiments between three languages which
use writing systems (scripts) of Brahmi origin,
namely Hindi, Bengali and Marathi. The similarity
of the writing systems used for these Indian
languages can be used effectively to improve CLIR
and also to overcome, to an extent, the problems of
textual variations, textual errors, and even the lack
of linguistic resources like stemmers. The
results were significantly improved for all six
language pairs using a method for fuzzy text search
based on Surface Similarity. Nasharuddin et al. [8]
in 2010 reviewed recent research on
cross-lingual information retrieval and
its role in research directions, including new
models and paradigms in the wide area of
information retrieval. In [9], various approaches to
CLIR were explained in detail.
Kumar Sourabh [10] in 2013 conducted an
extensive literature review of developments
in India in machine translation systems and Cross
Language IR. This survey paper covers ongoing
developments in CLIR and MT with respect to
status of the projects, current projects, participants,
existing projects, government efforts, funding and
financial aid, Twelfth Five Year Plan (2012-2017)
projections and Eleventh Five Year Plan (2007-2012)
activities. Arora P. et al. [11] in 2013 described
details of DCU’s participation in the Cross
Language !ndian News Story Search task (CL!NSS)
at FIRE 2013. This task is an edition of the
PAN@FIRE task which focuses on news
story linking between English and Indian languages.
H. B. Patil et al. [12] in 2014 demonstrated a
rule-based Part-of-Speech tagger for the Marathi
language. The tagger uses hand-constructed rules,
some learned from a corpus and some added manually
after studying the grammar of the Marathi language.
Disambiguation is done by analyzing the linguistic
features of the word, its following word, its preceding
word etc. S. Varshney and J. Bajpai [13] in 2014
proposed an algorithm for improving the
performance of an English-Hindi CLIR system. They
generated all possible combinations of the Hindi
translated query, using transliteration of English
query terms, and chose the best query among them for
retrieval of documents. The experiment was performed
on the FIRE 2010 (Forum of Information Retrieval
Evaluation) datasets.
Pratibha Bajpai et al. [14] in 2014 analyzed the
work of various researchers in the area of Indian
language CLIR. They also presented a prototype
for English to Hindi CLIR and discussed issues
related to English to Hindi translation.
The authors tested 30 queries manually using the
suggested prototype and found that the precision
level is quite good. Alessio Bosca et al. [15] in 2014
presented a CLIR system exploiting multilingual
ontologies for enriching documents’ representation
with multilingual semantic information during the
indexing phase and also for mapping query
fragments to concepts during the retrieval phase.
This system was applied to a domain-specific
document collection and the contribution of the
ontologies to the CLIR system was evaluated in
conjunction with the use of both Google and
Microsoft Bing translation services. Results
demonstrated that the use of domain-specific
resources leads to a significant improvement of
CLIR system performance.
Sandhan [17] is a monolingual search system
developed under the Technology Development for
Indian Languages (TDIL) programme in the tourism
domain for five Indian languages, viz. Bengali,
Marathi, Hindi, Tamil and Telugu. In this system,
the user can submit a query using the
INSCRIPT or phonetic layout. It processes the query
based on its language and retrieves results from the
respective language. A summary is generated for each
retrieved document, which helps the user
learn the basic information about the overall
content of the document without opening it.
III. PROPOSED METHODOLOGY
Marathi is one of the widely spoken languages
in India, especially in the state of Maharashtra.
Marathi uses the Devanagari script and draws its
vocabulary mainly from Sanskrit [5]. The
architecture of our Marathi-English CLIR system is
shown in Fig. 1. The system is based on the query
translation approach, as it is not feasible to
translate the documents. The input to the system is
the Marathi topic and description given in the FIRE
2011 topic set. As Marathi is morphologically rich,
like other Indian languages, the query words must
be stemmed to obtain the Marathi root words. The
query is translated into English using a bilingual
dictionary, and words not found in the dictionary
are transliterated using a simple mapping from
source-language characters to target-language
characters. All possible combinations of
translations are then disambiguated and given to
the search engine.
Figure 1 System Architecture
A. Preprocessing of query
The system takes a Marathi query as input.
This query is first pre-processed to remove stop
words and to obtain root words. Stop words are
commonly used words that a search engine has been
programmed to ignore, both when indexing and
when retrieving documents for a search query
(in English, for example, the, is, at, which, on). The
query words are then stemmed before their entries
are looked up in the bilingual dictionary; this step
is called Morphological Analysis. A simple lookup
approach is used both to remove stop words and to
obtain root words.
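The lookup-based preprocessing described above can be sketched as follows. This is a minimal illustration, not the system's actual resources: the stop-word set `STOP_WORDS` and the root table `ROOT_LOOKUP` are small hypothetical stand-ins for the real stop-word list and the morphological analyser.

```python
# Hypothetical toy resources; the real system would load a full Marathi
# stop-word list and the output of a morphological analyser.
STOP_WORDS = {"आणि", "या", "मध्ये"}
ROOT_LOOKUP = {
    "भारतातील": "भारत",      # inflected form -> root word
    "दिल्लीमधील": "दिल्ली",
}


def preprocess(query: str) -> list[str]:
    """Split a query on whitespace, drop stop words, map words to roots."""
    tokens = query.split()
    content_words = [t for t in tokens if t not in STOP_WORDS]
    # Words without an entry in the root table are kept unchanged.
    return [ROOT_LOOKUP.get(t, t) for t in content_words]
```

Keeping unknown words unchanged means a missing root entry degrades gracefully: the word can still be looked up in the dictionary or transliterated later.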
B. Query Translation
Translation of a query from one language to another
is known as query translation. Query
translation is a crucial step in a CLIR system because
the main problems, such as mismatched query terms,
ambiguity and poor retrieval performance, originate
in this step. The system uses a machine-readable
bilingual Marathi-English dictionary for query
translation. A bilingual dictionary maps words and
phrases of one language to their equivalents in
another, so a query can be translated by replacing
each word with the equivalent words listed in the
dictionary.
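A dictionary lookup of this kind might look like the sketch below. The dictionary contents are a hypothetical two-entry excerpt (the two senses of जलद are the ones used in the paper's own example); the function name `translate_word` is an assumption for illustration.

```python
# Hypothetical excerpt of a Marathi-English bilingual dictionary.
# A word may map to several English candidates (its possible senses).
BILINGUAL_DICT = {
    "जलद": ["fast", "cloud"],
    "ट्रेन": ["train"],
}


def translate_word(root_word: str) -> list[str]:
    """Return every English candidate for a Marathi root word.

    An empty list signals an out-of-vocabulary word, which the caller
    can then pass to the transliteration step.
    """
    return list(BILINGUAL_DICT.get(root_word, []))
```

Returning all candidates rather than the first one keeps the ambiguity visible, so that a later disambiguation step can choose among them.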
C. Transliteration
Transliteration is the process of converting text
from one script to another. Several schemes are
available for transliterating Devanagari script into
Latin characters, such as ITRANS and IAST. If a
translation for a query word is not found in the
dictionary, the word is assumed to be a proper noun
and is therefore transliterated by the
Marathi-English transliteration module. The module
is based on a simple mapping approach which returns
the appropriate English character transliteration
for each Marathi character.
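A character-mapping transliterator can be sketched as below. The mapping tables are a partial, hypothetical subset chosen for illustration, and the handling of the inherent vowel (add "a" after a consonant unless a vowel sign or virama follows, and drop it at the end of the word) is a simplifying assumption rather than the module's documented behaviour.

```python
# Partial, illustrative mapping tables (assumed, not the paper's).
CONSONANTS = {
    "क": "k", "ग": "g", "ज": "j", "ट": "t", "ड": "d", "त": "t",
    "द": "d", "ध": "dh", "न": "n", "प": "p", "ब": "b", "भ": "bh",
    "म": "m", "य": "y", "र": "r", "ल": "l", "व": "v", "स": "s", "ह": "h",
}
VOWELS = {"अ": "a", "आ": "a", "इ": "i", "ई": "i", "उ": "u", "ए": "e", "ओ": "o"}
VOWEL_SIGNS = {"ा": "a", "ि": "i", "ी": "i", "ु": "u", "ू": "u", "े": "e", "ो": "o"}
VIRAMA = "्"      # suppresses the inherent vowel of a consonant
ANUSVARA = "ं"    # nasal sign, rendered here simply as "n"


def transliterate(word: str) -> str:
    """Map each Devanagari character to Latin characters."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # Inherent "a" unless a vowel sign or virama follows,
            # or the consonant ends the word.
            if nxt and nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == ANUSVARA:
            out.append("n")
        elif ch != VIRAMA:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)
```

Under these assumptions, `transliterate("गोध्रा")` yields `godhra`, the form used in the paper's worked example.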
D. Word Sense Disambiguation
In this paper, we propose a new unsupervised
corpus-based WSD algorithm which uses the similarity
between the senses of the target word and a context
vector. It also uses an English dictionary as an
additional resource to find synonyms for the
correctly sensed word so that more relevant
documents can be retrieved. The sense of the target
word is taken to be the one for which the similarity
between gloss and context vector is greatest. WSD is
used to detect the correct English sense of the
Marathi word. This module makes use of the English
Barcelona corpus and the English dictionary
available at [19]. The Barcelona corpus is
preprocessed by removing stop words and is then
converted to an Excel database for easy
manipulation of the corpus.
Let us consider the query “गोध्रा जलद ट्रेन”. After
preprocessing, translation and transliteration, we get
two meanings for the word जलद from the bilingual
dictionary: “fast” and “cloud”. So two translations can
be formed, first “godhra fast train” and second
“godhra cloud train”. Initially “godhra fast train” is
fed to the WSD module. The WSD module
searches for each individual word of the input
translation in the Excel corpus. If “train” is found in one
of the rows, that row is locked and a further
search is made for “godhra” and “fast” in the locked
row. If “fast” is found in the locked row, then
“fast” is marked as the correct sense. After this, “fast” is
looked up in the English-English dictionary to
get its synonyms. For the word “fast”, synonyms such
as rapid and speedy are retrieved. Expanding the
query with synonyms of the correctly sensed word
allows more relevant documents to be retrieved.
The same procedure is carried out for “godhra
cloud train”, but no correct sense is obtained for this
translation in the Excel corpus. Finally three translated
queries are submitted to the search engine: godhra
fast train, godhra rapid train and godhra speedy train.
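The row-locking procedure in this example can be sketched in a few lines. The corpus rows and the synonym table below are hypothetical toy data standing in for the Excel corpus and the English dictionary, and the function names are assumptions; the sketch accepts a candidate sense when it co-occurs in one corpus row with at least one other query word.

```python
# Hypothetical toy data: each corpus row is the set of words left in
# one corpus entry after stop-word removal.
CORPUS_ROWS = [
    {"fast", "train", "station", "passengers"},
    {"dark", "cloud", "rain", "sky"},
]
SYNONYMS = {"fast": ["rapid", "speedy"]}


def sense_supported(candidate, context, rows):
    """True if the candidate shares a corpus row with any context word."""
    return any(
        candidate in row and any(word in row for word in context)
        for row in rows
    )


def disambiguate(words, slot, candidates, rows=CORPUS_ROWS, synonyms=SYNONYMS):
    """Build final queries for the ambiguous position `slot`.

    `words` is the translated query with a placeholder at `slot`;
    `candidates` are the possible English senses for that position.
    """
    context = [w for i, w in enumerate(words) if i != slot]
    queries = []
    for cand in candidates:
        if not sense_supported(cand, context, rows):
            continue  # no corpus evidence for this sense
        # Keep the supported sense and expand it with its synonyms.
        for variant in [cand] + synonyms.get(cand, []):
            query = list(words)
            query[slot] = variant
            queries.append(" ".join(query))
    return queries
```

With the toy data, `disambiguate(["godhra", None, "train"], 1, ["fast", "cloud"])` reproduces the three queries of the example: the "cloud" reading is dropped because no row contains it together with "godhra" or "train".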
E. Monolingual English IR
The disambiguated translations of the
Marathi query are submitted to the English IR
engine. Apache Lucene is used for indexing the
input documents and searching the queries over the
target collection. Lucene creates an inverted index.
Normally, a document is mapped to the terms it
contains. Lucene does the reverse: it creates an
index from each term to the list of documents
containing the term, which makes searching faster.
Lucene combines the Boolean model (BM) of
Information Retrieval with the Vector Space Model
(VSM) of Information Retrieval: documents
“approved” by BM are scored by VSM [20].
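The idea of an inverted index with a Boolean filter followed by a vector-space style score can be illustrated with a small pure-Python sketch. This is not Lucene's API or its actual scoring formula: the AND semantics of the Boolean step and the simple tf * idf weight are assumptions made to keep the example short.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Return doc ids matching all query terms, best score first."""
    terms = query.lower().split()
    # Boolean step: keep only documents that contain every query term.
    postings = [set(index.get(t, {})) for t in terms]
    matched = set.intersection(*postings) if postings else set()
    # Scoring step: sum of tf * idf over the query terms.
    scores = {
        d: sum(index[t][d] * math.log(1 + n_docs / len(index[t])) for t in terms)
        for d in matched
    }
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "godhra fast train fire",
    "d2": "fast train fast train timetable",
    "d3": "cloud cover over godhra",
}
index = build_index(docs)
```

Here `index["train"]` lists only the documents containing "train", so a query never has to scan documents that cannot match.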
The proposed algorithm is given below that
shows the step by step process of Marathi-English
CLIR.
Algorithm 1 Marathi-English CLIR by combining
translation and transliteration
Input: Query and description in Marathi script
Output: Relevant English documents for the given
query
1: Remove stop words from the user query and get the
   root words in Marathi
2: for each root word do
3:    Translate into English using the Marathi-
      English bilingual dictionary
4:    if not found in dictionary then
5:       Transliterate into English
6: Make all possible translations in English using
   steps 3 and 5
7: Disambiguate all translations for each word
8: Submit the disambiguated queries to the English
   search engine
9: Display all relevant English documents retrieved
   to the user
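Steps 2 to 6 of Algorithm 1 can be sketched with `itertools.product`, which enumerates one choice per query word. The dictionary and the transliteration table below are hypothetical toy stand-ins, and `candidate_queries` is an assumed name, not part of the described system.

```python
from itertools import product


def candidate_queries(root_words, bilingual_dict, transliterate):
    """Return every English query formed by one option per root word."""
    per_word_options = []
    for root in root_words:
        options = bilingual_dict.get(root)
        if not options:
            # Steps 4-5: dictionary miss, fall back to transliteration.
            options = [transliterate(root)]
        per_word_options.append(options)
    # Step 6: Cartesian product over the per-word options.
    return [" ".join(combo) for combo in product(*per_word_options)]


# Hypothetical toy resources for the example query.
toy_dict = {"जलद": ["fast", "cloud"], "ट्रेन": ["train"]}
toy_translit_table = {"गोध्रा": "godhra"}


def toy_transliterate(word):
    """Stand-in for the transliteration module: a fixed lookup."""
    return toy_translit_table[word]
```

The number of candidate queries is the product of the number of senses per word, so the disambiguation step that follows is what keeps the set of submitted queries small.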
IV. DATASET AND RESOURCES
A. Document Collection
The FIRE (Forum of Information Retrieval
Evaluation) 2011 dataset is an English document
collection consisting of news articles from “The
Telegraph, Calcutta Edition” from 2001-2010. For
this system we have considered news articles from
the years 2004-07. The dataset consists of a set of
user queries with “Title”, “Description” and
“Narrative” fields, a set of documents and “qrel”
files which give a list of relevant documents for
each query. The system uses Marathi queries to
retrieve English documents. For the proposed
system, we used the “Title” and “Description”
fields of the queries. The dataset is freely
available online at [18].
B. CFILT Resources
The system uses various resources created by the
Center for Indian Language Technologies (CFILT):
a bilingual Marathi-English dictionary for
translation, a morphological analyzer to get the root
words in Marathi, and the Barcelona corpus and an
English dictionary for word sense disambiguation.
All resources are available online at [19].
C. Apache Lucene
The proposed system uses the open source Apache
Lucene [20] search engine library as its English IR
engine, i.e. for indexing the input documents and
searching for relevant documents over the target
collection.
V. EXPERIMENTAL RESULTS
We have developed two systems, CLIR1 and
CLIR2, both Marathi-English CLIR systems based on
the query translation approach. Both systems take a
query in Marathi, which is translated into
English using the bilingual dictionary after
preprocessing; words which are not found in the
dictionary are transliterated. The final query is given
to the search engine to retrieve relevant documents in
English. The first system has no WSD module,
whereas the second system contains WSD. In
addition to the input query, the second system
takes two more inputs: a description of the query
and the domain in which the system should retrieve
results. TABLE I shows statistics of the data
collection used for implementation. TABLE II shows
results for the 15 queries tested on the proposed
system; Figures 2 and 3 plot the recall and
precision results.
TABLE I Statistics of Data Collection
SN   Metrics                                    CLIR
1    Query Language                             Marathi
2    Document Language                          English
3    No. of queries                             5
4    No. of Documents                           1,20,379
5    Size of Collection                         288MB
6    Avg. No. of relevant documents per query   15
TABLE II Precision and Recall Results
SN    Query    CLIR1 Recall    CLIR1 Precision    CLIR2 Recall    CLIR2 Precision
Q1    स्वाईन फ्लू लस    1    0.03    1    0.03
Q2    गोध्रा ट्रेन आक्रमण    0.617021    0.145    0.851064    0.2
Q3    मायकल जॅक्सनचा अकाली मृत्यू    1    0.025    1    0.025
Q4    बराक ओबामाचा विजय    0.785714    0.055    1    0.07
Q5    अबू घारिब जेलमधील छळ    0.685714    0.12    0.857143    0.15
Q6    मनोरंजनाच्या जगातील चाचेगिरी    0.90625    0.145    0.9375    0.15
Q7    भारतीय महिला आरक्षण विधेयक    0.666667    0.05    0.866667    0.065
Q8    सोमाली समुद्री चाच्यांचा पूर्णपणे पराभव    1    0.055    1    0.055
Q9    क्लोन केलेल्या मानवी मूलांचा जन्म    1    0.055    1    0.055
Q10   बेकायदेशीर झाडे तोडणे    0.486486    0.18    0.675676    0.25
Q11   भोपाळ वायू दुर्घटना    0.961538    0.25    1    0.26
Q12   बेनजीर भुट्टो यांची हत्या    0.647059    0.055    0.882353    0.075
Q13   भारतातील सायबर गुन्हा    0.961538    0.125    0.961538    0.125
Q14   दिल्लीमधील राष्ट्रकुल खेळ    0.666667    0.06    0.888889    0.08
Q15   बिल गेट्सचे परोपकारी प्रयत्न    1    0.025    1    0.025
Figure 2 Recall Results
Figure 3 Precision Results
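For reference, recall and precision for a single query can be computed from the retrieved list and the set of relevant documents given in the qrel file. The sketch below uses the standard set-based definitions; the document identifiers in the usage note are made up for illustration and are not taken from the FIRE data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a query retrieves the hypothetical documents d1 to d4 and the relevant set is {d1, d3, d9}, two of the four retrieved documents are relevant and two of the three relevant documents are found, giving a precision of 0.5 and a recall of about 0.67.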
VI. CONCLUSION
Cross language IR is a technique for searching
documents across languages, and
it can serve as a baseline for searching not only
between two languages but also across multiple
languages. In this paper, we have proposed a new
approach to word sense disambiguation for a
Marathi-English CLIR system which makes use of a
corpus and a dictionary in Excel format. In addition
to the query, the user can provide a description and
a domain for the query. The approach of using a
detailed query performed well: on each of the 15
test queries in TABLE II, the system with WSD
(CLIR2) matched or exceeded the recall and
precision of the system without WSD (CLIR1).
In future, we would like to explore alternative
search engines and scoring functions instead of the
standard Lucene scoring function.
Acknowledgements
We would like to thank the publishers and
researchers for making their resources available and
our teachers for their guidance. We are thankful to
the authorities of Savitribai Phule Pune University for
their constant guidance and support. We also thank
the college authorities for providing the required
infrastructure and support. Lastly, we would like to
extend heartfelt gratitude to the friends and family
members who have inspired, guided and assisted us
in performing this work.
REFERENCES
[1] D. Wu and D. He, “A study of query translation
using google machine translation system,” in
Computational Intelligence and Software
Engineering (CiSE), 2010 International
Conference on, pp. 1–4, IEEE, 2010.
[2] K. Kishida, “Technical issues of cross-language
information retrieval: a review,” Information
processing & management, vol. 41, no. 3, pp.
433–455, 2005.
[3] Zhang Yue-Jie and Tao Zhang, “Research on
English-Chinese Cross-Language Information
Retrieval,” International Conference on
Machine Learning and Cybernetics. IEEE, 2007.
[4] Zhang Tao and Yue-Jie Zhang, “Research on
Chinese-English Cross-Language Information
Retrieval,” International Conference on
Machine Learning and Cybernetics. IEEE, 2008.
[5] D. O. P. Chinnakotla Manoj Kumar, Ranadive
Sagar and B. Pushpak, “Hindi to english and
marathi to english cross language information
retrieval evaluation,” in Advances in
Multilingual and Multimodal Information
Retrieval, pp. 111–118, Springer, 2008.
[6] A. Almeida and P. Bhattacharyya, “Using
morphology to improve Marathi monolingual
information retrieval,” FIRE Working Note,
2008.
[7] S. Subramaniam, A. K. Singh, P. Dasigi, and V.
Varma, “Experiments in clir using fuzzy string
search based on surface similarity,” in
Proceeding of the 32nd international ACM
SIGIR conference on Research and
development in information retrieval, pp. 682–
683, ACM, 2009.
[8] N. A. Nasharuddin, “Cross-lingual information
retrieval state-of-the-art,” electronic Journal of
Computer Science and Information Technology
(eJCSIT), vol. 2, no. 1, pp. 1–5, 2010.
[9] Nasharuddin, Nurul Amelina, et al, “A review
on the cross-lingual information retrieval,”
Information Retrieval and Knowledge
Management, (CAMP), 2010 International
Conference. IEEE, 2010.
[10] K. Sourabh, “An extensive literature review on
clir and mt activities in india,” International
Journal of Scientific and Engineering Research,
2013.
[11] A. Piyush, J. Foster, and G. J. Jones, “Dcu at
fire 2013: cross-language !ndian news story
search,” 2013.
[12] H.B. Patil, A.S. Patil and B.V. Pawar, “Part-of-Speech
Tagger for Marathi Language using
Limited Training Corpora,” International
Journal of Computer Applications, 2014.
[13] S. Varshney and J. Bajpai, “Improving
performance of english-hindi cross language
information retrieval using transliteration of
query terms,” arXiv preprint arXiv:1401.3510,
2014.
[14] Pratibha Bajpai and Dr. Parul Verma, “Cross
language information retrieval in indian
language perspective,” IJRET: International
Journal of Research in Engineering and
Technology, 2014.
[15] Alessio Bosca, M. Casu, M. Dragoni, and C. D.
Francescomarino, “Using Semantic and
Domain-Based Information in CLIR Systems,”
The Semantic Web: Trends and Challenges.
Springer International Publishing, 2014.
[16] S. Mayanale and S. Pawar, “Survey of clir and
mt systems in marathi language,” International
Journal of Computational Linguistics and
Natural Language Processing, 2015 (in press).
[17] Sandhan [Online]. Available:
http://tdil-dc.in/index.php?option=comcontent&view=article&id=66,
http://tdil-dc.in/Sandhan/locale.jsp?hi
[18] Fire data collection [Online]. Available:
http://www.isical.ac.in/clia/data.html
[19] Cfilt resources [Online]. Available:
http://www.cfilt.iitb.ac.in/Downloads.html
[20] Apache lucene core [Online]. Available:
http://lucene.apache.org/core/

C.s & c.m
 
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATIONMACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION
 
Error Analysis of College Students' Sentences (SLA Final Exam)
Error Analysis of College Students' Sentences (SLA Final Exam)Error Analysis of College Students' Sentences (SLA Final Exam)
Error Analysis of College Students' Sentences (SLA Final Exam)
 
A New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic TextA New Concept Extraction Method for Ontology Construction From Arabic Text
A New Concept Extraction Method for Ontology Construction From Arabic Text
 
Hybrid approaches for automatic vowelization of arabic texts
Hybrid approaches for automatic vowelization of arabic textsHybrid approaches for automatic vowelization of arabic texts
Hybrid approaches for automatic vowelization of arabic texts
 
E1 geetha2 karthikeyan
E1 geetha2 karthikeyanE1 geetha2 karthikeyan
E1 geetha2 karthikeyan
 
Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia
Crowdsourcing in developing repository of phrase definition in Bahasa IndonesiaCrowdsourcing in developing repository of phrase definition in Bahasa Indonesia
Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia
 
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...
 
IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUE
IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUEIMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUE
IMPROVING RULE-BASED METHOD FOR ARABIC POS TAGGING USING HMM TECHNIQUE
 

Viewers also liked

A Design Technique To Reduce Nbti Effects From 5t Sram Cells
A Design Technique To Reduce Nbti Effects From 5t Sram CellsA Design Technique To Reduce Nbti Effects From 5t Sram Cells
A Design Technique To Reduce Nbti Effects From 5t Sram CellsIJERA Editor
 
한글시계웍샵_SW
한글시계웍샵_SW한글시계웍샵_SW
한글시계웍샵_SW영광 송
 
Pieza publicitaria de acuerdo a Barthes
Pieza publicitaria de acuerdo a BarthesPieza publicitaria de acuerdo a Barthes
Pieza publicitaria de acuerdo a BarthesDaniel Carpinteyro
 
Làm Sếp không chỉ là nghệ thuật
Làm Sếp không chỉ là nghệ thuậtLàm Sếp không chỉ là nghệ thuật
Làm Sếp không chỉ là nghệ thuậtNguyễn Huyền
 
Salarpuria Gold Summit
Salarpuria Gold SummitSalarpuria Gold Summit
Salarpuria Gold SummitBangalore Real
 
Desalination technology; economy and simplicity
Desalination technology; economy and simplicityDesalination technology; economy and simplicity
Desalination technology; economy and simplicityIJERA Editor
 
Vocab basic 2010
Vocab basic 2010Vocab basic 2010
Vocab basic 2010Shehr Yar
 
UFCD_4716_Princípios de gestão_índice
UFCD_4716_Princípios de gestão_índiceUFCD_4716_Princípios de gestão_índice
UFCD_4716_Princípios de gestão_índiceManuais Formação
 
Réveillon – uma história de crença sobre o
Réveillon – uma história de crença sobre oRéveillon – uma história de crença sobre o
Réveillon – uma história de crença sobre oPrivada
 
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...IJERA Editor
 
De thi cao hoc tieng anh trinh do b
De thi cao hoc tieng anh   trinh do bDe thi cao hoc tieng anh   trinh do b
De thi cao hoc tieng anh trinh do bTommy Bảo
 
Current and past country heads - Manu Melwin Joy
Current and past country heads - Manu Melwin JoyCurrent and past country heads - Manu Melwin Joy
Current and past country heads - Manu Melwin Joymanumelwinjoy2014
 
เหนือ เที่ยว
เหนือ   เที่ยวเหนือ   เที่ยว
เหนือ เที่ยวleemeanshun minzstar
 
Study of The Technological Profile of The Red Ceramic Industry of Alagoas
Study of The Technological Profile of The Red Ceramic Industry of AlagoasStudy of The Technological Profile of The Red Ceramic Industry of Alagoas
Study of The Technological Profile of The Red Ceramic Industry of AlagoasIJERA Editor
 

Viewers also liked (20)

A Design Technique To Reduce Nbti Effects From 5t Sram Cells
A Design Technique To Reduce Nbti Effects From 5t Sram CellsA Design Technique To Reduce Nbti Effects From 5t Sram Cells
A Design Technique To Reduce Nbti Effects From 5t Sram Cells
 
한글시계웍샵_SW
한글시계웍샵_SW한글시계웍샵_SW
한글시계웍샵_SW
 
Pieza publicitaria de acuerdo a Barthes
Pieza publicitaria de acuerdo a BarthesPieza publicitaria de acuerdo a Barthes
Pieza publicitaria de acuerdo a Barthes
 
841218 FCO's brief to Thatcher's visit to HK and China (18-21 December 1984) ...
841218 FCO's brief to Thatcher's visit to HK and China (18-21 December 1984) ...841218 FCO's brief to Thatcher's visit to HK and China (18-21 December 1984) ...
841218 FCO's brief to Thatcher's visit to HK and China (18-21 December 1984) ...
 
Làm Sếp không chỉ là nghệ thuật
Làm Sếp không chỉ là nghệ thuậtLàm Sếp không chỉ là nghệ thuật
Làm Sếp không chỉ là nghệ thuật
 
Salarpuria Gold Summit
Salarpuria Gold SummitSalarpuria Gold Summit
Salarpuria Gold Summit
 
Desalination technology; economy and simplicity
Desalination technology; economy and simplicityDesalination technology; economy and simplicity
Desalination technology; economy and simplicity
 
Unit 2
Unit 2Unit 2
Unit 2
 
Vocab basic 2010
Vocab basic 2010Vocab basic 2010
Vocab basic 2010
 
UFCD_4716_Princípios de gestão_índice
UFCD_4716_Princípios de gestão_índiceUFCD_4716_Princípios de gestão_índice
UFCD_4716_Princípios de gestão_índice
 
Réveillon – uma história de crença sobre o
Réveillon – uma história de crença sobre oRéveillon – uma história de crença sobre o
Réveillon – uma história de crença sobre o
 
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...
Hardware Implementation of Two’s Compliment Multiplier with Partial Product b...
 
Jinsei no senshu doc
Jinsei no senshu docJinsei no senshu doc
Jinsei no senshu doc
 
De thi cao hoc tieng anh trinh do b
De thi cao hoc tieng anh   trinh do bDe thi cao hoc tieng anh   trinh do b
De thi cao hoc tieng anh trinh do b
 
Current and past country heads - Manu Melwin Joy
Current and past country heads - Manu Melwin JoyCurrent and past country heads - Manu Melwin Joy
Current and past country heads - Manu Melwin Joy
 
840920 1000 UK Cabinet Meeting on Hong Kong (20 September 1984)
840920 1000 UK Cabinet Meeting on Hong Kong (20 September 1984)840920 1000 UK Cabinet Meeting on Hong Kong (20 September 1984)
840920 1000 UK Cabinet Meeting on Hong Kong (20 September 1984)
 
841218 Steering Brief from FCO to Thatcher (China 18-21 December 1984)
841218 Steering Brief from FCO to Thatcher (China 18-21 December 1984)841218 Steering Brief from FCO to Thatcher (China 18-21 December 1984)
841218 Steering Brief from FCO to Thatcher (China 18-21 December 1984)
 
Resume2
Resume2Resume2
Resume2
 
เหนือ เที่ยว
เหนือ   เที่ยวเหนือ   เที่ยว
เหนือ เที่ยว
 
Study of The Technological Profile of The Red Ceramic Industry of Alagoas
Study of The Technological Profile of The Red Ceramic Industry of AlagoasStudy of The Technological Profile of The Red Ceramic Industry of Alagoas
Study of The Technological Profile of The Red Ceramic Industry of Alagoas
 

Similar to Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD

Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Textskevig
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Textskevig
 
A decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageA decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageacijjournal
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemAriel Hess
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...Scott Bou
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmerkevig
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESijcseit
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESSTUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESijistjournal
 
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESMULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESmlaij
 
Attentive_fine-tuning_of_Transformers_for_Translat.pdf
Attentive_fine-tuning_of_Transformers_for_Translat.pdfAttentive_fine-tuning_of_Transformers_for_Translat.pdf
Attentive_fine-tuning_of_Transformers_for_Translat.pdfSatishBhalshankar
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALDICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALcsandit
 
Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...IJECEIAES
 
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTSAUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTSIRJET Journal
 

Similar to Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD (20)

Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
 
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language TextsRBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts
 
A decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageA decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri language
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval System
 
Viva
VivaViva
Viva
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmer
 
Hf3413291335
Hf3413291335Hf3413291335
Hf3413291335
 
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESMULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESSTUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES
 
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESMULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURES
 
Attentive_fine-tuning_of_Transformers_for_Translat.pdf
Attentive_fine-tuning_of_Transformers_for_Translat.pdfAttentive_fine-tuning_of_Transformers_for_Translat.pdf
Attentive_fine-tuning_of_Transformers_for_Translat.pdf
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALDICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
 
Ac04507168175
Ac04507168175Ac04507168175
Ac04507168175
 
Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...
 
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTSAUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
 

Recently uploaded

complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 

Recently uploaded (20)

complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 

Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD
Savita C. Mayanale, Int. Journal of Engineering Research and Applications, www.ijera.com, ISSN: 2248-9622, Vol. 5, Issue 6 (Part 3), June 2015, pp. 86-91
Savita C. Mayanale*, Ms. S. S. Pawar**
* (M.E. Student, Dept. of Computer Engg., D. Y. Patil College of Engg., Akurdi, Savitribai Phule Pune University, India.)
** (Asst. Prof., Dept. of Computer Engg., D. Y. Patil College of Engg., Akurdi, Savitribai Phule Pune University, India.)
ABSTRACT
With the rapid growth of multilingual information on the Internet, Cross Language Information Retrieval (CLIR) is becoming a necessity. It lets users query in their native language and retrieve information in any language. However, CLIR performs poorly compared to monolingual retrieval because of lexical ambiguity, mismatched query terms and out-of-vocabulary words. In this paper, we propose an algorithm for improving the performance of a Marathi-English CLIR system. The system first finds possible translations of the input query in the target language, disambiguates them and then submits the English queries to a search engine to retrieve relevant documents. The disambiguation is based on an unsupervised corpus-based method which uses an English dictionary as an additional resource. The experiment is performed on the FIRE 2011 (Forum for Information Retrieval Evaluation) dataset using the "Title" and "Description" fields as inputs. The experimental results show that the proposed approach improves the performance of the Marathi-English CLIR system with a good precision level.
Keywords - Morphological Analyser, CLIR, Dictionary-based Approach, word sense disambiguation.
I. INTRODUCTION
Information Retrieval (IR) refers to the process by which users find the information or knowledge they require in a corpus containing different kinds of information. A traditional IR system is monolingual.
However, with the rapidly growing amount of information available, situations in which a user must query a multilingual document collection are becoming increasingly common. This trend makes information acquisition harder, and language barriers become a serious problem. As a result, CLIR has received more research attention and is increasingly being used to retrieve information on the Internet.
India has multiple languages, written in different scripts. Most people in India speak their regional language, whereas a vast amount of the information on the Internet is available in English. The number of users whose first language is not English is significantly high, and such users find it difficult to express a query in appropriate English. Proficiency in English is thus becoming a barrier to reaching the rich sources of information available on the World Wide Web. CLIR helps bridge this gap: CLIR systems allow users to query in their native language and get relevant documents in other languages.
Translation in CLIR can be achieved in two ways: query translation or document translation [16]. In query translation, the given query is translated from the native language to the target language, and the search is then performed to get the relevant documents. In document translation, all the documents in the corpus are translated into the native language; the user asks the query in the native language, and the search returns relevant documents in the native language. Of the two, query translation is easier than document translation because much less text has to be translated [8]. The drawback of query translation is that the given query is normally short, and hence ambiguity problems may arise.
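The ambiguity that query translation introduces can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary with placeholder source terms (`term_a`, `term_b`) rather than real Marathi entries, and the function name `candidate_translations` is hypothetical; it is not the system described in this paper, only an illustration of how a short query fans out into several candidate English queries that a disambiguation step must then choose between.

```python
from itertools import product

# Hypothetical toy bilingual dictionary: placeholder source terms mapped to
# candidate English translations (illustrative only, not real Marathi data).
TOY_DICTIONARY = {
    "term_a": ["bank", "shore"],  # an ambiguous term with two senses
    "term_b": ["money"],          # an unambiguous term
}


def candidate_translations(query_terms, dictionary):
    """Return every English query formed by picking one translation per term.

    Terms missing from the dictionary (out-of-vocabulary words) are passed
    through unchanged.
    """
    options = [dictionary.get(term, [term]) for term in query_terms]
    return [" ".join(choice) for choice in product(*options)]


queries = candidate_translations(["term_a", "term_b"], TOY_DICTIONARY)
```

In this sketch a single ambiguous term doubles the number of candidate queries, which is why a short query with little context benefits from a disambiguation step before retrieval.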
These translation methods can be further classified into three classes according to the resources used to cross the language boundary: Machine Translation (MT), dictionary-based and parallel-corpora-based. In [1], CLIR experiments were conducted using Google Translator for translating queries, to study the effectiveness of CLIR using MT for query translation. In [5] and [13], CLIR systems were implemented using bilingual dictionaries. Parallel or comparable corpora are useful resources from which to extract beneficial information for CLIR [14]. The performance of a CLIR system mainly depends on accurate translation of the user's input query. However, the user's query often contains ambiguous terms, which results in the retrieval of irrelevant documents along with relevant
  • 2. documents. This reduces the overall precision. To improve the retrieval efficiency, Word Sense Disambiguation (WSD) is used. WSD approaches fall into two major categories depending on the corpora they use. Supervised approaches need sense-tagged corpora for training the model, whereas unsupervised approaches work on untagged corpora. There are also some semi-supervised approaches, which work with a small amount of annotated seed data. Another way to classify WSD approaches is based on the use of knowledge resources. Approaches that use knowledge resources such as WordNet, ontologies and thesauri are termed knowledge-based approaches, while approaches that work without any of these resources are termed knowledge-lean or corpus-based approaches. II. RELATED WORK Kazuaki Kishida [2] in 2004 reviewed techniques and methods for enhancing the effectiveness of cross-language information retrieval. The review covers research issues such as (i) matching strategies and translation techniques, (ii) methods for solving the problem of translation ambiguity, (iii) formal models for CLIR such as application of the language model, (iv) the pivot language approach, (v) methods for searching multilingual document collections, (vi) techniques for combining multiple language resources, etc. Zhang Tao and Yue-Jie Zhang in 2007 and 2008 implemented two systems based on work in the CLIR evaluation task at the 9th Text Retrieval Conference (TREC-9): English-Chinese [3] and Chinese-English [4]. Machine-readable dictionaries are used as the main knowledge source for acquiring correct translations for English-Chinese and Chinese-English. Chinnakotla Manoj Kumar et al.
[5] in 2008 developed a Marathi-English CLIR system which uses bilingual dictionaries and a query translation approach. A simple rule-based approach is used to transliterate the query words that are not found in the dictionary. The resulting multiple translation/transliteration choices for each query word are disambiguated using an iterative page-rank style algorithm to produce the final translated query. Almeida Ashish and Bhattacharyya P. [6] in 2008 studied the effects of lexical analysis on Marathi monolingual search over the news domain corpus of FIRE-2008 and observed the effect of processes such as lemmatization, inclusion of suffixes in indexing and stop word elimination on retrieval performance. Sethuramalingam S et al. [7] in 2009 conducted CLIR experiments between three languages that use writing systems (scripts) of Brahmi origin, namely Hindi, Bengali and Marathi. The similarity of the writing systems used for these Indian languages can be exploited to improve CLIR and also to overcome, to an extent, the problems of textual variations, textual errors and even the lack of linguistic resources such as stemmers. The results improved significantly for all six language pairs using a method for fuzzy text search based on Surface Similarity. Nasharuddin et al. [8] in 2010 reviewed recent research on cross-lingual information retrieval and its role in research directions that include new models and paradigms in the wider area of information retrieval. In [9], various approaches to CLIR were explained in detail. Kumar Sourabh [10] in 2013 conducted an extensive literature review of developments in India in machine translation systems and cross-language IR.
This survey paper covers ongoing developments in CLIR and MT with respect to the status of projects, current projects, participants, existing projects, government efforts, funding and financial aid, Twelfth Five Year Plan (2012-2017) projections and Eleventh Five Year Plan (2007-2012) activities. Arora P. et al. [11] in 2013 described DCU’s participation in the Cross Language !ndian News Story Search task (CL!NSS) at FIRE 2013. This task is an edition of the PAN@FIRE task, which focuses on news story linking between English and Indian languages. H. B. Patil et al. [12] in 2014 demonstrated a rule-based Part-of-Speech tagger for the Marathi language. The tagger is built from hand-constructed rules, some learned from a corpus and some added manually after studying Marathi grammar. Disambiguation is done by analyzing the linguistic features of the word, its following word, its preceding word, etc. S. Varshney and J. Bajpai [13] in 2014 proposed an algorithm for improving the performance of an English-Hindi CLIR system. They generated all possible combinations of the Hindi translated query using transliteration of English query terms and chose the best query among them for document retrieval. The experiment was performed on FIRE 2010 (Forum of Information Retrieval Evaluation) datasets. Pratibha Bajpai et al. [14] in 2014 analyzed the work of various researchers in the area of Indian language CLIR. They also presented a prototype for English-to-Hindi CLIR and discussed issues related to English-to-Hindi translation. The authors tested 30 queries manually using the suggested prototype and found the precision level to be quite good. Alessio Bosca et al. [15] in 2014 presented a CLIR system exploiting multilingual ontologies for enriching documents’ representation with multilingual semantic information during the indexing phase and for mapping query fragments to concepts during the retrieval phase.
  • 3. This system was applied to a domain-specific document collection, and the contribution of the ontologies to the CLIR system was evaluated in conjunction with the use of both Google and Microsoft Bing translation services. The results demonstrated that the use of domain-specific resources leads to a significant improvement in CLIR system performance. Sandhan [17] is a monolingual search system developed under the Technology Development for Indian Languages (TDIL) programme in the tourism domain for five Indian languages, viz. Bengali, Marathi, Hindi, Tamil and Telugu. In this system, the user can submit a query using the INSCRIPT or phonetic layout. The system processes the query based on its language and retrieves results in the respective language. A summary is generated for each retrieved document, which helps the user grasp the overall content of the document without opening it. III. PROPOSED METHODOLOGY Marathi is one of the most widely spoken languages in India, especially in the state of Maharashtra. Marathi uses the Devanagari script and draws its vocabulary mainly from Sanskrit [5]. The architecture of our Marathi-English CLIR system is shown in Fig. 1. The system is based on the query translation approach, as it is not feasible to translate the documents. The input to the system is the Marathi topic and description given in the FIRE 2011 topic set. As Marathi, like other Indian languages, is morphologically rich, the query words need to be stemmed to obtain the Marathi root words. The query is translated into English using a bilingual dictionary, and words are transliterated using a simple mapping from source-language characters to target-language characters. All possible combinations of translations are then disambiguated and given to the search engine.
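The translation stage outlined above (dictionary lookup, transliteration fallback, then all combinations of the per-word choices) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny stop-word list, dictionary and character map are hypothetical stand-ins for the CFILT resources, stemming is omitted, and the function names are not taken from the paper.

```python
from itertools import product

# Hypothetical toy resources; the real system uses the CFILT dictionary
# and morphological analyzer, which are not reproduced here.
STOP_WORDS = {"आणि", "चा"}
DICTIONARY = {"जलद": ["fast", "cloud"], "ट्रेन": ["train"]}

# Minimal Devanagari-to-Latin map covering only the example word.
CONSONANTS = {"ग": "g", "ध": "dh", "र": "r"}
VOWEL_SIGNS = {"ा": "a", "ो": "o"}
VIRAMA = "्"


def transliterate(word):
    """Map each Devanagari character to Latin letters (simple mapping)."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            # A consonant carries an inherent "a" unless a vowel sign
            # or a virama follows it.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        # The virama and unmapped characters produce no output.
    return "".join(out)


def candidate_queries(query):
    """Return every English candidate query for a Marathi query string."""
    words = [w for w in query.split() if w not in STOP_WORDS]
    # Dictionary translations if present, otherwise one transliteration.
    options = [DICTIONARY.get(w) or [transliterate(w)] for w in words]
    return [" ".join(choice) for choice in product(*options)]


print(candidate_queries("गोध्रा जलद ट्रेन"))
```

With these toy resources the out-of-dictionary word is transliterated and the ambiguous word yields two candidates, which a later disambiguation step would have to choose between.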
Figure 1 System Architecture A. Preprocessing of query The system takes a Marathi query as input. This query is first pre-processed to remove stop words and obtain root words. Stop words are commonly used words (such as the, is, at, which, on) that a search engine is programmed to ignore, both when indexing and when retrieving documents in response to a search query. The query words are then stemmed before their entries are looked up in the bilingual dictionary; this step is called Morphological Analysis. A simple lookup approach is used to remove stop words and obtain root words. B. Query Translation Translating a query from one language to another is known as query translation. Query translation is a crucial step in a CLIR system because many of its problems originate here, such as mismatched query terms, ambiguities and poor retrieval performance. The system uses a machine-readable bilingual Marathi-English dictionary for query translation. Bilingual dictionaries are designed for translating words and text from one language to another. A query is translated by replacing each word with its equivalent in the bilingual dictionary. C. Transliteration Transliteration is the process of converting text from one script to another. Many schemes and tools are available for transliterating Devanagari into English (Latin script), such as ITRANS and IAST. If no translation for a query word is found in the dictionary, the word is assumed to be a proper noun and is therefore transliterated by the Marathi-English transliteration module. The module is based on a simple mapping approach that returns the appropriate English character transliteration for each Marathi character. D. Word Sense Disambiguation In this paper, we propose a new unsupervised corpus-based WSD algorithm which uses the similarity between the senses of the target word and a context vector.
It also uses an English dictionary as an additional resource to find synonyms for the correctly sensed word so that more relevant documents can be retrieved. The sense of the target word is taken to be the one for which the similarity between gloss and context vector is greatest. WSD is used to detect the correct English sense of the Marathi word. This module makes use of the English Barcelona corpus and the English dictionary available at [19]. The Barcelona corpus is preprocessed by removing stop words and is then converted to an Excel database for easy manipulation. Consider the query “गोध्रा जलद ट्रेन”. After preprocessing, translation and transliteration, we get two meanings for the word जलद from the bilingual dictionary: “fast” and “cloud”. So, two translations can be formed: first “godhra fast train” and second “godhra cloud train”. Initially, “godhra fast train” is fed to the WSD module. The WSD module searches for each individual word of the input translation in the Excel corpus. If “train” is found in one of the rows, that row is locked and a further search is made for “godhra” and “fast” in the locked
  • 4. row. If “fast” is found in the locked row, then “fast” is marked as the correct sense. After this, “fast” is looked up in the English-English dictionary to get its synonyms. For the word “fast”, synonyms such as rapid and speedy are retrieved. Searching with these synonyms of the correctly sensed word allows more relevant documents to be retrieved. The same procedure is carried out for “godhra cloud train”, but no correct sense is obtained for this translation in the Excel corpus. Finally, three translated queries are submitted to the search engine: godhra fast train, godhra rapid train and godhra speedy train. E. Monolingual English IR The disambiguated translations of the Marathi query are submitted to the English IR engine. Apache Lucene is used for indexing the input documents and searching the queries over the target collection. Lucene creates an inverted index. A forward index maps each document to the terms it contains; Lucene does the reverse, mapping each term to the list of documents containing it, which makes searching faster. Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM): documents “approved” by BM are scored by VSM [20]. Algorithm 1 shows the step-by-step process of the proposed Marathi-English CLIR.
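As an illustration of the row-locking disambiguation walked through in Section III-D (the “godhra fast train” example), a minimal sketch might look like the code below. It assumes the preprocessed corpus is available as a list of word sets, one per row, and that the synonym dictionary is a plain mapping; the sample rows, the synonym entries and the function names are hypothetical and are not taken from the Barcelona corpus or the CFILT dictionary.

```python
# Hypothetical stand-ins for the preprocessed corpus rows and the
# English-English dictionary used for synonym lookup.
CORPUS_ROWS = [
    {"fast", "train", "station", "passenger"},
    {"cloud", "rain", "sky"},
]
SYNONYMS = {"fast": ["rapid", "speedy"]}


def supported_senses(context, senses, rows):
    """Keep senses that share at least one row with a context word."""
    kept = []
    for sense in senses:
        for row in rows:
            # A row is "locked" when it contains a context word; the
            # sense is accepted if it also appears in that locked row.
            if sense in row and any(word in row for word in context):
                kept.append(sense)
                break
    return kept


def final_queries(words, slot, senses, rows, synonyms):
    """Build final queries from accepted senses and their synonyms."""
    context = [w for i, w in enumerate(words) if i != slot]
    queries = []
    for sense in supported_senses(context, senses, rows):
        for variant in [sense] + synonyms.get(sense, []):
            filled = list(words)
            filled[slot] = variant
            queries.append(" ".join(filled))
    return queries


queries = final_queries(["godhra", None, "train"], 1,
                        ["fast", "cloud"], CORPUS_ROWS, SYNONYMS)
print(queries)
```

On this toy data “cloud” never shares a row with “godhra” or “train”, so only “fast” and its synonyms survive, mirroring the three queries listed in the worked example.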
Algorithm 1 Marathi-English CLIR by combining translation and transliteration
Input: Query and description in Marathi script
Output: Relevant documents in English with respect to the given query
1: Remove stop words from the user query and get the root words in Marathi
2: for each root word do
3:   Translate into English using the Marathi-English bilingual dictionary
4:   if not found in dictionary then
5:     Transliterate into English
6: Make all possible translations in English using steps 3 and 5
7: Disambiguate all translations for each word
8: Submit the disambiguated queries to the English search engine
9: Display all relevant English documents retrieved to the user
IV. DATASET AND RESOURCES A. Document Collection The FIRE (Forum of Information Retrieval Evaluation) 2011 dataset is an English document collection consisting of news articles from “The Telegraph, Calcutta Edition” from 2001-2010. For this system we have considered news articles from the years 2004-07. The dataset consists of a set of user queries, each with “Title”, “Description” and “Narrative” fields, a set of documents, and “qrel” files, which give the list of relevant documents for each query. The system uses Marathi queries to retrieve English documents. For the proposed system, we used the “Title” and “Description” fields of the queries. The dataset is freely available online at [18]. B. CFILT Resources The system uses various resources created by the Center for Indian Language Technologies (CFILT): a bilingual Marathi-English dictionary for translation, a morphological analyzer to get the root words in Marathi, and the Barcelona corpus and an English dictionary for word sense disambiguation. All resources are available online at [19]. C. Apache Lucene The proposed system uses the open-source Apache Lucene [20] search engine library as the English IR engine, i.e. for indexing the input documents and searching for relevant documents over the target collection.
V. EXPERIMENTAL RESULTS We have developed two systems, CLIR1 and CLIR2, both Marathi-English CLIR systems based on the query translation approach. Both systems take a query in Marathi, which after pre-processing is translated into English using the bilingual dictionary; words not found in the dictionary are transliterated. The final query is given to the search engine to retrieve relevant documents in English. The first system has no WSD module, whereas the second system contains WSD. In addition to the input query, the second system takes two more inputs: a description of the query and the domain in which the system should retrieve results. TABLE I shows statistics of the data collection used for implementation. TABLE II shows the results for the 15 queries tested on the proposed system, along with graphs.
TABLE I Statistics of Data Collection
SN | Metrics | CLIR
1 | Query Language | Marathi
2 | Document Language | English
3 | No. of queries | 5
4 | No. of Documents | 1,20,379
5 | Size of Collection | 288MB
6 | Avg. No. of relevant documents per query | 15
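The recall and precision figures reported in TABLE II presumably follow the standard set-based definitions, which can be sketched as below. The paper does not state the retrieval cutoff; the example assumes 200 retrieved documents and 47 relevant documents purely because those numbers reproduce the CLIR1 values listed for Q2 (recall 0.617021, precision 0.145), so they should be read as an inference, and the document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: list of document ids returned by the search engine
    relevant:  collection of ids judged relevant (e.g. from a qrel file)
    """
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed scenario: 47 relevant documents exist, 200 are retrieved,
# and 29 of the retrieved documents are relevant.
relevant_docs = {f"rel{i}" for i in range(47)}
retrieved_docs = [f"rel{i}" for i in range(29)] + [f"other{i}" for i in range(171)]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(round(p, 3), round(r, 6))
```

Under the same assumption, retrieving 40 of the 47 relevant documents would give the CLIR2 values shown for Q2 (recall 0.851064, precision 0.2).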
  • 5. TABLE II Precision and Recall Results
SN | Query | CLIR1 Recall | CLIR1 Precision | CLIR2 Recall | CLIR2 Precision
Q1 | स्वाईन फ्लू लस | 1 | 0.03 | 1 | 0.03
Q2 | गोध्रा ट्रेन आक्रमण | 0.617021 | 0.145 | 0.851064 | 0.2
Q3 | मायकल जॅक्सनचा अकाली मृत्यू | 1 | 0.025 | 1 | 0.025
Q4 | बराक ओबामाचा विजय | 0.785714 | 0.055 | 1 | 0.07
Q5 | अबू घारिब जेलमधील छळ | 0.685714 | 0.12 | 0.857143 | 0.15
Q6 | मनोरंजनाच्या जगातील चाचेगिरी | 0.90625 | 0.145 | 0.9375 | 0.15
Q7 | भारतीय महिला आरक्षण विधेयक | 0.666667 | 0.05 | 0.866667 | 0.065
Q8 | सोमाली समुद्री चाच्यांचा पूर्णपणे पराभव | 1 | 0.055 | 1 | 0.055
Q9 | क्लोन केलेल्या मानवी मूलांचा जन्म | 1 | 0.055 | 1 | 0.055
Q10 | बेकायदेशीर झाडे तोडणे | 0.486486 | 0.18 | 0.675676 | 0.25
Q11 | भोपाळ वायू दुर्घटना | 0.961538 | 0.25 | 1 | 0.26
Q12 | बेनजीर भुट्टो यांची हत्या | 0.647059 | 0.055 | 0.882353 | 0.075
Q13 | भारतातील सायबर गुन्हा | 0.961538 | 0.125 | 0.961538 | 0.125
Q14 | दिल्लीमधील राष्ट्रकुल खेळ | 0.666667 | 0.06 | 0.888889 | 0.08
Q15 | बिल गेट्सचे परोपकारी प्रयत्न | 1 | 0.025 | 1 | 0.025
Figure 2 Recall Results (CLIR1 and CLIR2)
Figure 3 Precision Results (CLIR1 and CLIR2)
VI. CONCLUSION Cross-language IR is a technique for searching documents in many languages across the world, and it can serve as the baseline for searching not only between two languages but also across multiple languages. In this paper, we have proposed a new approach to word sense disambiguation for a Marathi-English CLIR system, which makes use of a corpus and a dictionary in Excel format. In addition to the query, the user can provide a description and a domain for the query. The approach of using a detailed query performed well and retrieved relevant results with good precision: in TABLE II, the WSD-based system matches or exceeds the system without WSD in both recall and precision on all 15 queries. In future, we would like to explore alternative search engines and scoring functions in place of the standard Lucene scoring function. Acknowledgements We would like to thank the publishers and researchers for making their resources available, and our teachers for their guidance.
We are thankful to the authorities of Savitribai Phule Pune University for their constant guidance and support. We also thank the college authorities for providing the required infrastructure and support. Lastly, we would like to extend our heartfelt gratitude to the friends and family members who have inspired, guided and assisted us in performing this work.
REFERENCES
[1] D. Wu and D. He, “A study of query translation using Google machine translation system,” in Computational Intelligence and Software Engineering (CiSE), 2010 International Conference on, pp. 1–4, IEEE, 2010.
[2] K. Kishida, “Technical issues of cross-language information retrieval: a review,” Information Processing & Management, vol. 41, no. 3, pp. 433–455, 2005.
[3] Zhang Yue-Jie and Tao Zhang, “Research on English-Chinese Cross-Language Information Retrieval,” International Conference on Machine Learning and Cybernetics, IEEE, 2007.
[4] Zhang Tao and Yue-Jie Zhang, “Research on Chinese-English Cross-Language Information Retrieval,” International Conference on Machine Learning and Cybernetics, IEEE, 2008.
[5] M. K. Chinnakotla, S. Ranadive, O. P. Damani, and P. Bhattacharyya, “Hindi to English and Marathi to English cross language information retrieval evaluation,” in Advances in Multilingual and Multimodal Information Retrieval, pp. 111–118, Springer, 2008.
[6] A. Almeida and P. Bhattacharyya, “Using morphology to improve Marathi monolingual information retrieval,” FIRE Working Note, 2008.
[7] S. Subramaniam, A. K. Singh, P. Dasigi, and V. Varma, “Experiments in CLIR using fuzzy string search based on surface similarity,” in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 682–683, ACM, 2009.
[8] N. A. Nasharuddin, “Cross-lingual information retrieval state-of-the-art,” electronic Journal of Computer Science and Information Technology (eJCSIT), vol. 2, no. 1, pp. 1–5, 2010.
[9] Nasharuddin, Nurul Amelina, et al., “A review on the cross-lingual information retrieval,” Information Retrieval and Knowledge Management (CAMP), 2010 International Conference, IEEE, 2010.
[10] K. Sourabh, “An extensive literature review on CLIR and MT activities in India,” International
  • 6. Journal of Scientific and Engineering Research, 2013.
[11] P. Arora, J. Foster, and G. J. Jones, “DCU at FIRE 2013: cross-language !ndian news story search,” 2013.
[12] H. B. Patil, A. S. Patil, and B. V. Pawar, “Part-of-Speech Tagger for Marathi Language using Limited Training Corpora,” International Journal of Computer Applications, 2014.
[13] S. Varshney and J. Bajpai, “Improving performance of English-Hindi cross language information retrieval using transliteration of query terms,” arXiv preprint arXiv:1401.3510, 2014.
[14] Pratibha Bajpai and Dr. Parul Verma, “Cross language information retrieval in Indian language perspective,” IJRET: International Journal of Research in Engineering and Technology, 2014.
[15] Alessio Bosca, M. Casu, M. Dragoni, and C. D. Francescomarino, “Using Semantic and Domain-Based Information in CLIR Systems,” The Semantic Web: Trends and Challenges, Springer International Publishing, 2014.
[16] S. Mayanale and S. Pawar, “Survey of CLIR and MT systems in Marathi language,” International Journal of Computational Linguistics and Natural Language Processing, 2015 (in press).
[17] Sandhan [Online]. Available: http://tdil-dc.in/index.php?option=comcontent&view=article&id=66, http://tdil-dc.in/Sandhan/locale.jsp?hi
[18] FIRE data collection [Online]. Available: http://www.isical.ac.in/clia/data.html
[19] CFILT resources [Online]. Available: http://www.cfilt.iitb.ac.in/Downloads.html
[20] Apache Lucene Core [Online]. Available: http://lucene.apache.org/core/