IJRET: International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Survey on Indian CLIR and MT systems in Marathi Language Editor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of the user’s query. This helps users to express their information need in their native languages. The machine translation based (MT-based)
approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done on CLIR and MT systems for the Marathi language in India.
An information retrieval (IR) system aims to retrieve
documents relevant to a user query, where the query is a set of
keywords. Cross-language information retrieval (CLIR) is a
retrieval process in which the user issues queries in one language to
retrieve information in another language. The growing
need of Internet users to access information
expressed in languages other than their own has led to Cross
Language Information Retrieval (CLIR) becoming established as
a major topic in IR.
A Review on the Cross and Multilingual Information Retrieval dannyijwest
In this paper we explore some of the most important areas of information retrieval, in particular
Cross-lingual Information Retrieval (CLIR) and Multilingual Information Retrieval (MLIR). CLIR deals with
asking questions in one language and retrieving documents in a different language. MLIR deals with asking
questions in one or more languages and retrieving documents in one or more different languages. With an
increasingly globalized economy, the ability to find information in other languages is becoming a necessity.
We also present the evaluation initiatives of the information retrieval domain. Finally, we present an
overall review of the research work in Indian and foreign languages.
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD IJERA Editor
With the rapid growth of multilingual information on the Internet, Cross Language Information Retrieval (CLIR) is becoming the need of the day. It helps users to query in their native language and retrieve information in any language. However, the performance of CLIR is poor compared to monolingual retrieval due to lexical ambiguity, mismatching of query terms and out-of-vocabulary words. In this paper, we propose an algorithm for improving the performance of a Marathi-English CLIR system. The system first finds possible translations of the input query in the target language, disambiguates them and then submits the English queries to a search engine for relevant document retrieval. The disambiguation is based on an unsupervised corpus-based method which uses an English dictionary as an additional resource. The experiment is performed on the FIRE 2011 (Forum for Information Retrieval Evaluation) dataset using the “Title” and “Description” fields as inputs. The experimental results show that the proposed approach improves the performance of the Marathi-English CLIR system with a good precision level.
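The translate-then-disambiguate step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the dictionary entries, the co-occurrence counts and the function name `translate_query` are all hypothetical placeholders, and a simple pairwise co-occurrence score stands in for whatever corpus-based measure the paper actually uses.

```python
from itertools import product

# Hypothetical toy bilingual dictionary: source-language term -> English candidates.
# The keys are placeholders, not real Marathi words.
BILINGUAL_DICT = {
    "bank_src": ["bank", "shore"],
    "loan_src": ["loan"],
}

# Hypothetical co-occurrence counts taken from an English corpus.
COOCCURRENCE = {
    frozenset(["bank", "loan"]): 12,
    frozenset(["shore", "loan"]): 1,
}


def translate_query(source_terms):
    """Return the candidate translation combination with the highest
    total pairwise co-occurrence score (an assumed scoring scheme)."""
    # Out-of-vocabulary terms are passed through unchanged.
    candidates = [BILINGUAL_DICT.get(term, [term]) for term in source_terms]

    def score(combo):
        # Sum the co-occurrence counts over every unordered pair of terms.
        return sum(
            COOCCURRENCE.get(frozenset([a, b]), 0)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )

    return list(max(product(*candidates), key=score))


print(translate_query(["bank_src", "loan_src"]))
```

In this toy setting the ambiguous first term resolves to "bank" because it co-occurs with "loan" more often than "shore" does. Enumerating every combination grows exponentially with query length, so a real system would need pruning.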
Design and Development of a Malayalam to English Translator- A Transfer Based... Waqas Tariq
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module and a bilingual dictionary. All the modules are morpheme based to reduce dictionary size. The system does not rely on a stochastic approach; it is based on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. The system uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. The system is designed using artificial intelligence techniques.
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE csandit
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are
automatically extracted, they can be used to increase the performance of text mining
applications such as Categorisation, Clustering, Information Retrieval Systems, Machine
Translation and Summarization. This paper introduces our proposed multiword term
extraction system based on contextual information. We propose a new method based on
a hybrid approach for Arabic multiword term extraction. Like other methods based on a hybrid
approach, our method is composed of two main steps: the Linguistic approach and the
Statistical one. In the first step, the Linguistic approach uses a Part Of Speech (POS) Tagger
(Taani’s Tagger) and the Sequence Identifier as patterns in order to extract the candidate
AMWTs. In the second step, which includes our main contribution, the Statistical approach
incorporates the contextual information by using a newly proposed association measure based on
Termhood and Unithood for AMWT extraction. To evaluate the efficiency of our proposed
method for AMWT extraction, it has been tested and compared using three different
association measures: the proposed one, named NTC-Value, NC-Value and C-Value. The
experimental results, using Arabic texts taken from the environment domain, show that our
hybrid method outperforms the other ones in terms of precision; in addition, it can deal correctly
with tri-gram Arabic multiword terms.
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL IJCI JOURNAL
Nowadays, the number of Web users accessing information over the Internet is increasing day by day. A huge
amount of information on the Internet is available in different languages and can be accessed by anybody at any
time. Information Retrieval (IR) deals with finding useful information in a large collection of
unstructured, structured and semi-structured data. Information Retrieval can be classified into different
classes such as monolingual information retrieval, cross language information retrieval (CLIR) and multilingual
information retrieval (MLIR). In the current scenario, the diversity of information and language
barriers are serious issues for communication and cultural exchange across the world. To overcome such
barriers, cross language information retrieval (CLIR) systems are nowadays in strong demand. CLIR refers
to information retrieval activities in which the query and the documents may appear in different languages.
This paper gives an overview of the new application areas of CLIR and reviews the approaches used in
CLIR research for query and document translation. Further, based on the available literature, a
number of challenges and issues in CLIR have been identified and discussed.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation. It conflates all the variants of a word to a common form called the stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all these applications. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers and various existing stemmers for Indic languages.
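Term conflation as described here can be sketched with a simple suffix-stripping stemmer in Python. This is only an illustration of the general idea, not any of the stemmers surveyed in the paper: the English suffix list, the minimum stem length and the function name `light_stem` are assumptions chosen for readability.

```python
# Hypothetical suffix list; a stemmer for an Indic language would use
# language-specific suffixes instead.
SUFFIXES = ["ing", "ed", "es", "s"]


def light_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# The variants below are conflated to one common stem.
variants = ["retrieved", "retrieving", "retrieves"]
print({word: light_stem(word) for word in variants})
```

The minimum stem length is a common guard against over-stemming short words; for example, "is" is left unchanged rather than reduced to "i".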
Improving performance of english hindi cross language information retrieval u... ijnlc
The main issue in Cross Language Information Retrieval (CLIR) is the poor performance of retrieval in
terms of average precision when compared to monolingual retrieval performance. The main reasons
behind the poor performance of CLIR are mismatching of query terms, lexical ambiguity and un-translated
query terms. These problems need to be addressed in order to increase the
performance of a CLIR system. In this paper, we address them by
proposing an algorithm for improving the performance of an English-Hindi CLIR system. We generate all possible
combinations of the Hindi translated query, using transliteration of English query terms, and choose the best
query among them for retrieval of documents. The experiment is performed on the FIRE 2010 (Forum for
Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach
helps in overcoming the existing problems and
outperforms the existing English-Hindi CLIR system in terms of average precision.
A Novel Approach for Rule Based Translation of English to Marathi aciijournal
This paper presents a design for a rule-based machine translation system for the English to Marathi language pair. The machine translation system takes an English sentence as input and parses it with the help of the Stanford parser, which is used for the source-side processing. An English to Marathi bilingual dictionary is to be created. The system takes the parsed output, separates the source text word by word and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach IJERA Editor
Language is an effective medium of communication that conveys the ideas and expressions of the human
mind. There are more than 5000 languages in the world. Knowing all these languages is
not a practical solution to the problems caused by the language barrier in communication. In this multilingual world, with a
huge amount of information exchanged between various regions and in different languages in digitized format,
it has become necessary to find an automated process to convert from one language to another. Natural
Language Processing (NLP) is an active area of research that explores how computers can be used to
understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to
transliterating proper nouns from Punjabi to Hindi is developed. The hybrid approach is a
combination of Direct Mapping, a Rule based approach and a Statistical Machine Translation (SMT) approach.
The proposed system is tested on various proper nouns from different domains, and its accuracy
is reported to be very good.
Myanmar Named Entity Recognition with Hidden Markov Model ijtsrd
Named Entity Recognition (NER) is the process of detecting Named Entities (NEs) in a file, document or corpus and categorizing them into Named Entity classes such as name of city, state, country, organization, person, location, sport, river, quantity etc. This paper introduces Named Entity Recognition for the Myanmar language using a Hidden Markov Model (HMM). The main idea behind the use of an HMM is that it is language independent, so the system can be applied to any language domain. The corpus used by our NER system is also not domain specific. Khin Khin Lay | Aung Cho "Myanmar Named Entity Recognition with Hidden Markov Model" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd24012.pdf
Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/24012/myanmar-named-entity-recognition-with-hidden-markov-model/khin-khin-lay
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT... kevig
This study proposes a method to develop neural models of a morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing the text. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
A New Approach to Parts of Speech Tagging in Malayalam ijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence with a tag that indicates the word’s
usage in the sentence. Usually, these tags indicate a syntactic class such as noun or verb, and sometimes
include additional information such as case markers (number, gender etc.) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the Rule based
Approach and the Stochastic Approach. The Rule based Approach uses predefined handwritten rules. It is the
oldest approach and uses a lexicon or dictionary for reference. The Stochastic Approach uses probabilistic and
statistical information to assign tags to words. It uses a large corpus, so its time and space
complexity are high, whereas the Rule based Approach has lower complexity in both time and space. The Stochastic
Approach is the most widely used one nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to the root word forms. The
currently used algorithms are efficient machine learning algorithms, but they are not built for
Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent tag information. The algorithm uses
multithreading. Tagging is done using the probability of occurrence of the sentence
structure along with the dictionary entry.
Named Entity Recognition System for Hindi Language: A Hybrid Approach Waqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text to speech synthesis, natural language understanding etc. It seeks to classify words which represent names in text into predefined categories like location, person-name, organization, date, time etc. In this paper we have used a combination of machine learning and rule based approaches to classify named entities. The paper introduces a hybrid approach for NER. We have experimented with statistical approaches like Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt) and a rule based approach based on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
Hybrid approaches for automatic vowelization of arabic texts ijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is
made up of two modules. In the first one, a morphological analysis of the text words is performed using the
open source morphological analyzer AlKhalil Morpho Sys. The outputs for each word, analyzed out of context,
are its different possible vowelizations. The integration of this analyzer in our vowelization system required
the addition of a lexical database containing the most frequent words in the Arabic language. Using a
statistical approach based on two hidden Markov models (HMM), the second module aims to eliminate the
ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the
vowelized words are the hidden states. The observed states of the second HMM are identical to those of the
first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our
system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho
Sys. Our approach opens an important way to improve the performance of automatic vowelization of
Arabic texts for other uses in automatic natural language processing.
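The path-selection step mentioned in this abstract can be illustrated with a generic Viterbi decoder in Python. This is a textbook sketch, not the authors' system: the abstract token names ("A", "A1", ...) and all probabilities below are made-up placeholders standing in for unvowelized words (observations) and their candidate vowelizations (hidden states).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{
        s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability into s.
            prob, back = max(
                ((prev[r][0] * trans_p.get(r, {}).get(s, 0.0), r) for r in states),
                key=lambda pair: pair[0],
            )
            column[s] = (prob * emit_p[s].get(obs, 0.0), back)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model: observation "A" has candidates A1/A2, observation "B" has B1/B2.
states = ["A1", "A2", "B1", "B2"]
start_p = {"A1": 0.6, "A2": 0.4}
trans_p = {"A1": {"B1": 0.5, "B2": 0.5}, "A2": {"B2": 1.0}}
emit_p = {"A1": {"A": 1.0}, "A2": {"A": 1.0}, "B1": {"B": 1.0}, "B2": {"B": 1.0}}

print(viterbi(["A", "B"], states, start_p, trans_p, emit_p))
```

In this toy model the locally most likely first state is A1, but the best complete path starts with A2, which is the kind of global choice a word-by-word greedy selection would miss.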
The complexity of Arabic morphology is a core challenge of
Arabic information retrieval. Because Arabic is a highly
inflected language, it is a complex environment for
information retrieval specialists. This paper explains the
importance of stemming in information retrieval
by developing a method for searching for information needs in Arabic text. The newly developed
method is based on a light stemmer and increases the matching
of keywords between the query and the related
documents in the test collection.
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH ijnlc
Researchers have been working on machine translation for a long time. However, a flawless
machine translator is still a dream, and only a small number of researchers have focused on
translating Marathi text to English. Perfect machine translation systems have not yet been built
owing to the fact that languages differ syntactically as well as morphologically. The majority of researchers
have opted for Statistical Machine Translation, whereas in this paper we address the challenges of
Rule based Machine Translation. The paper describes the major divergences observed between
Marathi and English, the many challenges encountered while attempting to build a machine translation
system from Marathi to English using a rule based approach, and the rules to handle these challenges. As there
are exceptions to the rules and a limit to the feasibility of maintaining a knowledge base, practical machine
translation from Marathi to English is a complex task.
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS ijnlc
Globalization and growth of Internet users truly demands for almost all internet based applications to
support local languages. Support of local languages can be given in all internet based applications by
means of Machine Transliteration and Machine Translation. This paper provides the thorough survey on
machine transliteration models and machine learning approaches used for machine transliteration over the
period of more than two decades for internationally used languages as well as Indian languages. Survey
shows that linguistic approach provides better results for the closely related languages and probability
based statistical approaches are good when one of the languages is phonetic and other is non-phonetic.
Better accuracy can be achieved only by using Hybrid and Combined models.
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... Syeful Islam
Hundreds of millions of people of almost all levels of education and attitudes, from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and there is a huge number of different naming entities in Bangla. It is therefore difficult to tell whether a word is a naming word or not. Here we introduce a new approach to identifying naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible while storing a minimal number of words in the dictionary.
ATAR: Attention-based LSTM for Arabizi transliteration IJECEIAES
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale “Arabizi to Arabic script” parallel corpus, focusing on the Jordanian dialect and consisting of more than 25 k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Tamil-English Document Translation Using Statistical Machine Translation Appr... baskaran_md
This paper presents a new method for translating a text document from Tamil to English. Our method is based on the Statistical Machine Translation (SMT) approach, combined with morphological analysis, because Tamil is a highly inflected language. The paper presents a slight modification to SMT to make the approach more efficient and effective, and the experimental results show the method to be fast and accurate in the translation process.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Clustering of medline documents using semi supervised spectral clustering eSAT Publishing House
Improving performance of english hindi cross language information retrieval u...ijnlc
The main issue in Cross Language Information Retrieval (CLIR) is poor retrieval performance, in
terms of average precision, compared with monolingual retrieval. The main reasons
for this poor performance are mismatched query terms, lexical ambiguity and untranslated
query terms. These problems need to be addressed in order to improve the
performance of CLIR systems. In this paper, we address them by
proposing an algorithm that improves the performance of an English-Hindi CLIR system. We generate all possible
combinations of the Hindi translated query, using transliteration of English query terms, and choose the best
query among them for document retrieval. The experiments are performed on the FIRE 2010 (Forum for
Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach
helps overcome the existing problems and
outperforms the existing English-Hindi CLIR system in terms of average precision.
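The query-combination step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the "best" query is chosen, so the scoring function below (sum of document frequencies) is only an assumption, and the romanized candidate terms and frequency values are hypothetical toy data rather than anything from the paper.

```python
from itertools import product

def candidate_queries(term_candidates):
    """Yield every query that picks one candidate rendering per source term."""
    for combo in product(*term_candidates):
        yield list(combo)

def score(query, doc_freq):
    """Hypothetical score: sum of document frequencies of the query terms."""
    return sum(doc_freq.get(term, 0) for term in query)

def best_query(term_candidates, doc_freq):
    """Return the candidate query with the highest (assumed) score."""
    return max(candidate_queries(term_candidates),
               key=lambda q: score(q, doc_freq))

# Toy data (hypothetical): candidate renderings for two English query terms.
term_candidates = [["ganga", "gangaa"], ["nadi", "nadee", "river"]]
doc_freq = {"ganga": 40, "gangaa": 3, "nadi": 25, "nadee": 2}

all_queries = list(candidate_queries(term_candidates))
chosen = best_query(term_candidates, doc_freq)
print(len(all_queries), chosen)
```

The number of candidate queries grows as the product of the per-term candidate counts, so a real system would likely need to cap the candidates kept per term.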
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
This paper presents a design for a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which is used mainly for source-side processing. An English-to-Marathi bilingual dictionary is to be created. The system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid ApproachIJERA Editor
Language is an effective medium of communication that conveys the ideas and expressions of the human
mind. There are more than 5000 languages in the world. Learning all of these languages is
not a practical solution to the language barrier in communication. In this multilingual world, with a
huge amount of information exchanged between regions and in different languages in digitized format,
it has become necessary to find an automated process for converting from one language to another. Natural
Language Processing (NLP) is an active area of research that explores how computers can be used to
understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to
transliterating proper nouns from Punjabi to Hindi is developed. The hybrid approach is a
combination of direct mapping, a rule-based approach and a Statistical Machine Translation (SMT) approach.
The proposed system is tested on proper nouns from different domains, and its accuracy is reported to be
very good.
Myanmar Named Entity Recognition with Hidden Markov Modelijtsrd
Named Entity Recognition is the process of detecting Named Entities (NEs) in a file, document or corpus and categorizing them into Named Entity classes such as name of city, state, country, organization, person, location, sport, river, quantity etc. This paper introduces Named Entity Recognition (NER) for the Myanmar language using a Hidden Markov Model (HMM). The main idea behind the use of HMM is that it is language independent, so the system can be applied to any language domain. The corpus used by our NER system is also not domain specific. Khin Khin Lay | Aung Cho "Myanmar Named Entity Recognition with Hidden Markov Model" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd24012.pdf
Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/24012/myanmar-named-entity-recognition-with-hidden-markov-model/khin-khin-lay
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have delimiters between words. Hiragana is a type of Japanese phonographic character used in texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence with a tag that indicates the
word's usage in the sentence. Usually, these tags indicate syntactic classification such as noun or verb, and sometimes
include additional information such as case markers (number, gender etc.) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the rule-based
approach and the stochastic approach. The rule-based approach uses predefined handwritten rules. This is the
oldest approach, and it uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and
statistical information to assign tags to words. It uses a large corpus, so its time and space
complexity are high, whereas the rule-based approach has lower complexity for both time and space. The stochastic
approach is the most widely used nowadays because of its accuracy.
Malayalam belongs to the Dravidian family of languages and is inflectional, with suffixes attached to the root word forms. The
currently used algorithms are efficient machine learning algorithms, but they are not built for
Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent tag information. The algorithm uses
multithreading. Tagging is done using the probability of occurrence of the sentence
structure along with the dictionary entry.
Named Entity Recognition System for Hindi Language: A Hybrid ApproachWaqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text-to-speech synthesis, natural language understanding etc. It seeks to classify words that represent names in text into predefined categories like location, person name, organization, date, time etc. This paper introduces a hybrid approach to NER that combines machine learning and rule-based approaches to classify named entities. We have experimented with statistical approaches, namely Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt), and with a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for morphologically rich languages like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
Hybrid approaches for automatic vowelization of arabic textsijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is
made up of two modules. In the first one, a morphological analysis of the text words is performed using the
open source morphological analyzer AlKhalil Morpho Sys. The output for each word, analyzed out of context,
is its set of possible vowelizations. Integrating this analyzer into our vowelization system required
the addition of a lexical database containing the most frequent words in the Arabic language. Using a
statistical approach based on two hidden Markov models (HMM), the second module aims to eliminate the
ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the
vowelized words are the hidden states. The observed states of the second HMM are identical to those of the
first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our
system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho
Sys. Our approach opens an important way to improve the performance of automatic vowelization of
Arabic texts for other uses in automatic natural language processing.
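The disambiguation step in this abstract, where unvowelized words are observations and vowelized candidates are hidden states, can be sketched with a small Viterbi decoder. This is only an illustration under stated assumptions: the candidate lists stand in for the analyzer's output, the emission probability is taken as 1 for every listed candidate, and the romanized word forms and probabilities are hypothetical toy values, not data from the paper.

```python
import math

def viterbi(observations, candidates, start_p, trans_p, floor=1e-6):
    """Return the most probable sequence of hidden states.

    observations: list of unvowelized words
    candidates:   dict word -> list of possible vowelized forms (assumed input)
    start_p:      dict state -> P(state at position 0)
    trans_p:      dict (prev_state, state) -> P(state | prev_state)
    floor:        small probability used for unseen events (assumption)
    """
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (math.log(start_p.get(s, floor)), [s])
            for s in candidates[observations[0]]}
    for word in observations[1:]:
        new_best = {}
        for s in candidates[word]:
            # Pick the predecessor that maximizes the path score into s.
            prev, (lp, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] + math.log(trans_p.get((kv[0], s), floor)),
            )
            new_best[s] = (lp + math.log(trans_p.get((prev, s), floor)), path + [s])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Toy data (hypothetical romanized forms and probabilities).
observations = ["ktb", "aldrs"]
candidates = {"ktb": ["kataba", "kutub"], "aldrs": ["aldarsa", "aldarsu"]}
start_p = {"kataba": 0.6, "kutub": 0.4}
trans_p = {
    ("kataba", "aldarsa"): 0.7, ("kataba", "aldarsu"): 0.3,
    ("kutub", "aldarsa"): 0.2, ("kutub", "aldarsu"): 0.8,
}

best_path = viterbi(observations, candidates, start_p, trans_p)
print(best_path)
```

Working in log space avoids numerical underflow on long sentences, and restricting each position to the analyzer's candidates keeps the search small.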
The complexity of Arabic morphology is a core challenge of
Arabic information retrieval. Because Arabic is highly
inflected, it is a difficult environment for
information retrieval specialists. This paper explains the
importance of stemming in information retrieval
by developing a method for searching Arabic text for information needs. The new
method is based on a light stemmer that increases keyword
matching between the query and the related
documents in the test collection.
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISHijnlc
Machine translation has been pursued by researchers for quite a long time. However, a flawless machine
translator remains a dream, and only a small number of researchers have focused on
translating Marathi text to English. Perfect machine translation systems have not yet been built
because languages differ syntactically as well as morphologically. The majority of researchers
have opted for statistical machine translation, whereas in this paper we address the challenges of
rule-based machine translation. The paper describes the major divergences observed between
Marathi and English, the many challenges encountered while attempting to build a machine translation
system from Marathi to English using a rule-based approach, and rules to handle these challenges. As there
are exceptions to the rules and limits to the feasibility of maintaining a knowledge base, practical machine
translation from Marathi to English is a complex task.
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELSijnlc
Globalization and the growth of Internet users demand that almost all internet-based applications
support local languages. Support for local languages can be provided in internet-based applications by
means of machine transliteration and machine translation. This paper provides a thorough survey of
machine transliteration models and the machine learning approaches used for machine transliteration over a
period of more than two decades, for internationally used languages as well as Indian languages. The survey
shows that the linguistic approach provides better results for closely related languages, and that probability-based
statistical approaches work well when one of the languages is phonetic and the other is non-phonetic.
Better accuracy can be achieved only by using hybrid and combined models.
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
Hundreds of millions of people of almost all levels of education and attitudes, from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems in Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different named entities exist in Bangla. It is therefore difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of named entities in the word dictionary. The goal is to make Bangla sentence conversion possible with minimal word storage in the dictionary.
ATAR: Attention-based LSTM for Arabizi transliterationIJECEIAES
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25 k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Tamil-English Document Translation Using Statistical Machine Translation Appr...baskaran_md
This paper presents a new method for translating a text document from Tamil to English. Our method is based on the Statistical Machine Translation (SMT) approach, combined with morphological analysis because Tamil is a highly inflected language. The paper presents a slight modification to SMT to make the approach more efficient and effective, and the experimental results show the method to be fast and accurate in the translation process.
Orchestration of web services based on t qo s using user and web services agenteSAT Publishing House
Experimental study of the forces above and under the vibration insulators of ...eSAT Publishing House
Analysis of rc bridge decks for selected national and internationalstandard l...eSAT Publishing House
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications. One of the core enabling technologies required in NLP applications is morphological analysis. This paper presents the Myanmar Verb Phrase Identification and Translation Algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and very few lexicons have been created or studied for Myanmar compared with other languages such as English, French and Czech. Therefore, this paper proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results show that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a global foreign lexical
source. Our method leverages resources available in another language, such as SentiWordNet in
English, for Arabic, a language with limited resources. The knowledge taken from the external
resource is injected into the feature model while the machine-learning-based classifier is trained. The
first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step
calculates the polarity score using a machine translation technique and English SentiWordNet. The scores
for each text are added to the model in three pairs for objective, positive, and negative. The last step of
our method involves training the ML classifier on that model to predict the sentiment of the Arabic text.
Our method improves performance over the BOW baseline model in most cases. In
addition, it seems a viable approach to sentiment analysis of Arabic text where available resources
are limited.
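The feature construction described in this abstract (a bag-of-words vector extended with objective, positive and negative scores) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `translate` dictionary stands in for a machine translation step, the `lexicon` dictionary stands in for English SentiWordNet, all scores and the romanized tokens are hypothetical, and it assumes the three scores are averaged per text and appended as three extra features. Classifier training is omitted.

```python
def build_vocab(texts):
    """Sorted vocabulary over whitespace-tokenized texts."""
    return sorted({tok for text in texts for tok in text.split()})

def bow_vector(text, vocab):
    """Bag-of-words counts in vocabulary order."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]

def polarity_scores(text, translate, lexicon):
    """Average (objective, positive, negative) over tokens found in the lexicon."""
    hits = [lexicon[translate[t]] for t in text.split()
            if translate.get(t) in lexicon]
    if not hits:
        return [1.0, 0.0, 0.0]  # assumption: no lexicon hit means fully objective
    n = len(hits)
    return [sum(h[i] for h in hits) / n for i in range(3)]

def features(text, vocab, translate, lexicon):
    """BOW vector followed by the three polarity scores."""
    return bow_vector(text, vocab) + polarity_scores(text, translate, lexicon)

# Toy data (hypothetical): romanized tokens, glosses and lexicon scores.
texts = ["film jamil", "film sayyi"]
translate = {"film": "film", "jamil": "beautiful", "sayyi": "bad"}
lexicon = {"beautiful": (0.25, 0.75, 0.0), "bad": (0.375, 0.0, 0.625)}

vocab = build_vocab(texts)
rows = [features(t, vocab, translate, lexicon) for t in texts]
print(vocab)
print(rows)
```

The resulting rows could then be passed to any standard classifier; the abstract does not name one, so none is assumed here.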
QUrdPro: Query processing system for Urdu LanguageIJERA Editor
The tremendous increase in multilingual data on the internet has increased the demand for efficient retrieval of information. Urdu is one of the widely spoken and written languages of South Asia. Because of the unstructured format of Urdu-language information, retrieval is a big challenge. Question answering systems aim to retrieve point-to-point answers rather than flooding the user with documents. They are needed when the user seeks in-depth knowledge in a particular domain. When a user needs some information, the system must give the relevant answer. Question-answer retrieval over an ontology knowledge base provides a convenient way to obtain knowledge, but the natural language question needs to be mapped to the query statement of the ontology. This paper describes QUrdPro, a query processing system based on ontology. The system is a combination of NLP and ontology. It makes use of the ontology in several phases for efficient query processing. Our focus is on the knowledge derived from the concepts used in the ontology and the relationships between these concepts. In this paper we describe the architecture of QUrdPro, a query processing system for Urdu, and discuss the process model for the system in detail.
High-quality end-to-end speech translation models rely on large-scale speech-to-text training data,
which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we
propose a target-side data augmentation method for low-resource language speech translation. In particular,
we first generate large-scale target-side paraphrases based on a paraphrase generation model
that incorporates several statistical machine translation (SMT) features and the commonly used
recurrent neural network (RNN) feature. Then, a filtering model consisting of semantic similarity
and speech-word pair co-occurrence is proposed to select the highest-scoring source-paraphrase
pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian,
Slovenian and Swedish paraphrase generation show that the proposed method achieves significant
and consistent improvements over several strong baseline models on PPDB datasets (http://paraphrase.org/).
To introduce the results of paraphrase generation into low-resource speech translation,
we propose two strategies: audio-text pair recombination and multiple-reference training. Experimental
results show that speech translation models trained on new audio-text datasets that combine
the paraphrase generation results lead to substantial improvements over the baselines, especially on
low-resource languages.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
Source and target word segmentation and alignment is a primary step in the statistical learning of a Transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful in the case of dealing with Out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked if this would be true for Indic languages which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English to Telugu transliteration (our method will apply equally well to most written Indic languages); our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum entropy based Transliteration learning system due to Kumaran and Kellner; our testing consisted of using the same segmentation of a test English word, followed by applying the model, and reranking the resulting top 10 Telugu words. We also report the dataset creation and selection since standard datasets are not available.
A decision tree based word sense disambiguation system in manipuri languageacijjournal
This paper manifests a primary attempt on building a word sense disambiguation system in Manipuri
language. The paper discusses related attempts made in the Manipuri language followed by the proposed
plan. A database, consisting of 650 sentences, is collected in Manipuri language in the course of the study.
Conventional positional and context based features are suggested to capture the sense of the words, which
have ambiguous and multiple senses. The proposed work is expected to predict the senses of the
polysemous words with high accuracy with the help of the suitable knowledge acquisition techniques. The
system produces an accuracy of 71.75 %.
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALcsandit
The demand for multilingual information is growing as the number of internet users throughout the world increases, which creates the problem of retrieving documents in one language by specifying a query in another language. This increasing demand can be addressed by designing automatic tools that accept a query in one language and retrieve the relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross Language
Information Retrieval System by applying a dictionary-based approach that enables users to retrieve relevant documents from an Amharic-Arabic corpus by entering the query in Amharic and retrieving the relevant documents in both Amharic and Arabic.
Identification of monolingual and code-switch information from English-Kannad...IJECEIAES
Code-switching is a very common occurrence in social media communication, predominantly found in multilingual countries like India. Using more than one language in communication is known as code-switching or code-mixing. Some of the important applications of code-switching are machine translation (MT), shallow parsing, dialog systems, and semantic parsing. Identifying code-switched and monolingual information is useful for better communication on online networking websites. In this paper, we apply a character level n-gram approach to identify monolingual and code-switched information in English-Kannada social media data. We compare various machine learning techniques, namely naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR) and neural network (NN), on English-Kannada code-switch (EKCS) data. We observe that the character level n-gram approach provides an improvement of 1.8% to 4.1% in accuracy and 1.6% to 3.8% in F1-score. We also observe that the SVC and NN techniques outperform the others in terms of accuracy (97.9%) and F1-score (98%) with character level n-grams.
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...cseij
In this paper we combine our previous research in the field of Semantic web, especially ontology learning and population with Sentence retrieval. To do this we developed a new approach to sentence retrieval
modifying our previous TF-ISF method which uses local context information to take into account only document level information. This is quite a new approach to sentence retrieval, presented for the first time
in this paper and also compared to the existing methods that use information from whole document collection. Using this approach and developed methods for sentence retrieval on a document level it is possible to assess the relevance of a sentence by using only the information from the retrieved sentence’s document and to define a document level OWL representation for sentence retrieval that can be
automatically populated. In this way the idea of Semantic Web through automatic and semi-automatic
extraction of additional information from existing web resources is supported. Additional information is
formatted in OWL document containing document sentence relevance for sentence retrieval.
A Novel approach for Document Clustering using Concept ExtractionAM Publications
In this paper we present a novel approach to extract the concept from a document and cluster such set of documents depending on the concept extracted from each of them. We transform the corpus into vector space by using term frequency–inverse document frequency then calculate the cosine distance between each document, followed by clustering them using K means algorithm. We also use multidimensional scaling to reduce the dimensionality within the corpus. It results in the grouping of documents which are most similar to each other with respect to their content and the genre.
Cross language information retrieval in indian
1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Special Issue: 10 | NCCOTII 2014 | Jun-2014, Available @ http://www.ijret.org 46
CROSS LANGUAGE INFORMATION RETRIEVAL: IN INDIAN LANGUAGE PERSPECTIVE
Pratibha Bajpai1, Dr. Parul Verma2
1Research Scholar, Department of Information Technology, Amity University, Lucknow
2Assistant Professor, Department of Computer Science, Amity University, Lucknow
Abstract
Multilingual information is abundant on the internet these days. The growing diversity of web pages in almost every popular language in the world should enable users to access information in any language of their choice. However, users sometimes find it difficult to write a request in a language that they can easily read and understand. This makes cross-language information retrieval (CLIR) and multilingual information retrieval (MLIR) for Web applications a pressing need. CLIR lets web users retrieve information in any language while posting their queries in their native language. The paper critically analyzes the work of various researchers in the area of Indian language CLIR. We also present our prospective prototype for English to Hindi CLIR and discuss the issues related to English to Hindi translation. We manually tested 30 queries using the suggested prototype and found that the precision level is quite good.
Keywords: Cross lingual Information Retrieval, Query Translation, Sense Disambiguation, English to Hindi Translation.
1. INTRODUCTION
A classic IR system accepts the user's information need in the form of a query and returns the documents that are relevant to that need. With the explosion of knowledge on the web, it became necessary to break the language barriers of monolingual IR systems, so that users can give a query in one language and retrieve documents in different languages. An IR system whose source and target languages differ is called a CLIR system. Cross-Lingual Information Retrieval (CLIR) translates the user query (given in the source language) into the target language, and uses the translated query to retrieve the target language documents. The drive for evaluation of monolingual and cross-lingual retrieval systems started with the Cross-Language Evaluation Forum (CLEF) for European languages and NTCIR for Chinese, Japanese and Korean. Only in the recent past have Indian languages gained importance in evaluation. From 2008, a specific campaign focusing on Indian languages started with the Forum for Information Retrieval Evaluation (FIRE). This resulted in the development of large document collections in some Indian languages such as Bangla, Hindi, Marathi and Tamil. In this paper we provide a brief review of the work done by various researchers on CLIR systems for Indian languages. The paper is organized as follows: section 2 illustrates different techniques used for query translation. A comparative analysis of CLIR approaches from the Indian languages perspective is given in section 3. Section 4 describes our prototype for query translation and sense disambiguation, while section 5 draws the conclusion.
2. DIFFERENT TECHNIQUES FOR CLIR
Based on the translation resources used, three different techniques have been identified in CLIR: dictionary based CLIR, corpus based CLIR and machine translation based CLIR.
2.1 Machine Translation
Machine translation, in simple terms, is a technique that uses software to translate text from one language to another. Machine translation is not only the substitution of words from one language to another; it also involves finding phrases and their counterparts in the target language to produce good quality translations. Machine translation is of three types:
2.1.1 Rule Based Machine Translation
Rule based MT uses linguistic information about the source and target languages. M. Nimaiti and Y. Izumi (2012) developed a Japanese-Uighur machine translation system using a rule based approach. They propose a word-for-word translation system using subject-verb agreement in Uighur. The results are not positive and there is still room for improvement. In the case of Indian languages, R. Rajan et al. (2009) propose a rule based system for translating English sentences to Malayalam by utilizing dependencies from a parser, a POS tagger, transfer link rules for reordering, and rules for morphology.
2.1.2 Statistical Machine Translation
Statistical machine translation generates translations using statistical methods based on bilingual text corpora. Dan Wu
& Daqing He (2010) conducted a series of CLIR experiments using Google Translate for translating queries. Their results show that, with the help of relevance feedback, MT can achieve significant improvement over the monolingual baseline, regardless of whether the queries are short or long. Kraaij & Simard (2003) show experimentally that the web can be used for automatic construction of a parallel corpus, which can then be used to train statistical translation models automatically.
2.1.3 Example Based Machine Translation
Example based MT retrieves similar examples, in the form of source text and its translation, from a set of examples and adapts them to translate a new input. Sato and Nagao (1990) investigated the problem of example selection by approximate matching of input sentences and example sentences, using a similarity measure based on the syntactic similarity of the dependency tree structures of the sentence pair in question and on the word distance of corresponding words, which were predefined in a thesaurus. Sumita et al. (1990) looked into example-based translation of Japanese noun phrases of the pattern [N1 no N2] into English as [N2 prep N1] or [N1 N2], based on a distance measure between the input phrase and the example phrase, calculated as a linear weighted sum of the distances of the three sub-parts, each of which is predefined in a thesaurus.
2.2 Dictionary Based CLIR
The most natural approach to cross-lingual IR is to replace each query term with the most appropriate translations extracted automatically from Machine Readable Dictionaries (MRD). Translation using bilingual dictionaries is simple, but Ballesteros and Croft (1996) and Hull & Grefenstette (1996) claim that it leads to a 40-60% loss in effectiveness compared to monolingual retrieval.
A. Pirkola (2001) asserts that the loss can be due to factors such as search keys that are untranslatable because of limitations in dictionaries, processing of derived or inflected word forms, phrase and compound translation, and lexical ambiguity in the source and target languages. To handle these problems, researchers have used domain specific dictionaries for the dictionary coverage problem (Pirkola, 1998, 1999), stemming and morphological analysis to handle inflected words (Hull, 1996; Krovetz, 1993; Porter, 1990), POS tagging for phrase translation (Ballesteros & Croft, 1997), corpus based query expansion (Ballesteros & Croft, 1998; Nie et al., 1999; Sheridan et al., 1997) and query structuring for the ambiguity problem (Pirkola, 1998, 1999; Sperer & Oard, 2000).
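The term-by-term replacement described in this section can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the toy dictionary, its transliterated entries, the function name `translate_query` and the `strategy` parameter are all hypothetical. Terms missing from the dictionary are passed through unchanged, which is one simple way of exposing the coverage problem noted by Pirkola.

```python
from typing import Dict, List

# Hypothetical toy bilingual dictionary (English -> transliterated Hindi).
# The entries are illustrative placeholders, not taken from the paper.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "bank": ["kinara", "baink"],
    "river": ["nadi"],
}


def translate_query(query: str,
                    dictionary: Dict[str, List[str]],
                    strategy: str = "all") -> List[str]:
    """Replace each query term with its dictionary equivalents.

    strategy="first" keeps only the first listed equivalent of a term;
    strategy="all" keeps every listed equivalent.
    Terms absent from the dictionary are kept as they are.
    """
    if strategy not in ("first", "all"):
        raise ValueError("strategy must be 'first' or 'all'")

    translated: List[str] = []
    for term in query.lower().split():
        equivalents = dictionary.get(term)
        if not equivalents:
            # Untranslatable search key: keep the source term.
            translated.append(term)
        elif strategy == "first":
            translated.append(equivalents[0])
        else:
            translated.extend(equivalents)
    return translated


if __name__ == "__main__":
    print(translate_query("river bank", TOY_DICTIONARY, strategy="first"))
    print(translate_query("river bank", TOY_DICTIONARY, strategy="all"))
```

Keeping all equivalents widens the query and can add noise from wrong senses, while keeping only the first equivalent depends on how the dictionary orders its entries; this is the lexical ambiguity trade-off described above.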
2.3 Corpus Based Cross Lingual Information Retrieval
Corpus based CLIR methods use multilingual terminology derived from parallel or comparable corpora for query translation and expansion. There are two types of corpus:
2.3.1 Parallel Corpus
A parallel corpus is a collection in which texts in one language are aligned with their translations in another language. Several systems have been developed to mine large parallel corpora from the web. Wang and Lin (2010) give a method which first identifies a set of seed URLs and crawls candidate bilingual websites. The obtained pages are cleaned and the bilingual texts are collected to construct comparable corpora. Wang et al. (2004) exploit the bilingual search result pages obtained from a real search engine as a corpus for automatic translation of unknown query terms not included in the dictionary. They propose a PAT-tree based local maxima method for effective extraction of translation candidates. The approach gives excellent results.
2.3.2 Comparable Corpus
A comparable corpus, on the other hand, consists of texts that are not translations of each other but share similar topics, for example newspaper collections written in the same time period in different countries. Sadat Fatiha (2011) exploits the idea of using multilingual encyclopedias such as Wikipedia to extract terms and their translations in order to construct a bilingual ontology or enhance the coverage of existing ontologies. The method shows promising results for any pair of languages. Qian & Meng (2008) expanded a Chinese OOV phrase with its partial English translation and submitted it to a search engine. The translation of OOV words is mined by preprocessing the returned snippets to extract the main text from the web page. The strings obtained are sorted by weighted frequency to output the top n translations of the OOV phrase. The method is reported to obtain translations with high time efficiency and high precision.
3. COMPARATIVE ANALYSIS OF CLIR APPROACHES FOR INDIAN LANGUAGES
Cross-language retrieval is a budding field in India and the work is still at an early stage. Table 1 analyzes the performance of various approaches used by researchers for Indian languages. In many approaches the cross-lingual results are comparable to those of monolingual approaches.
Table 1: Critical Analysis of CLIR for Indian Languages
Languages
Translation
Size of test data/ Performance
Specific Features
English to Hindi A.Seetha , S.Das & M. Kumar (2007)
Select first equivalent/ preferred –n/ random nth equivalent/ all equivalents from
6219 hindi document test collection/ performance of strategy 1,2,3,4 are 64.80%, 57.90%, 11.83% and 57.13% of monolingual retrieval
The four strategies are used to test the system performance on the number of equivalents in the query translation by selecting n equivalents from the list of the dictionary.
3. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Special Issue: 10 | NCCOTII 2014 | Jun-2014, Available @ http://www.ijret.org 48
Bilingual dictionary
Tamil to English S. Saraswathi & A. Siddhiqaa (2010)
Machine translation and Ontological tree
200 documents from the domain “festival”/ relevance improves by 40% for English and 60% for Tamil
A generic platform is built for bilingual IR which can be extended to any foreign or Indian language working with the same efficiency.
English to Hindi
A. Seetha, S. Das, J. Rana & M. Kumar
(2010)
Translation by Shabdanjali dictionary & query expansion by Hindi Wordnet.
Fire 2010 Hindi test collection/ method is not very effective
Query expansion reformulates the initial query by adding some new related words so that query provides a wider coverage than the original query.
English to Malayalam P.L. Nikesh, S.M. Idicula, David Peter (2008)
Bilingual dictionary developed in house.
System proves to be efficient for CLIR
A basic system can be constructed quickly once the linguistic tools become available.
English to Hindi Larkey & Connell (2003)
Probabilistic dictionary derived from parallel corpus
41697 Hindi news articles/ method contributes to effective Hindi retrieval
It combines the ranked lists from the Inquery search and the Language Modeling search to obtain the final ranking of retrieved documents.
English to Hindi & Hindi to English S. Sethuramalingam & V. Varma (2008)
Bilingual Dictionary
English corpus consisted of 125,638 news articles from the Telegraph, Calcutta edition while Hindi corpus consisted of 95215 news articles published in Jagran/ English-Hindi CLIR performance is 58% while Hindi-English CLIR is 25% of the monolingual performance
Disjunctive query formulation using weighted keywords give an overall better performance in both CLIR and Multi Lingual scenario.
Tamil to English
Bilingual dictionary
Web/ The approach used improves the significance of the content retrieved and the overall efficiency of the process
Using summarization techniques and snippet clustering the result closet to user‟s query is displayed.
Bengali & Hindi to English D. Mandal & P. Banerjee (2007)
Machine Translation using Bilingual dictionary
English news corpus of LA Times 2002 containing 135153 documents/ Map for Bengali-English queries is 7.26 & for Hindi-English queries is 4.77
Queries with named entities provided better results as compared to the queries without named entities implying the importance of a very good bilingual lexicon and transliteration tool in CLIR for Indian languages.
Tamil to English D.Thenmozhi & C. Aravindan (2010)
Machine Translation
Agricultural ontology/ Retrieves pages with MAP of 95%
The system exhibits a dynamic learning approach wherein any new word that is encountered in the translation process could be updated to the bilingual dictionary.
Hindi to English R. Udupa & J. Jagarlamudi (2008)
Probabilistic translation lexicon produced by Statistical Machine Learning
Parallel corpus consisting of 100K sentence pairs from the news domain/ Retrieval performance is about 81% of that of monolingual system
Transliteration mining of OOV words from the documents improves the retrieval performance, whereas date restriction hurts it.
Hindi & Telugu to English P. Pingali & V. Varma (2006)
Bilingual Dictionary
English news corpus of LA Times 1995 containing 113005 documents & 56472 documents from Glasgow Herald of 1995/ The system is quite robust
Simple techniques such as dictionary lookup with minimal lemmatization (e.g. suffix removal) are not sufficient for Indian language CLIR.
English to Bangla A. Imam & S. Chowdhury (2011)
SMT using parallel corpus
English to Bangla corpus of approximately 29000 sentences/ NIST & BLEU scores (metrics for evaluating the output of a machine translation system) are 4.6 and 0.39, which is below the standard
Improving corpus quality is about 3 times as effective as increasing the corpus size for English-Bengali SMT.
Tamil to English Pattabhi R.K Rao and Sobha. L (2010)
Bilingual dictionary and ontology
125638 documents from English news magazine “The Telegraph” / Results are encouraging
The system performs well for queries for which the world knowledge has been imparted.
3.1 Observation
Cross-lingual information retrieval for foreign languages such as English, French and Chinese has long been an appealing area for researchers, but Indian languages have attracted attention only in the last decade. The work done so far shows mixed results relative to monolingual retrieval for Indian languages. Anurag Seetha & S. Das (2010) performed translation on the FIRE 2010 Hindi test collection using the Shabdanjali dictionary and query expansion with Hindi WordNet. The method proved ineffective because general dictionaries suffer from low coverage. Larkey and Connell (2003) avoided this problem by using a probabilistic dictionary derived from a parallel corpus for English to Hindi translation and achieved effective cross-lingual retrieval. Pattabhi R.K Rao and Sobha L. (2010) found encouraging results by incorporating a bilingual dictionary and an ontology. Other researchers have made use of machine translation for cross-lingual retrieval. D. Thenmozhi & C. Aravindan (2010) used MT on the agricultural domain and retrieved pages with a MAP of 95%. However, MT systems produce high-quality translations only in limited domains and are expensive: they involve the cost of creating bilingual dictionaries and parallel corpora and of constructing and evaluating the MT system. R. Udupa & J. Jagarlamudi (2008) used a probabilistic translation lexicon produced by statistical machine learning, while A. Imam & S. Chowdhury (2011) used SMT with a parallel corpus for English to Bangla translation. Parallel and comparable corpora are further useful resources for CLIR. Parallel corpora are preferred because they provide more accurate translation knowledge, but because they are scarce, comparable corpora are often used instead. These observations indicate that there is wide scope for research, either improving existing algorithms or developing new ones, to raise the performance of CLIR systems.
4. PROTOTYPE APPROACH
In this section we propose an approach for cross-lingual information retrieval on the web and briefly discuss the components of the proposed design. The major components of the design are preprocessing, query translation, word sense disambiguation and information retrieval. Before discussing these components, we need to understand the grammatical differences between the two languages.
4.1 Grammatical Complexities of English to Hindi Translation
Hindi and English are morphologically different languages. Translating from a morphologically poor language (e.g. English) to a morphologically rich one (e.g. Hindi) is difficult and requires deeper linguistic analysis during translation. The major differences are:
(i) The basic word order in Hindi is Subject-Object-Verb (SOV), as against the SVO word order of English. In Hindi, however, the constituents of a sentence can be moved around freely without affecting the core meaning. For example, the following two sentences convey the same meaning with different word orders:
राम ने सीता को देखा (Ram ne Sita ko dekha)
सीता को राम ने देखा (Sita ko Ram ne dekha)
The identity of Ram as the subject and Sita as the object in both sentences comes from the case markers ने (ne – nominative) and को (ko – accusative).
(ii) Unlike in English, vowel length and vowel nasalization are meaningful in Hindi, e.g. कम (kam) means 'less' and काम (kaam) means 'work'; पूछ (puuch) means 'ask' and पूँछ (puunch) means 'tail'.
(iii) In English, prepositions precede the words to which they relate. In Hindi, such words are called postpositions because they follow the words they govern.
(iv) Hindi is morphologically richer than English. For example, the plural marker in the English word "boys" is translated as ए (e – plural direct) or ओं (on – plural oblique):
The boys went to school. लड़के पाठशाला गये
The boys ate apples. लड़कों ने सेब खाये
Future tense in Hindi is marked on the verb. In the following example, "will go" is translated as जायेंगे (jaaenge), with एंगे (enge) as the future tense marker:
The boys will go to school. लड़के पाठशाला जायेंगे
(v) There are no articles in Hindi. Definiteness of a noun is indicated through pronouns, context or word order.
(vi) All nouns in Hindi are either masculine or feminine. An arbitrary gender is therefore assigned to nouns that are neuter in English, e.g. 'chair' is a feminine noun and 'door' is a masculine noun in Hindi.
4.2 Preprocessing
The first step in any CLIR system is preprocessing of the query terms, which speeds up translation without affecting retrieval quality. Preprocessing consists of tokenization, stop word removal and stemming.
4.2.1 Tokenization
Tokenization recognizes the boundaries between words and isolates those parts of the source query that should be translated.
4.2.2 Stop Word Removal
Stop words are words that carry little significance in search queries and can therefore be removed from the query to improve search performance. They can be removed using a list that contains all stop words.
4.2.3 Stemming
Stemming maps all the different inflected forms of a word to the same stem. For languages such as English, which have weaker inflections, simple stemming algorithms that only remove plural endings can be used. In languages with stronger inflections, several suffixes are joined to the end of the stem one after another. Advanced stemming algorithms can recognize such multiple endings and remove them iteratively. The Porter stemmer and the Snowball stemmer are well-known advanced stemming algorithms.
4.3 Query Translation
In query translation, the given query is converted from the source language to the target language, and the translated query is used to search the collection for documents in the target language. Query translation often suffers from translation ambiguity, and the problem is amplified by the limited context available in short queries. Query translation can be carried out with machine translation, a dictionary-based method or a corpus-based method; these techniques have already been discussed in Section 2. Translating an English query into Hindi is quite complex because the two languages are morphologically different. Out-of-vocabulary (OOV) words remain untranslated even after morphological analysis. Such words can be transliterated into the target language alphabet and added to the final query. Not much work has been done so far by Indian researchers on query translation for this language pair.
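The preprocessing and dictionary-based translation steps described above can be sketched as follows. This is only an illustrative sketch, not the system proposed in this paper: the stop word list, the dictionary `EN_HI_DICT` and the transliteration table `TRANSLIT` are tiny hypothetical stand-ins for real resources, `strip_plural` is a naive plural remover rather than the Porter stemmer, and the first dictionary sense is taken without any disambiguation.

```python
import re

# Hypothetical toy resources for illustration only; a real system would load
# a full stop word list, a bilingual dictionary and a transliteration module.
STOP_WORDS = {"in", "the", "of", "a", "an", "and"}
EN_HI_DICT = {
    "hunger": ["भूख"],
    "strike": ["हड़ताल"],
    "alcohol": ["शराब"],
    "india": ["भारत"],
}
TRANSLIT = {"delhi": "दिल्ली"}  # toy table covering only the example below


def tokenize(query):
    """Split the query into lowercase alphabetic tokens."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", query)]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def strip_plural(token):
    """Naive stemmer that removes only common English plural endings."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def translate_query(query):
    """Preprocess an English query and translate it term by term."""
    tokens = [strip_plural(t) for t in remove_stop_words(tokenize(query))]
    translated = []
    for tok in tokens:
        if tok in EN_HI_DICT:
            # Take the first listed sense; no disambiguation is attempted.
            translated.append(EN_HI_DICT[tok][0])
        else:
            # OOV word: transliterate if possible, else keep the source token.
            translated.append(TRANSLIT.get(tok, tok))
    return translated


if __name__ == "__main__":
    print(translate_query("Hunger strikes in Delhi"))
```

In this sketch the output is a bag of translated terms; it does not reorder words or insert postpositions, so it would not by itself produce a fluent Hindi phrase.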
4.4 Ambiguity Removal in Translated Query
Ambiguity is a common problem in all natural languages: many words carry more than one meaning. For instance, the English noun "plant" can mean a green plant or a factory, and the word "bank" can mean a financial institution or the bank of a river. The correct sense of an ambiguous word can be selected based on the context in which it occurs. The task of automatically assigning the most appropriate meaning to a polysemous word within a given context is called word sense disambiguation. Disambiguation algorithms use a variety of resources and follow different techniques. On the basis of the resources they use and the way they process them, disambiguation techniques can be classified as Knowledge Based Methods (which use Machine Readable Dictionaries, Thesauri and Lexicons), Supervised Learning Methods (Naïve Bayesian Classifier, Exemplar Based Classifier, Lazy Boosting Algorithm), Minimally Supervised Methods and Unsupervised Methods.
4.5 Information Retrieval after Query Translation and Ambiguity Removal
The retrieval system presents the user with a set of documents that match the query. There are three types of retrieval model: Boolean, Vector Space and Probabilistic. In the Boolean model, queries are represented as Boolean expressions, and only those documents that logically match the query are presented to the user. The major drawback of this model is that it only judges whether a document matches or not and does not determine the degree of matching. The other two models present a ranked list of documents ordered by degree of matching. The Vector Space model calculates the degree of matching from the angle between the query vector and each document vector.
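As a rough illustration of the Vector Space idea, the sketch below scores documents by the cosine of the angle between the query vector and each document vector. It assumes raw term-frequency vectors (no tf-idf weighting) and uses hypothetical function names and toy documents; it is not the retrieval component of the proposed system.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between raw term-frequency vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)  # Counter returns 0 for missing terms
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query_terms, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scores = [(doc_id, cosine_similarity(query_terms, terms))
              for doc_id, terms in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy documents, already tokenized; terms may be in any language.
    docs = {
        "d1": ["hunger", "strike", "news"],
        "d2": ["cricket", "news"],
        "d3": ["hunger", "strike"],
    }
    for doc_id, score in rank(["hunger", "strike"], docs):
        print(doc_id, round(score, 3))
```

A smaller angle gives a cosine closer to 1, so documents sharing more of the query terms in similar proportions are ranked higher, while documents with no query terms score 0.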
The Probabilistic model estimates the probability that a document is relevant to the query, on the assumption that this probability depends only on the query and the document representation.
Step by Step Evaluation of CLIR Based on Prototype Approach
The steps of the proposed approach can be explained by considering the following queries:
Query 1: Hunger Strikes
Tokenization- Using the whitespace between words, the tokens obtained from the query are 'Hunger' and 'Strikes'.
Stop Word Removal- No stop words exist in the above query.
Stemming- Next, using the Porter stemmer, the inflected tokens are reduced to their base form. After stemming, the query becomes Hunger strike.
Query Translation- The required translation of the query is 'भूख हड़ताल'.
Ambiguity Removal- Since the translated query is unambiguous, no disambiguation is required.
Precision- The precision of the query is 0.83: 10 of the top 12 retrieved documents, which appeared on the first page, are relevant.
Query 2: Alcohol Consumption in India
Tokenization- The tokens of the above query are 'Alcohol', 'consumption', 'in' and 'India'.
Stop Word Removal- Next, the stop word 'in' is removed using the stop word list given by MIT. The query now becomes Alcohol consumption India.
Stemming- Stemming using the Porter stemmer returns the query as Alcohol consumpt India.
Query Translation- The Hindi translation of the query is 'भारत में शराब की खपत'.
Ambiguity Removal- The Hindi translation 'भारत' is ambiguous, i.e. it has multiple senses. It refers to the country India as well as to the son of Pandu, a Mahabharat character. The correct sense of a word can be identified from the context of the query in which it appears using a disambiguation algorithm.
Precision- The precision of the query is 1.0: all 10 retrieved documents that appeared on the first page are relevant.
The queries were preprocessed and translated manually using tools such as the Porter stemmer and the stop word list by MIT, and the results were positive. Based on the suggested approach, we will formulate an algorithm for English to Hindi query translation for CLIR.
5. CONCLUSIONS
Work on Indian languages has gained impetus in the last decade, and there is much still to be explored in this field. The observations show that there is still scope for improvement in the performance of CLIR.
We expect the proposed prototype system to prove competitive with other existing systems.
REFERENCES
[1]. Ballesteros, L., and Croft, W. B. "Phrasal Translation and Query Expansion Techniques for Cross Language Information Retrieval". In: Proceedings of the 20th International ACM SIGIR Conference on Research and Development in IR, 1997.
[2]. Ballesteros, L., and Croft, W. B. "Resolving ambiguity for cross-language retrieval". In: Proceedings of SIGIR Conference, pages 64-71, 1998.
[3]. Chawre, S. M., and Srikantha Rao. Domain Specific Information Retrieval in Multilingual Environment, International Journal of Recent Trends in Engineering, 2, 4, 179-181, 2009.
[4]. Chinnakotla Kumar Manoj, Ranadive Sagar, Bhattacharyya Pushpak and Damani P. Om. "Hindi and Marathi to English Cross Language Information Retrieval at CLEF 2007". In: Working Notes of CLEF 2007.
[5]. David A. Hull and Gregory Grefenstette. Querying across languages: A dictionary-based approach to multilingual information retrieval. In: Proceedings of the 19th International Conference on Research and Development in Information Retrieval, pages 49-57, 1996.
[6]. Saraswathi, S., Asma Siddhiqaa, M., Kalaimagal, K., and Kalaiyarasi, M. BiLingual Information Retrieval System for English and Tamil, Journal of Computing, 2, 4, 85-89, April 2010.
[7]. Grefenstette, G. (1998b). The problem of cross-language information retrieval. In Grefenstette (1998a), pages 1-9.
[8]. Hsu Hung Ming, Tsai Feng Ming, and Hsin-Hsi Chen. Query Expansion with ConceptNet and WordNet: An Intrinsic Comparison. In: AIRS 2006, LNCS 4182, (2006) 1-13.
[9]. Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. Query Expansion by Mining User Logs. IEEE Transactions on Knowledge and Data Engineering, Vol. 15(4), 2003.
[10]. Hiemstra, D., and De Jong, F. 1999. "Disambiguation strategies for cross-language information retrieval". In: Proceedings of the Third European Conference on Research and Advanced Technology for Digital Libraries, 274-293.
[11]. Jagadeesh Jagarlamudi and Kumaran, A. Cross-Lingual Information Retrieval System for Indian Languages, Proceedings of CLEF 2007, 2007.
[12]. Kishida, K. (2005). Technical issues of cross-language information retrieval: a review. Inf. Process. Manage., 41(3):433-455.
[13]. Nakazawa, S., Ochiai, T., Satoh, K., and Okumura, A. Cross language Information Retrieval based on Comparable Corpora. In: Proceedings of the first NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition (NTCIR1), 1999.
[14]. Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K., and Järvelin, K. (2003). Fuzzy translation of cross-lingual spelling variants. In SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 345-352, New York, NY, USA. ACM.
[15]. Pattabhi R. K. Rao, and Sobha, L. Cross Lingual Information Retrieval Track: Tamil – English, Working notes from FIRE 2010, Feb 2010.
[16]. Prasad Pingali and Vasudeva Varma. "IIIT Hyderabad at CLEF 2007 - Adhoc Indian Language CLIR task". In: CLEF-2007, Cross Language Evaluation Forum 2007 Workshop at Budapest, Hungary.
[17]. Pingali, P., and Varma, V. "Hindi and Telugu to English Cross Language Information Retrieval", Cross Language Evaluation Forum (CLEF), 2006.
[18]. Sperer, R., and Oard, D. 2000. "Structured query translation for cross-language information retrieval". In: Proceedings of the ACM SIGIR Conference. ACM, New York, 2000.
[19]. Seetha Anurag, Das Sujoy, Kumar M. Evaluation of the English-Hindi Cross Language Information Retrieval System Based on Dictionary Based Query Translation Method. In: Proceedings of 10th International Conference on Information Technology (ICIT2007). Available at http://doi.ieeecomputersociety.org/10.1109/ICIT.2007.40.
[20]. Thenmozhi, D., and Aravindan, C. Tamil-English Cross Lingual Information Retrieval System for Agriculture Society, International Forum for Information Technology in Tamil Conference, October 2009.