Machine translation is a challenging problem for Indian languages. New machine translators appear every day, but high-quality automatic translation is still a very distant goal, and a correctly translated Hindi sentence is rarely produced. In this paper we focus on the English-Hindi language pair and, in order to preserve the correct MT output, present a ranking system that employs machine learning techniques and morphological features. The ranking requires no human intervention. We have also validated our results by comparing them with human rankings.
Part-of-speech tagging in Indian languages is still an open problem, and we still lack a clear approach to implementing a POS tagger for them. In this paper we describe our efforts to build a Hidden Markov Model based part-of-speech tagger. We used the IL POS tag set for the development of this tagger and achieved an accuracy of 92%.
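A Hidden Markov Model tagger of the kind described above is typically decoded with the Viterbi algorithm. The sketch below uses an invented two-tag model with toy probabilities (not the IL tag set or the paper's trained parameters):

```python
# Minimal Viterbi decoder for an HMM POS tagger. The tag set and all
# probabilities below are invented for illustration only.
def viterbi(words, tags, start_p, trans_p, emit_p):
    # V[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            prob, prev = max(
                (V[i-1][p] * trans_p[p][t] * emit_p[t].get(words[i], 1e-6), p)
                for p in tags)
            V[i][t] = prob
            back[i][t] = prev
    # backtrack from the best final tag
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.8, "VB": 0.2}}
emit_p = {"NN": {"dog": 0.5, "runs": 0.1}, "VB": {"dog": 0.05, "runs": 0.6}}
print(viterbi(["dog", "runs"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB']
```

A real tagger would estimate `start_p`, `trans_p`, and `emit_p` from a tagged training corpus rather than hand-code them.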
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
The document proposes using part-of-speech tagging and stemming to improve Gujarati to Hindi machine translation through transliteration. It presents a system that applies stemming and POS tagging to Gujarati text before transliterating to resolve ambiguities. An evaluation of the system on 500 sentences found that transliteration and translation matched for 54.48% of Gujarati words, and overall transliteration efficiency was 93.09%. The approach aims to improve over direct transliteration for the highly inflected Gujarati language.
This document presents an efficient rule-based system for morphological parsing of the Tamil language. It discusses the agglutinative nature of Tamil morphology and the need for morphological analysis in applications such as machine translation. The proposed system uses a combination of rule-based and machine learning approaches to analyze Tamil words and identify their root forms and inflections. It was implemented using resources like the EMILLE corpus and Tamil WordNet and allows for morphological parsing of Tamil texts.
Tamil-English Document Translation Using Statistical Machine Translation Appr...
The paper presents a new method for translating a text document from Tamil to English. The method is based on the statistical machine translation (SMT) approach combined with morphological analysis, since Tamil is a highly inflected language. The paper presents a slight modification to SMT that makes the approach more efficient and effective, and the experimental results have shown the method to be fast and accurate in the translation process.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5000 languages in the world, and knowing all of them is not a solution to the language barrier in communication. In this multilingual world, with huge amounts of information exchanged between regions and languages in digitized form, it has become necessary to find an automated process to convert from one language to another. Natural Language Processing (NLP) is an active area of research that explores how computers can be used to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterating proper nouns from Punjabi to Hindi is developed, combining direct mapping, a rule-based approach, and statistical machine translation (SMT). The proposed system was tested on proper nouns from different domains and achieved very good accuracy.
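The direct-mapping component of such a hybrid transliterator can be sketched as a character table from Gurmukhi to Devanagari. The table below is a small illustrative fragment, not the system's full mapping; unmatched characters are passed through unchanged so a later rule-based or SMT stage could handle them:

```python
# Direct-mapping stage of a hybrid Punjabi-to-Hindi transliterator (sketch).
# The mapping table is a tiny illustrative fragment, not the full system.
GURMUKHI_TO_DEVANAGARI = {
    "ਕ": "क", "ਮ": "म", "ਲ": "ल", "ਰ": "र", "ਨ": "न",
    "ਾ": "ा", "ਿ": "ि",
}

def direct_map(word):
    # Map each Gurmukhi character; pass unknown characters through unchanged.
    return "".join(GURMUKHI_TO_DEVANAGARI.get(ch, ch) for ch in word)

print(direct_map("ਕਮਲ"))  # the proper noun "Kamal" rendered in Devanagari
```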
Survey on Indian CLIR and MT Systems in Marathi Language
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query, which helps users express their information need in their native languages. The machine-translation-based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
NAMED ENTITY RECOGNITION IN NATURAL LANGUAGES USING TRANSLITERATION
Transliteration may be defined as the process of mapping the sounds of a text written in one language into another language. This paper discusses transliteration and its use in named entity recognition. We have implemented a program that performs transliteration and assists in the process of named entity recognition, and we present some results of named entity recognition (NER) using transliteration.
HANDLING CHALLENGES IN RULE BASED MACHINE TRANSLATION FROM MARATHI TO ENGLISH
Machine translation has been pursued by researchers for a long time. However, a flawless machine translator remains a dream, and only a small number of researchers have focused on translating Marathi text to English. Perfect machine translation systems have not yet been built because languages differ syntactically as well as morphologically. The majority of researchers have opted for statistical machine translation, whereas in this paper we address the challenges of rule-based machine translation. The paper describes the major divergences observed between Marathi and English, the challenges encountered while attempting to build a rule-based machine translation system from Marathi to English, and rules to handle these challenges. Because there are exceptions to the rules and limits to the feasibility of maintaining a knowledge base, practical machine translation from Marathi to English is a complex task.
A Review on a Web Based Punjabi to English Machine Transliteration System
This document summarizes a research paper on developing a Punjabi to English machine transliteration system using statistical machine translation. It discusses how existing transliteration systems between other languages use rule-based or hybrid approaches and have accuracies ranging from 73% to 95%. The proposed system aims to increase accuracy by using statistical machine translation techniques to learn from existing transliterated data and select the most probable transliteration when multiple options exist. It will help translate documents in the Punjabi language, which is official in Punjab, into English for international understanding.
PART OF SPEECH TAGGING OF MARATHI TEXT USING TRIGRAM METHOD
This document describes the development of a part-of-speech tagger for Marathi text using a trigram statistical approach. The trigram method assigns POS tags to words based on the probabilities of tag transitions given the previous two tags. The tagger was evaluated on a test corpus of 2000 sentences and achieved an accuracy of 91.63%. Future work will aim to improve accuracy by expanding the training corpus with more tagged sentences. The document also provides background on previous work developing POS taggers for other Indian languages and challenges in tagging morphologically rich languages like Marathi.
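The trigram model's core quantity, P(t3 | t1, t2), can be estimated by maximum likelihood from a tagged corpus as the count of each tag trigram over the count of its tag bigram history. The sketch below uses an invented two-sentence tag corpus, not the paper's Marathi training data:

```python
# Maximum-likelihood estimation of trigram tag-transition probabilities.
# The tiny "corpus" of tag sequences is invented for illustration.
from collections import Counter

def trigram_probs(tag_sequences):
    tri, bi = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + tags   # sentence-start padding
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            tri[(t1, t2, t3)] += 1
            bi[(t1, t2)] += 1
    # P(t3 | t1, t2) = count(t1, t2, t3) / count(t1, t2)
    return {k: c / bi[k[:2]] for k, c in tri.items()}

corpus = [["NN", "VB", "NN"], ["NN", "VB", "JJ"]]
p = trigram_probs(corpus)
print(p[("NN", "VB", "NN")])  # 0.5: after NN VB, half the corpus continues with NN
```

A real tagger would smooth these estimates (e.g. by backing off to bigram and unigram counts) to handle unseen trigrams.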
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
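The feature extraction step for such an SVM classifier might look as follows. The segmentation into phonetic units and the exact feature template are assumptions for illustration, not the paper's specification:

```python
# Sketch of per-unit feature extraction for SVM-based transliteration:
# each phonetic unit is described by its own character bigrams plus its
# neighbouring units. Segmentation and template are illustrative assumptions.
def char_ngrams(s, n):
    return [s[i:i+n] for i in range(len(s) - n + 1)]

def unit_features(units, i):
    feats = {"unit=" + units[i]}
    feats.update("bigram=" + g for g in char_ngrams(units[i], 2))
    if i > 0:
        feats.add("prev=" + units[i-1])
    if i < len(units) - 1:
        feats.add("next=" + units[i+1])
    return sorted(feats)

# A name segmented into (assumed) phonetic units, romanized for readability.
units = ["sa", "chi", "n"]
print(unit_features(units, 1))
```

Each feature set would then be vectorized and fed to an SVM that labels the unit with its English transliteration.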
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language-processing support specific to various languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
This paper proposes a semi-supervised learning approach to detect jargon words in text. It handles jargon words directly in the text as well as abbreviated forms like sounds-alike words. It uses a sliding window technique to detect suspicious words that partially match jargon words. A learning methodology assigns probabilities to suspicious words based on the concept derived from the text and stores them with a counter. Words are marked as jargon when the probability passes a threshold.
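The sliding-window matching and threshold logic described above might be sketched as follows. The window size, lexicon, and running-average probability update are illustrative assumptions, not the paper's exact method:

```python
# Sketch of sliding-window jargon detection with a per-word probability
# counter. Lexicon, window size, and the averaging rule are assumptions.
from collections import defaultdict

def partial_match(word, jargon, win=3):
    # does any length-`win` window of `word` occur inside `jargon`?
    return any(word[i:i+win] in jargon for i in range(max(1, len(word) - win + 1)))

class JargonDetector:
    def __init__(self, lexicon, threshold=0.5):
        self.lexicon = lexicon
        self.threshold = threshold
        self.score = defaultdict(float)   # running probability per word
        self.count = defaultdict(int)     # how often each word was seen

    def observe(self, word, context_prob):
        # context_prob: probability derived from concept analysis (assumed given)
        if any(partial_match(word, j) for j in self.lexicon):
            self.count[word] += 1
            n = self.count[word]
            # incremental running average of the context-derived probabilities
            self.score[word] += (context_prob - self.score[word]) / n
        return self.score[word] >= self.threshold

det = JargonDetector({"phreak"}, threshold=0.5)
det.observe("phreaker", 0.4)              # suspicious, but below threshold
print(det.observe("phreaker", 0.8))       # True: running average now passes 0.5
```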
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
The proposed approach deals with detecting jargon words in electronic data across communication media such as the internet and mobile services. In real life, however, jargon words are not always used in their complete forms; most often they appear in abbreviated forms such as sounds-alike forms, taboo morphemes, etc. The proposed approach detects these abbreviated forms as well, using a semi-supervised learning methodology that derives the probability of a suspicious word being a jargon word from synset and concept analysis of the text.
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer
This document summarizes an implementation of an English-text to Marathi-speech synthesizer. The synthesizer uses a unit selection approach based on concatenative synthesis to produce natural sounding Marathi speech from English text input. Over 28,000 Marathi syllables, words and sentences were recorded from a female speaker and used to create the speech corpus. Formant frequencies (F1, F2, F3) were analyzed from the synthesized speech using MATLAB and PRAAT tools to evaluate the quality and naturalness of the output.
This paper presents a machine translation system that translates simple assertive English sentences to Marathi sentences. The system performs morphological analysis, part-of-speech tagging, and local word grouping to convert the meaning of the English sentence to the corresponding Marathi sentence. An English to Marathi bilingual dictionary is used for translation. The system aims to help people with primary education understand English words by providing translations to their native Marathi language.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
In recent decades speech-interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large speech corpus. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach, which requires a large speech corpus with sentence- or phoneme-level transcription of the utterances; the transcriptions must also cover varied speech so that the recognizer can build models for all the sounds present. For Telugu, however, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu: phrases comprising several words (tokens) in English map onto a single word in Telugu. Telugu is phonetic in nature as well as rich in morphology, which is why speech technology developed for English cannot be applied to it directly. This paper highlights the work carried out in building a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition-enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a substantial reduction in corpus size. It also adapts Telugu words to the database dynamically, resulting in growth of the corpus.
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS
Globalization and the growth of Internet users demand that almost all internet-based applications support local languages. Support for local languages can be provided in all internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, while probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
Design and Development of a Malayalam to English Translator- A Transfer Based...
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module, and a bilingual dictionary. All the modules are morpheme based to reduce dictionary size. The system does not rely on a stochastic approach; it is based on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. It uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. The system is designed using artificial intelligence techniques.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Identifying verb phrases within a sentence is useful for a variety of NLP applications, and morphological analysis is one of the core enabling technologies they require. This paper presents a Myanmar verb-phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum-matching approach. Machine translation needs a large amount of information to guide the translation process, and Myanmar is an inflected language for which very few lexicons have been created, compared with languages such as English, French, and Czech. Therefore, this system proposes a Myanmar verb-phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results show that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
This document describes a rule-based machine translation system for translating English text to Telugu. It discusses the challenges of developing such a system, including differences in grammar between the two languages. An algorithm is proposed that uses rules, probabilities, and rough sets to classify sentences and select the best word translations. The system works by tokenizing English sentences, tagging the words with parts of speech, looking up word translations in a bilingual dictionary, and concatenating the Telugu words to form the output sentence.
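The lookup pipeline described above (tokenize, tag, look up each word in a bilingual dictionary, concatenate) can be sketched as follows. The mini-dictionary and the placeholder tagger are invented for illustration; the paper's rules, probabilities, and rough-set classification are not reproduced here:

```python
# Sketch of a dictionary-lookup translation pipeline: tokenize, POS-tag,
# look up (word, tag) pairs in a bilingual dictionary, join the output.
# The dictionary entries and the toy tagger are illustrative placeholders.
DICT = {("i", "PRP"): "నేను", ("go", "VB"): "వెళ్తాను"}

def tag(tokens):
    # Placeholder tagger; a real system would use a trained POS tagger.
    pos = {"i": "PRP", "go": "VB"}
    return [(t, pos.get(t, "NN")) for t in tokens]

def translate(sentence):
    tokens = sentence.lower().split()
    # Fall back to the source word when no dictionary entry exists.
    out = [DICT.get(pair, pair[0]) for pair in tag(tokens)]
    return " ".join(out)

print(translate("I go"))
```

A real English-to-Telugu system would also reorder constituents, since Telugu follows SOV order while English follows SVO.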
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
Named Entity Recognition is the task of recognizing named entities, or proper nouns, in a document and classifying them into different named entity classes. In this paper we introduce our modified tool, which not only performs Named Entity Recognition (NER) in any natural language and assists in corpus development (i.e., building training and testing documents), but also solves the unknown-word problem in NER, handles spurious words, and automatically computes the performance metrics for an NER-based system: recall, precision, and F-measure.
This document describes a factored statistical machine translation system from English to Tamil that incorporates Tamil morphology. The system first reorders and factors the English text, then uses morphological analysis and generation tools for Tamil to further factorize the text. This addresses challenges of translating between languages with different morphological structures and word orders. The system was shown to improve over a baseline SMT system for English to Tamil translation by integrating linguistic information like lemmas and morphological features.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
Manipuri is both a minority and a morphologically rich language with genetic features similar to Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology, and is monosyllabic; morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass e-dictionaries, morphological analyzers, reduplicated multi-word expressions (RMWE), named entity recognition (NER), part-of-speech (POS) tagging, machine translation (MT), WordNet, word sense disambiguation (WSD), etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, together with a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
This research paper presents an approach to converting text to speech using a new methodology. The text-to-speech conversion system enables the user to enter text in Marathi and produces sound as output. The paper presents the steps followed for converting Marathi text to speech and the algorithm used. The focus of this paper is the tokenisation process and the orthographic representation of the text, which maps letters to sounds using a description of the language's phonetics. The main focus is on the text-to-IPA transcription concept: the system translates text to IPA transcription, which is the primary stage of text-to-speech conversion. The whole procedure for converting text to speech takes a great deal of time, as it is not an easy task and requires effort.
This paper deals with chunking of the Manipuri language, which is highly agglutinative in nature. The Manipuri text is first cleaned to the gold standard and then processed for part-of-speech (POS) tagging using Conditional Random Fields (CRF). The output file is treated as the input to the CRF-based chunking system, and the final output is fully chunk-tagged Manipuri text. The system shows a recall of 71.30%, a precision of 77.36%, and an F-measure of 74.21%.
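The reported F-measure follows from the stated precision and recall via the standard harmonic-mean formula, F = 2PR / (P + R):

```python
# Verify that the reported precision and recall yield the reported F-measure.
def f_measure(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

p, r = 0.7736, 0.7130
print(round(100 * f_measure(p, r), 2))  # 74.21, matching the figure above
```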
A Review on a Web Based Punjabi to English Machine Transliteration System
The paper presents the transliteration of noun phrases from Punjabi to English using a statistical machine translation approach. Transliteration maps the letters of the source script to the letters of another language. Forward transliteration converts an original word or phrase in the source language into a word in the target language; backward transliteration is the reverse process, converting the transliterated word or phrase back into its original form. Transliteration is an important part of research in NLP. Natural Language Processing (NLP) is the ability of a computer program to understand human speech as it is spoken, and it is an important component of AI, the branch of science concerned with helping machines find solutions to complex problems in a human-like fashion. The transliteration system was developed using Statistical Machine Translation (SMT), a data-oriented statistical framework for translating text from one natural language to another.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
Stemming is the process of term conflation: it conflates all the variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
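A minimal longest-match suffix-stripping stemmer, the simplest technique such surveys cover, can be sketched as follows. The suffix list is a tiny romanized illustrative fragment, not a complete stemmer for any Indic language:

```python
# Longest-match suffix-stripping stemmer (sketch). The suffix list is a
# small romanized Hindi-like fragment, invented for illustration only.
SUFFIXES = ["iyan", "iyon", "on", "en", "i", "e", "a"]

def stem(word, min_stem=2):
    # Try the longest suffix first; keep at least `min_stem` stem characters.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:-len(suf)]
    return word

print(stem("ladkiyan"))  # "ladk": same stem as "ladka" and "ladki"
```

Conflating "ladka", "ladki", and "ladkiyan" to one stem is exactly the behaviour an IR system relies on when matching a query against inflected document text.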
A Review on a web based Punjabi t o English Machine Transliteration SystemEditor IJCATR
This document summarizes a research paper on developing a Punjabi to English machine transliteration system using statistical machine translation. It discusses how existing transliteration systems between other languages use rule-based or hybrid approaches and have accuracies ranging from 73% to 95%. The proposed system aims to increase accuracy by using statistical machine translation techniques to learn from existing transliterated data and select the most probable transliteration when multiple options exist. It will help translate documents in the Punjabi language, which is official in Punjab, into English for international understanding.
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODijait
This document describes the development of a part-of-speech tagger for Marathi text using a trigram statistical approach. The trigram method assigns POS tags to words based on the probabilities of tag transitions given the previous two tags. The tagger was evaluated on a test corpus of 2000 sentences and achieved an accuracy of 91.63%. Future work will aim to improve accuracy by expanding the training corpus with more tagged sentences. The document also provides background on previous work developing POS taggers for other Indian languages and challenges in tagging morphologically rich languages like Marathi.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specie to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
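Word-final schwa deletion, one of the rules the Marathi G2P handles, can be sketched as below. The phone list and the single rule are a simplification for illustration; the converter described above also treats medial schwas and compound-word boundaries.

```python
def delete_final_schwa(phones):
    """Drop the inherent schwa ('a') at the end of a word, the simplest of
    the schwa-deletion rules a Marathi G2P must apply. This sketch handles
    only the word-final case."""
    if len(phones) > 2 and phones[-1] == "a":
        return phones[:-1]
    return phones

# Letter by letter, Devanagari spells /k a m a l a/; the pronunciation
# drops the final inherent vowel, giving /k a m a l/.
pron = delete_final_schwa(["k", "a", "m", "a", "l", "a"])
```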
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcscpconf
This paper proposes a semi-supervised learning approach to detect jargon words in text. It handles jargon words directly in the text as well as abbreviated forms like sounds-alike words. It uses a sliding window technique to detect suspicious words that partially match jargon words. A learning methodology assigns probabilities to suspicious words based on the concept derived from the text and stores them with a counter. Words are marked as jargon when the probability passes a threshold.
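The partial-match-plus-counter idea in the abstract can be sketched as follows. The seed lexicon, similarity threshold and evidence threshold are invented for illustration; the paper slides a window over the text, while this sketch simply checks each token against the lexicon with a fuzzy match.

```python
from difflib import SequenceMatcher
from collections import defaultdict

JARGON = {"damn", "hell"}          # hypothetical seed lexicon
suspect_counts = defaultdict(int)  # evidence counter per suspicious word
THRESHOLD = 2                      # mark as jargon once evidence reaches this

def scan(text):
    """Flag tokens that partially match a known jargon entry (e.g. the
    sound-alike abbreviation 'dam' for 'damn'). A suspicious word is only
    marked once its counter passes the threshold."""
    flagged = []
    for token in text.lower().split():
        for j in JARGON:
            if SequenceMatcher(None, token, j).ratio() >= 0.75:
                suspect_counts[token] += 1
                if suspect_counts[token] >= THRESHOLD:
                    flagged.append(token)
    return flagged

scan("what the dam")          # first sighting: counted, not yet flagged
hits = scan("oh dam again")   # second sighting crosses the threshold
```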
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcsandit
The proposed approach deals with the detection of jargon words in electronic data in different communication mediums such as the Internet, mobile services, etc. In real life, however, jargon words are not always used in their complete word forms. Most of the time, those words are used in different abbreviated forms such as sound-alike forms, taboo morphemes, etc. The proposed approach detects those abbreviated forms as well, using a semi-supervised learning methodology. This learning methodology derives the probability of a suspicious word being a jargon word through synset and concept analysis of the text.
Implementation of English-Text to Marathi-Speech (ETMS) SynthesizerIOSR Journals
This document summarizes an implementation of an English-text to Marathi-speech synthesizer. The synthesizer uses a unit selection approach based on concatenative synthesis to produce natural sounding Marathi speech from English text input. Over 28,000 Marathi syllables, words and sentences were recorded from a female speaker and used to create the speech corpus. Formant frequencies (F1, F2, F3) were analyzed from the synthesized speech using MATLAB and PRAAT tools to evaluate the quality and naturalness of the output.
This paper presents a machine translation system that translates simple assertive English sentences to Marathi sentences. The system performs morphological analysis, part-of-speech tagging, and local word grouping to convert the meaning of the English sentence to the corresponding Marathi sentence. An English to Marathi bilingual dictionary is used for translation. The system aims to help people with primary education understand English words by providing translations to their native Marathi language.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover different speech orders so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu; phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with the capability of automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a greatly reduced corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
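The "automatic term suggestion" over a dynamically growing word list can be sketched with a sorted list and prefix lookup. The romanized Telugu words below are placeholders for illustration, and the sorted-list design is an assumption; the paper does not specify its data structure.

```python
import bisect

class TermSuggester:
    """Prefix-based term suggestion over a growing word list, a minimal
    stand-in for the editor's term-suggestion feature. Newly recognized
    word forms are inserted dynamically, so the corpus grows with use."""
    def __init__(self, words):
        self.words = sorted(set(words))
    def add(self, word):
        if word not in self.words:
            bisect.insort(self.words, word)  # keep the list sorted
    def suggest(self, prefix, k=3):
        i = bisect.bisect_left(self.words, prefix)
        out = []
        while i < len(self.words) and self.words[i].startswith(prefix):
            out.append(self.words[i])
            i += 1
            if len(out) == k:
                break
        return out

s = TermSuggester(["amma", "annam", "illu"])
s.add("ammamma")        # a newly recognized form grows the corpus
hits = s.suggest("am")  # ['amma', 'ammamma']
```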
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELSijnlc
Globalization and the growth of Internet users demand that almost all Internet-based applications support local languages. Support for local languages can be given in all Internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that the linguistic approach provides better results for closely related languages, and probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
Design and Development of a Malayalam to English Translator- A Transfer Based...Waqas Tariq
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module and a bilingual dictionary. All the modules are morpheme based to reduce dictionary size. The system does not rely on a stochastic approach; it is based on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. The system uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. The system is designed using artificial intelligence techniques.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications, and one of the core enabling technologies required for them is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and there are very few lexicon resources for Myanmar compared to other languages such as English, French and Czech. Therefore, this system proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results showed that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
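The maximum-matching step can be sketched as a greedy longest-match segmentation against a phrase lexicon. The English tokens and lexicon entries below are illustrative placeholders, not real Myanmar phrases or the paper's rules.

```python
def max_match(tokens, lexicon):
    """Greedy longest-match segmentation: at each position, take the
    longest token span found in the lexicon; if nothing matches, emit
    the single token and move on."""
    i, phrases = 0, []
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try longest span first
            cand = " ".join(tokens[i:j])
            if cand in lexicon or j == i + 1:
                phrases.append(cand)
                i = j
                break
    return phrases

lex = {"is going", "to school"}
out = max_match("he is going to school".split(), lex)
# ['he', 'is going', 'to school']
```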
This document describes a rule-based machine translation system for translating English text to Telugu. It discusses the challenges of developing such a system, including differences in grammar between the two languages. An algorithm is proposed that uses rules, probabilities, and rough sets to classify sentences and select the best word translations. The system works by tokenizing English sentences, tagging the words with parts of speech, looking up word translations in a bilingual dictionary, and concatenating the Telugu words to form the output sentence.
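The pipeline the abstract describes (tokenize, tag, dictionary lookup, concatenate) can be sketched as below. The POS table, the romanized Telugu entries and the simple verb-final reordering are all invented for illustration; the actual system uses rules, probabilities and rough sets to pick among candidate translations.

```python
# Tiny illustrative tables; a real system would use a full POS tagger
# and a large bilingual dictionary.
POS = {"rama": "NOUN", "eats": "VERB", "rice": "NOUN"}
EN2TE = {  # romanized Telugu placeholders
    ("rama", "NOUN"): "ramudu",
    ("eats", "VERB"): "tintadu",
    ("rice", "NOUN"): "annam",
}

def translate(sentence):
    tokens = sentence.lower().split()                  # 1. tokenize
    tagged = [(w, POS.get(w, "UNK")) for w in tokens]  # 2. POS tag
    # 3. dictionary lookup, moving verbs to the end (English SVO -> Telugu SOV)
    nouns = [EN2TE.get((w, t), w) for w, t in tagged if t != "VERB"]
    verbs = [EN2TE.get((w, t), w) for w, t in tagged if t == "VERB"]
    return " ".join(nouns + verbs)                     # 4. concatenate

out = translate("Rama eats rice")  # 'ramudu annam tintadu'
```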
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOLijfcstjournal
Named Entity Recognition is the task of recognizing named entities or proper nouns in a document and then classifying them into different categories of named entity classes. In this paper we introduce our modified tool that not only performs Named Entity Recognition (NER) in any natural language and assists in corpus development (i.e., building training and testing documents), but also solves the unknown-word problem in NER, handles spurious words and automatically computes performance metrics for an NER-based system: Recall, Precision and F-Measure.
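The decoding at the heart of an HMM-based NER tagger is the Viterbi algorithm. The sketch below uses a two-state model (O vs. PER) with hand-set probabilities that are purely illustrative, not the tool's actual parameters.

```python
states = ["O", "PER"]
start = {"O": 0.8, "PER": 0.2}
trans = {"O": {"O": 0.7, "PER": 0.3}, "PER": {"O": 0.6, "PER": 0.4}}
emit = {  # P(word | state); unseen words get a small floor probability
    "O":   {"met": 0.6, "ram": 0.05, "i": 0.35},
    "PER": {"met": 0.01, "ram": 0.9, "i": 0.09},
}

def viterbi(words):
    """Return the most probable state sequence for the word sequence."""
    V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][best_prev] * trans[best_prev][s] * emit[s].get(w, 1e-6)
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):     # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = viterbi(["i", "met", "ram"])  # ['O', 'O', 'PER']
```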
This document describes a factored statistical machine translation system from English to Tamil that incorporates Tamil morphology. The system first reorders and factors the English text, then uses morphological analysis and generation tools for Tamil to further factorize the text. This addresses challenges of translating between languages with different morphological structures and word orders. The system was shown to improve over a baseline SMT system for English to Tamil translation by integrating linguistic information like lemmas and morphological features.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
Manipuri is both a minority and morphologically rich language with genetic features similar to Tibeto Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language
Processing (NLP) is a useful research field of computer science that deals with processing of a large amount of natural language corpus. The NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech
(POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD), etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, presenting a comparison table of the approaches and techniques adopted and the results obtained for each of the applications, followed by a detailed discussion of each work.
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...IJERA Editor
This research paper presents the approach towards converting text to speech using new methodology. The text to
speech conversion system enables user to enter text in Marathi and as output it gets sound. The paper presents
the steps followed for converting text to speech for Marathi language and the algorithm used for it. The focus of
this paper is based on the tokenisation process and the orthographic representation of the text that shows the
mapping of letter to sound using the description of language’s phonetics. Here the main focus is on the text to
IPA transcription concept. It is, in fact, a system that translates text to IPA transcription, which is the primary
stage of text-to-speech conversion. The whole procedure for converting text to speech is time-consuming and
requires considerable effort.
This paper deals with chunking of the Manipuri language, which is highly agglutinative in nature. The
Manipuri text is first cleaned to gold standard. The text is then processed for Part of Speech (POS)
tagging using Conditional Random Fields (CRF). The output file is treated as the input file for the
CRF-based chunking system. The final output is a completely chunk-tagged Manipuri text. The system
shows a recall of 71.30%, a precision of 77.36% and an F-measure of 74.21%.
A Review on a web based Punjabi to English Machine Transliteration SystemEditor IJCATR
The paper presents the transliteration of noun phrases from Punjabi to English using a statistical machine
translation approach. Transliteration maps the letters of the source script to the letters of another language.
Forward transliteration converts an original word or phrase in the source language into a word in the target
language. Backward transliteration is the reverse process, converting the transliterated word or phrase back
into its original form. Transliteration is an important part of research in NLP. Natural Language Processing
(NLP) is the ability of a computer program to understand human speech as it is spoken, and is an important
component of AI. Artificial Intelligence is a branch of science that deals with helping machines find solutions
to complex problems in a human-like fashion. The transliteration system is to be developed using SMT.
Statistical Machine Translation (SMT) is a data-oriented statistical framework for translating text from one
natural language to another based on the knowledge
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation. It conflates all the word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing to be done, and stemming is one of the important building blocks for all these applications. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers and existing stemmers for Indic languages.
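The most common technique such stemmers use, longest-suffix-first stripping, can be sketched as follows. The suffix list is a tiny illustrative sample for a romanized Indic language, not any published stemmer's rule set.

```python
# Try the longest suffixes first so that e.g. "iyan" wins over bare "an".
SUFFIXES = sorted(["on", "om", "iyan", "an", "e", "i"], key=len, reverse=True)

def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem
    characters of stem to limit over-stemming."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

stem("ladkiyan")  # 'ladk'
stem("ladke")     # 'ladk', conflated to the same stem
```

The `min_stem` guard illustrates the over-stemming vs. under-stemming trade-off such stemmers must balance.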
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
This document describes the development of a Rule-Based Machine Translation system between Assamese and English using the Apertium platform. It discusses the files and tools used, including monolingual dictionaries for Assamese and English, a bilingual dictionary, and transfer rules. The methodology section outlines the Apertium architecture and modules, and describes how the dictionaries and transfer rules were created to translate between the two languages.
A Novel Approach for Rule Based Translation of English to Marathiaciijournal
This paper presents a design for a rule-based machine translation system for the English-Marathi language pair. The machine translation system takes an English sentence as input and parses it with the help of the Stanford parser, which is used for the main source-side processing in the machine translation system. An English-to-Marathi bilingual dictionary is to be created. The system takes the parsed output, separates the source text word by word and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After applying the reordering rules, the English sentence is syntactically reordered to suit the Marathi language.
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABIijnlc
Machine transliteration has emerged as an important research area in the field of machine translation.
Transliteration basically aims to preserve the phonological structure of words, and proper transliteration of
named entities plays a very significant role in improving the quality of machine translation. In this paper we
perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have
constructed rules for syllabification, the process of extracting or separating the syllables from words. We
calculate probabilities for named entities (proper names and locations). For words that do not fall under the
category of named entities, separate probabilities are calculated using relative frequency through a statistical
machine translation toolkit known as MOSES. Using these probabilities we transliterate our input text from
English to Punjabi.
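The relative-frequency estimate the abstract mentions is a simple count ratio over aligned units: P(target | source) = count(source, target) / count(source). The aligned syllable pairs below are invented for illustration, not output of MOSES.

```python
from collections import defaultdict

# Hypothetical aligned (source syllable, target syllable) pairs.
pairs = [("ra", "रा"), ("ra", "र"), ("ra", "रा"), ("vi", "वि")]

counts = defaultdict(lambda: defaultdict(int))
for src, tgt in pairs:
    counts[src][tgt] += 1

def p(tgt, src):
    """Relative-frequency estimate of P(tgt | src)."""
    total = sum(counts[src].values())
    return counts[src][tgt] / total if total else 0.0

# Pick the most probable target syllable for source 'ra': 'रा' with P = 2/3.
best = max(counts["ra"], key=lambda t: p(t, "ra"))
```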
1) This document discusses stemming algorithms that have been used for the Odia language. Stemming is the process of reducing inflected words to their root or stem for purposes like information retrieval.
2) It reviews different stemming algorithms that have been applied to Odia text, including suffix stripping, affix removal, and stochastic algorithms. It also discusses common errors in stemming like over-stemming and under-stemming.
3) Applications of stemming discussed include information retrieval, text summarization, machine translation, indexing, and question answering systems. The document concludes by surveying prior work on stemming algorithms for Odia.
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...kevig
Neural machine translation is a new approach to machine translation that has shown effective results
for high-resource languages. Recently, attention-based neural machine translation with a large-scale
parallel corpus has played an important role in achieving high-performance translation results. In this
research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural
machine translation models are introduced at the word-to-word, character-to-word and syllable-to-word
levels. We run experiments with the proposed models to translate long sentences and to address
morphological problems. To reduce the low-resource problem, source-side monolingual data are also used.
This work thus investigates improving the Myanmar-to-English neural machine translation system. The
experimental results show that the syllable-to-word level neural machine translation model obtains an
improvement over the baseline systems.
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
Stemming lies at the root of text mining. It is used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM), including Text Categorization (TC) and Text Summarization (TS). Building an effective stemmer for the Gujarati language has always been an active research area, since Gujarati has a structure very different from, and more difficult than, other languages due to its rich morphology.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Textskevig
Cyberbullying is currently one of the most important research fields. The majority of researchers have contributed to research on bully-text identification in English texts or comments; due to the scarcity of data, analyzing Tamil text stemming is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and the creation of a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix-stripping technique, and a supervised machine learning algorithm was used to classify words as pronouns and proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and a lemmatization process for discovering exact words from Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13,000 words.
Comparison of stemming algorithms on Indonesian text processingTELKOMNIKA JOURNAL
Stemming is one of the stages performed on the process of extracting information from the text.
Stemming is a process of converting words into their roots. There is an indication that the most accurate
stemmer algorithm is not the only way to achieve the best performance in information retrieval (IR). In this
study, seven Indonesian stemmer algorithms and an English stemmer algorithm are compared:
Nazief, Arifin, Fadillah, Asian, Enhanced confix stripping (ECS), Arifiyanti and Porter. The data used are
2,734 tweets collected from the official Twitter account of PLN. The first aim is to analyze the correlation
between stemmer accuracy and information retrieval performance for Indonesian text. The second is to
identify the best algorithm for Indonesian text processing. This research also proposes an improved
algorithm for stemming Indonesian text. The result shows that the correlation found in previous research
does not occur for the Indonesian language, and that the proposed algorithm was the best for Indonesian
text processing, with a weighted scoring value of 0.648.
Identification and Classification of Named Entities in Indian Languageskevig
The process of identification of Named Entities (NEs) in a given document and their subsequent classification into different categories of NEs is referred to as Named Entity Recognition (NER). A great deal of effort is needed to perform NER in Indian languages and achieve the same or higher accuracy than that obtained for English and the European languages. In this paper, we present the results we have achieved by performing NER in Hindi, Bengali and Telugu using a Hidden Markov Model (HMM) and performance metrics.
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
The document describes a proposed system to translate Tamil speech to Indian Sign Language (ISL) using speech recognition and natural language processing algorithms. It aims to help hearing-impaired people communicate independently. The system would use the CMU Sphinx speech recognition tool to convert spoken Tamil to text, then apply grammar rules and machine learning to translate the text to ISL displayed through video or animated avatars. The document reviews similar existing systems and research on speech recognition and sign language translation to inform the design and implementation of the proposed Tamil-ISL system.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Quality estimation of machine translation outputs through stemming
International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.3, June 2014
DOI:10.5121/ijcsa.2014.4302
QUALITY ESTIMATION OF MACHINE TRANSLATION
OUTPUTS THROUGH STEMMING
Pooja Gupta, Nisheeth Joshi and Iti Mathur
Department of Computer Science, Apaji Institute, Banasthali University, Rajasthan, India
ABSTRACT
Machine translation is a challenging problem for Indian languages. New machine translators are being developed every day, but high-quality automatic translation is still a distant goal, and a correctly translated Hindi sentence is rarely produced. In this paper we focus on the English-Hindi language pair: in order to identify the correct MT output we present a ranking system that employs machine learning techniques and morphological features. The ranking requires no human intervention. We have also validated our results by comparing them with human ranking.
KEYWORDS
Machine Translation, Stemming, Machine learning, Language Model.
1. INTRODUCTION
Machine translation is a field that draws on Computational Linguistics, Artificial Intelligence, Translation Theory and Statistics. Machine translation is fast and available at the click of a button, whereas human translation is a slow, time consuming and expensive task. Yet the acceptance of machine translation remains low, because available systems often produce bad translations and fail to resolve ambiguities. Human languages are highly ambiguous, and the same expression can produce different meanings in different languages. To overcome this problem we integrate multiple machine translation engines into one, i.e. we create a multi-engine machine translation system. Even such a system can select a bad final output, so the MT engine outputs must be ranked. Manual ranking is as tedious as human translation itself, so we need automatic ranking that can handle a large amount of data in minimal time. Developing an automatic ranking system requires several modules. The first module in the machine translation pipeline is an N-gram language model, which acts as the baseline system; the second module is morphological analysis, in which stemming/lemmatization is performed. For N-gram LM ranking we used the trigram approximation approach defined by Gupta et al. [1]. For stemming based ranking we used a Hindi rule based stemmer. This stemmer is a very simple morphological parsing system that encodes some morphological information. Morphological information is important in the design of any MT engine, natural language processing application or information retrieval system.

The rest of the paper is organized as follows: Section 2 briefly reviews related work in this area. Section 3 describes stemming for the Hindi language. Section 4 describes our proposed work, including corpus creation, algorithms and the methodology of the ranking approach. Section 5 presents the evaluation and results. Finally, Section 6 concludes the paper.
2. RELATED WORK
Quality estimation checks several quality indicators of a translation; QuEst is an open source framework for this task. Specia et al. [2] define the extraction of these indicators from source segments, their translations, external resources such as corpora and language models, and language tools such as parsers and part-of-speech taggers. Soricut and Narsale [3] used machine learning for ranking the candidate
translations; they selected the highest-ranked translation as the final output. Avramidis [4]
showed an approach for ranking outputs using grammatical features, employing a statistical parser to analyze several MT outputs and generate ranks for them. Gupta et al. [5][6] applied a naïve Bayes classifier to English-Hindi machine translation systems and ranked the systems, using linguistic features to evaluate quality and comparing the results with automatic evaluation metrics. Moore and Quirk [7] described a smoothing method for N-gram language models based on ordinary counts, which can be used for the quality estimation task. Bharti et al. [8] gave a detailed study of morphology in natural language processing using the paradigm approach.
Stemming was first introduced by Julie Beth Lovins [9] in 1968, who proposed its use in Natural Language Processing applications. Martin Porter [10] improved on this stemmer in 1980, suggesting a suffix stripping algorithm that is still considered a standard stemming algorithm. Goldsmith [11] proposed an unsupervised approach to model morphological variants of European languages. Ramanathan and Rao [12] used the same approach, with additional stemming rules for the Hindi language. Ameta et al. [13] proposed a lightweight stemmer for Gujarati: they implemented a rule based stemmer and created stemming rules to handle the rich morphology of Gujarati. They further used this stemmer in a Gujarati-Hindi machine translation system [14]. Paul et al. [15][16] developed a Hindi lemmatizer that applies rules for removing affixes along with rules for creating a proper root word. Gupta et al. [17] developed a rule based Urdu stemmer that showed an accuracy of 84%, and used it in evaluating English-Urdu machine translation systems [18][19].
3. STEMMING FOR HINDI
Hindi is an Indo-Aryan language and the official language of India, widely spoken by a large number of people in the country. Stemming is the process of reducing a derived word to its stem or root word by clipping off the unnecessary morphemes, known as suffixes. This suffix stripping is carried out by applying various rules, which is what our Hindi rule based stemmer does. Our approach learns suffixes automatically from a large vocabulary or dictionary of words extracted from raw text. This vocabulary, known as an exhaustive lexicon list, contains only root words and derivational words. The purpose of stemming is to obtain the stems of words that are not found in the vocabulary. If a stemmed word is present in the vocabulary, then it is an actual word; otherwise it may be a proper name or an invalid word. Stemming is used in information retrieval systems where input words do not match the vocabulary. For example, if a user enters the input word सफलता and it is not present in the vocabulary of the database, it may cause an erroneous result. With the help of a stemmer, the word can be reduced to its root or stem word: here सफल is the stem or root word and ता is the suffix. The stem supplies the main meaning of the word, while the suffixes add additional meanings.
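The lookup-then-strip procedure described above can be sketched as follows. The vocabulary and suffix list here are small illustrative stand-ins, not the paper's actual exhaustive lexicon or rule set:

```python
# Sketch of a rule-based Hindi suffix-stripping stemmer.
# VOCABULARY and SUFFIXES are tiny illustrative samples only.

VOCABULARY = {"सफल", "भारतीय", "कदम"}        # exhaustive lexicon list (root/derivational words)
SUFFIXES = ["ताओं", "ता", "ओं", "ों"]        # candidate suffixes, tried longest-first

def stem(word: str) -> str:
    """Return the word itself if it is in the vocabulary,
    otherwise strip the longest matching suffix rule."""
    if word in VOCABULARY:
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word  # no rule applies: possibly a proper name or invalid word

print(stem("सफलता"))  # सफल  (strips the suffix "ता")
```

Trying suffixes longest-first matters: a word ending in a longer compound suffix should not be truncated by one of its shorter sub-suffixes.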
4. PROPOSED WORK
Our proposed approach is based on n-gram language models. N-gram language models use the
Markov assumption to break the probability of a sentence into the product of the probability of
each word, given the history of preceding words. We have used Markov chains of order 2 which
are called trigram approximations. N-gram language models are based on statistics of how likely words are to follow each other. Equations 1, 2 and 3 show the maximum likelihood estimates of unigram, bigram and trigram probabilities respectively, where C(.) is a count over the corpus and N is the total number of words:

P(wi) = C(wi) / N                                    (1)
P(wi | wi-1) = C(wi-1 wi) / C(wi-1)                  (2)
P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)   (3)
4.1. Corpus Creation and Experimental Setup
The approach to corpus creation is based on language modelling. To compute the probability of a string, we first collected a large amount of text and obtained trigrams along with their number of occurrences (frequency). We created our ranking system mainly for raw text from the tourism domain, although the corpus also includes words from available dictionaries; it is in fact our bilingual parallel corpus. We used a total of 35000 Hindi sentences, giving 513910 unigrams, 308706 bigram word units and 53062 trigram word units. We also created a stem corpus of the same 35000 Hindi sentences. Table 1 shows the stemmed trigram corpus of an English sentence and its Hindi translation.
English Sentence: Indians must take protective actions to protect their freedom
Hindi Sentence: [Devanagari text lost in extraction]

Table 1: Stemmed Corpus — ten rows pairing each Hindi trigram of the translation with its stem trigram. Most Devanagari content was lost in extraction; surviving fragments include भारतीय को अपनी, को अपनी, कदम उठाने and कदम उठाना.
4. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.3, June 2014
18
We used the following algorithms to generate the n-grams for our study and to generate the stems of the corresponding n-grams. We applied these algorithms to both English and Hindi sentences separately. The algorithms are shown in Table 2 and Table 3.
Input: Raw sentences
Output: Annotated Text (N-grams text)
Table2: LM Algorithm
Step1. Input the raw sentence file and repeat steps 2 to 4 for each sentence.
Step2. Split the sentence into words.
Step3. Generate trigrams, bigrams and unigrams for the entire sentence.
Step4. If an n-gram is already present, then increase its frequency count.
Step5. If an n-gram is unique, insert it; n-grams are sorted in descending order of frequency.
Step6. Generate probabilities of unigrams using equation 1.
Step7. Generate probabilities of bigrams using equation 2.
Step8. Generate probabilities of trigrams using equation 3.
Step9. The output file is in our desired n-gram format.
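Steps 2 to 8 of the LM algorithm can be sketched in Python as follows; this is a minimal maximum-likelihood estimate over whitespace-tokenized sentences, not the paper's actual implementation:

```python
from collections import Counter

def build_ngram_lm(sentences):
    """Count unigrams, bigrams and trigrams over the sentences and
    convert the counts to maximum-likelihood probabilities (equations 1-3)."""
    counts = {1: Counter(), 2: Counter(), 3: Counter()}
    total_words = 0
    for sentence in sentences:
        words = sentence.split()                       # Step 2: split into words
        total_words += len(words)
        for n in (1, 2, 3):                            # Step 3: generate n-grams
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1  # Steps 4-5: frequency counts
    probs = {}
    for gram, c in counts[1].items():
        probs[gram] = c / total_words                  # equation 1
    for n in (2, 3):
        for gram, c in counts[n].items():
            # equations 2 and 3: divide by the count of the (n-1)-gram history
            probs[gram] = c / counts[n - 1][gram[:-1]]
    return probs

lm = build_ngram_lm(["the cat sat", "the cat ran"])
print(lm[("the", "cat")])  # C(the cat) / C(the) = 2/2 = 1.0
```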
Input: N-grams text
Output: Stems text
Table3: Stemming Algorithm
Step1. Input the n-gram word.
Step2. Look the word up in the database.
Step3. If the word exists in the database, then it is displayed as output.
Step4. If the word does not exist in the database, then the rules are accessed for stripping out the suffix.
Step5. The rules work by deleting the suffix from the input.
Step6. The output, in our desired stem format, is shown in Table 1.
In our study we used 1320 English sentences and six MT engines, which were also used by Joshi [20] in his study. The list of engines is shown in Table 4. Among these, E1, E2 and E3 are MT engines freely available on the internet; E4, E5 and E6 are MT engines that we developed using different MT toolkits. E4 was an MT system trained using the Moses MT toolkit [21] with a syntax based model [22]; we used the Collins parser to generate parses of the English sentences and a tree-to-string model to train the system. E5 was a simple phrase based MT system, also built with the Moses MT toolkit. E6 was an example based MT system developed by Joshi et al. [23][24]. These three systems used the 35000-sentence English-Hindi parallel corpus to train and tune themselves, in an 80-20 ratio: 28000 sentences for training and the remaining 7000 for tuning.
Table 4. MT Engines
4.2. Methodology
To rank the MT outputs of the various systems, we first generated the trigrams of the English sentence as well as of its translations produced by the different MT engines. We then applied the stemming algorithm to obtain stemmed sentences and generated the stem trigrams of all translations. To rank the translations we applied the following algorithm:
Input: English Sentence with MT outputs
Output: Ranked MT output list
Ranking Algorithm
Step1. Generate trigrams from the English sentence.
Step2. Match these trigrams against the English language model and retain the matched ones.
Step3. Match the retained English trigrams' lexicons with the English-Hindi parallel lexicon list and with the Hindi stem trigrams' lexicon list.
Step4. If a match is found, register the corresponding Hindi stem trigram lexicon.
Step5. Match the Hindi language model with the registered Hindi stem lexicons and sum the probabilities of each match.
Step6. Perform these steps on all MT outputs.
Step7. Sort the MT outputs in descending order of their cumulative probabilities.
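The scoring and sorting steps above (Steps 5-7) can be sketched as follows. The language model here is a plain dictionary from stem trigrams to probabilities, and the lexicon matching of Steps 2-4 is abstracted into the trigram lists passed in; both are simplifications of the paper's pipeline, and the trigrams and probabilities are hypothetical:

```python
def rank_outputs(outputs, stem_lm):
    """outputs: {engine_name: list of stem trigrams of that engine's output}
    stem_lm: {stem trigram: probability} — the Hindi stem language model.
    Returns engines sorted by descending cumulative probability (Steps 5-7)."""
    scores = {}
    for engine, trigrams in outputs.items():
        # Step 5: sum the probability of every trigram found in the stem LM
        scores[engine] = sum(stem_lm.get(t, 0.0) for t in trigrams)
    # Step 7: descending order of cumulative probability
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical stem trigram probabilities and engine outputs
stem_lm = {("a", "b", "c"): 0.5, ("b", "c", "d"): 0.3}
outputs = {"E1": [("a", "b", "c"), ("b", "c", "d")],
           "E2": [("a", "b", "c")],
           "E3": [("x", "y", "z")]}
print(rank_outputs(outputs, stem_lm))  # E1 first, then E2, then E3
```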
To have a better understanding of the functionality, we have illustrated the entire process through
the following example.
Sentence: The Indian Himalayan range is undoubtedly one of the most spectacular and
impressive mountain ranges in the world.
E1 Output: भारतीय सबसे शानदार और
से एक है।
E2 Output: भारतीय बेशक सबसे शानदार और
से एक है.
Engine No.  Description
E1          Microsoft Bing MT Engine (http://www.microsofttranslator.com)
E2          Google MT Engine (http://translate.google.com)
E3          Babylon MT Engine (http://translation.babylon.com)
E4          Moses Syntax Based Model
E5          Moses Phrase Model
E6          Example Based MT Engine
6. International Journal on Computational Sciences & Applications (IJCSA) Vol.4, No.3, June 2014
20
E3 Output: भारतीय का यह से एक सबसे एवं
शृंखलाओं ।
E4 Output: यहाँ भारतीय है undoubtedly एक के के पहाड़ी और
impressive के world.
E5 Output: The Indian Himalayan एक undoubtedly है के सबसे spectacular और
impressive mountain ranges के है ।
E6 Output: भारतीय है समूचे एक सबसे देखते बनती और
।
Table 5 shows the n-gram statistics of these sentences, along with the sum of the cumulative probabilities of their trigrams. From this data we can rank the systems according to their probabilities.
Table 5. MT Systems
Engine Unigrams Bigrams Trigrams Prob. Sum
E1 16 15 14 0.843723
E2 17 16 15 0.843723
E3 17 16 15 0.574318
E4 18 17 16 0.0
E5 21 20 19 0.293709
E6 18 17 16 0.463309
5. EVALUATION
To evaluate the performance of our system we collected 1300 sentences from the tourism domain. These sentences were not part of the 35000 sentences used to train the models. To validate our results we compared the ranks of the systems with the ranks given to the MT systems by a human evaluator. The human evaluator used the subjective human evaluation metric developed by Joshi et al. [25], which evaluates an MT output on eleven parameters. These were:
1. Translation of Gender and Number of the Noun(s).
2. Identification of the Proper Noun(s).
3. Use of Adjectives and Adverbs corresponding to the Nouns and Verbs.
4. Selection of proper words/synonyms (Lexical Choice).
5. Sequence of phrases and clauses in the translation.
6. Use of Punctuation Marks in the translation.
7. Translation of tense in the sentence.
8. Translation of Voice in the sentence.
9. Maintaining the semantics of the source sentence in the translation.
10. Fluency of translated text and translator's proficiency.
11. Overall quality of the translation.
Each MT output was judged on these 11 parameters. The human evaluator was asked to give a score on a 5-point scale, shown in Table 6.
Table 6. Human Evaluation Scale
Score Description
1 Ideal
2 Perfect
3 Acceptable
4 Partially Acceptable
5 Not Acceptable
For evaluation, we used the methodology of Joshi et al. [26]. We compared the system generated ranks with the human ranks in three different categories. In the first category we compared the ranks of all the systems, irrespective of their type. In the second category we compared the ranks of only the web based systems, and in the third category we compared the ranks of only the MT toolkit systems, which had very limited corpora to train and tune themselves.

In the combined category, engine E1 performed better than any other MT engine and scored the highest rank: out of 1300 sentences, it scored the highest rank for 407 sentences. Engine E2 was the second best, while engine E4 did not perform well. Table 7 shows the results of this study.
Table 7. Ranking at Combined Category

Engine  Stem LM Ranking  Human Ranking
E1      407              376
E2      285              279
E3      145              140
E4      8                7
E5      256              205
E6      236              240

In the web-based category, again E1 and E2 performed better and were the top ranking systems, while E4 was the worst. Table 8 shows the results of this study. In the MT toolkits category, E6 performed better than the other MT engines and E4 was the worst engine. Table 9 shows the results of this study. These ranks were similar to the ranks provided by the human evaluator. Figures 1, 2 and 3 summarize these data.

Table 8. Ranking at Web-Based Category

Engine  Stem LM Ranking  Human Ranking
E1      603              587
E2      432              473
E3      235              145
Table 9. Ranking at MT Toolkits Category

Engine  Stem LM Ranking  Human Ranking
E4      16               18
E5      234              254
E6      356              288
Figure 1. Ranking at Combined Category
Figure 2. Ranking at Web-Based Category
Figure 3. Ranking at MT Toolkits Category
6. CONCLUSIONS
In this paper we have shown the effective use of language models and morphological analysis in ranking MT systems. For this we generated language models for English and Hindi sentences as well as for Hindi stemmed text. The system described here is simple and efficient for automatic ranking, even when the amount of available raw text is not large. The ranks produced by the stem language model based ranking were similar to the ranks given by the human judge. We found that Microsoft Bing Translator was the best of the six translators, as it most often gave a correctly translated sentence. The current system performs at least as well as the baseline system, and the stemmed language model based ranking has much higher accuracy than the baseline model, although stem based ranking takes more time than either baseline or human ranking. As an immediate future study, we can incorporate part of speech tagging into the language models and then perform the ranking to see whether the performance of the system improves.
REFERENCES
[1] Gupta, P., Joshi, N., & Mathur, I. (2013). Automatic Ranking of MT Outputs using Approximations.
International Journal of Computer Applications, 81(17), 27-31.
[2] Specia et al. (2013). QuEst- A translation quality estimation framework. Proceedings of the 51st
Annual Meeting of the Association for Computational Linguistics, Bulgaria.
[3] Soricut, R. and Narsale, S. (2012). Combining Quality Prediction and System Selection for Improved
Automatic Translation Output. In Proceedings of the Seventh Workshop on Statistical Machine
Translation, Montréal, Canada. Association for Computational Linguistics.
[4] Avramidis E. (2012). Quality Estimation for Machine Translation output using linguistic analysis and
decoding features. In Proceedings of the 7th Workshop on Statistical Machine Translation.
[5] Gupta R., Joshi N., Mathur I. (2013). Analysing Quality of English-Hindi Machine Translation
Engine Outputs Using Bayesian Classification. International Journal of Artificial Intelligence and
Applications, Vol 4 (4), pp 165-171.
[6] R. Gupta, N. Joshi, and I. Mathur (2013). Quality Estimation of English-Hindi Outputs Using Naïve
Bayes Classifier. Advances in Computing, Communications and Informatics (ICACCI), 2013
International Conference on. IEEE.
[7] Moore Robert C. and Quirk Chris (2009). Improved Smoothing for N-gram Language Models Based
on Ordinary Counts. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers.
[8] Bharti A, Chaitanya V., Sangal R. (1995). Natural Language Processing: A Paninian Perspective.
Prentice-Hall of India.
[9] Lovins J. B. (1968). Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics.
[10] Porter M. F. (1980). An algorithm for suffix stripping. Program: electronic library and information
systems, 14(3), 130-137.
[11] Goldsmith J. (2006). An algorithm for unsupervised learning of morphology. Natural Language Engineering, 12(4), 353-371.
[12] Ramanathan A. and Rao D. (2003). A Lightweight Stemmer for Hindi. In Proceedings of EACL.
[13] Ameta J., Joshi N., Mathur I. (2011). A Lightweight Stemmer for Gujarati. In Proceedings of 46th
Annual National Convention of Computer Society of India. Ahmedabad, India.
[14] Ameta J., Joshi N., Mathur I. (2013). Improving the Quality of Gujarati-Hindi Machine Translation
Through Part-of-Speech Tagging and Stemmer-Assisted Transliteration. International Journal on
Natural Language Computing, Vol 3(2), pp 49-54.
[15] Paul S., Joshi N., Mathur I. (2013). Development of a Hindi Lemmatizer. International Journal of
Computational Linguistics and Natural Language Processing, Vol 2(5), pp 380-384.
[16] Paul S., Tandon M., Joshi N., Mathur I. (2013). Design of a Rule Based Hindi Lemmatizer. In
Proceedings of Third International Workshop on Artificial Intelligence, Soft Computing and
Applications, Chennai, India, pp 67-74.
[17] Gupta V., Joshi N., Mathur I. (2013). Rule Based Urdu Stemmer. In Proceedings of 4th International
Conference on Computer and Communication Technology. IEEE.
[18] Gupta V., Joshi N., Mathur I. (2014). Evaluation of English-to-Urdu Machine Translation. Intelligent
Computing, Networking, and Informatics. Springer India. pp 351-358.
[19] Gupta V., Joshi N., Mathur I. (2013). Subjective and Objective Evaluation of English to Urdu
Machine Translation. Advances in Computing, Communications and Informatics (ICACCI), 2013
International Conference on. IEEE.
[20] N. Joshi (2014). Implications of linguistic feature based evaluation in improving machine translation
quality a case of english to hindi machine translation.
http://ir.inflibnet.ac.in:8080/jspui/handle/10603/17502
[21] Koehn et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of
the Annual Meeting of the Association for Computational Linguistics, demonstration session.
[22] Hoang, H., Koehn, P., & Lopez, A. (2009). A unified framework for phrase-based, hierarchical, and
syntax-based statistical machine translation. In Proc. of the International Workshop on Spoken
Language Translation, Tokyo, Japan.
[23] Joshi N., Mathur I., and Mathur S. (2011). Translation Memory for Indian Languages: An Aid for
Human Translators, Proceedings of 2nd International Conference and Workshop in Emerging Trends
in Technology
[24] Joshi, N., and Mathur, I. (2012). Design of English-Hindi Translation Memory for Efficient
Translation. In Proc. of National Conference on Recent Advances in Computer Engineering, Jaipur,
India.
[25] Joshi N., Mathur, I., Darbari H., and Kumar A. (2013). HEval: Yet Another Human Evaluation
Metric. International Journal of Natural Language Computing, Vol 2, No 5, pp 21-36.
[26] Joshi, N., Darbari, H, and Mathur, I. (2012). Human and Automatic Evaluation of English to Hindi
Machine Translation Systems. Advances in Computer Science, Engineering & Applications. Springer
Berlin Heidelberg.
AUTHORS
Pooja Gupta is pursuing her M.Tech in Computer Science from Banasthali University, Rajasthan and is
working as a Research Assistant in English-Indian Languages Machine Translation System Project
sponsored by TDIL Programme, DeitY. She has her interest in Machine Translation specifically in English-
Hindi Language Pair. Her current research interest includes Natural Language Processing and Machine
Translation.
Dr. Nisheeth Joshi is an Associate Professor at Banasthali University. He has been primarily working in
the design and development of evaluation metrics for Indian languages. Besides this he is also actively
involved in the development of MT engines for English to Indian Languages. He is one of the experts
empanelled with TDIL programme, Department of Electronics and Information Technology (DeitY), Govt.
of India, a premier organization which foresees Language Technology Funding and Research in India. He
has several publications in various journals and conferences and also serves on the Programme Committees
and Editorial Boards of several conferences and journals.
Iti Mathur is an Assistant Professor at Banasthali University. Her primary area of research is Computational
Semantics and Ontological Engineering. Besides this she is also involved in the development of MT
engines for English to Indian Languages. She is one of the experts empanelled with TDIL Programme,
Department of Electronics and Information Technology (DeitY), Govt. of India, a premier organization
which foresees Language Technology Funding and Research in India. She has several publications in
various journals and conferences and also serves on the Programme Committees and Editorial Boards of
several conferences and journals.