Machine transliteration has emerged as an important research area in the field of machine translation. Transliteration aims to preserve the phonological structure of words, and proper transliteration of named entities plays a significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed rules for syllabification, the process of separating words into their constituent syllables. We compute probabilities for named entities (proper names and locations); for words that do not fall under the category of named entities, separate probabilities are computed by relative frequency using the statistical machine translation toolkit MOSES. Using these probabilities we transliterate the input text from English to Punjabi.
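The relative-frequency probability estimation described in this abstract can be sketched as follows. The aligned syllable pairs below are illustrative assumptions, not the paper's actual training data, and the decoding step simply picks the most probable target for each source syllable.

```python
from collections import Counter, defaultdict

# Toy aligned training pairs of (English syllable, Gurmukhi syllable).
# These pairs are illustrative assumptions, not the paper's actual data.
aligned = [("ram", "ਰਾਮ"), ("an", "ਅਨ"), ("ram", "ਰਾਮ"), ("an", "ਨ")]

# Relative frequency: P(target | source) = count(source, target) / count(source)
pair_counts = Counter(aligned)
source_counts = Counter(src for src, _ in aligned)

prob = defaultdict(dict)
for (src, tgt), c in pair_counts.items():
    prob[src][tgt] = c / source_counts[src]

# Transliterate by choosing the most probable target for each syllable.
def transliterate(syllables):
    return "".join(max(prob[s], key=prob[s].get) for s in syllables if s in prob)
```

In a real system the aligned pairs would come from the syllabified training corpus and MOSES would estimate these scores as part of its phrase table.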
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ... (Waqas Tariq)
The words of a sentence do not follow the same ordering in different languages. This paper proposes Parts-of-Speech (POS) based rules for reordering a given English sentence to obtain its translation in Telugu. Added rules for adverbs and exceptional conjunctions, together with improved handling of inflections, enable the system to achieve more accurate translation. The proposed rules, combined with the existing system, achieved a score of 0.6190 on the BLEU evaluation metric when translating sentences from English to Telugu. The paper handles simple sentence forms particularly well.
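As a rough illustration of POS-based reordering toward a verb-final (SOV) target language such as Telugu, the sketch below moves the verb of a POS-tagged English clause to the end. The single verb-movement rule and the Penn-style tags are simplified assumptions, not the paper's actual rule set.

```python
# Reorder a POS-tagged English clause from SVO toward SOV word order,
# as needed for a verb-final target language such as Telugu.
# The single verb-movement rule here is a simplified assumption.
def reorder_svo_to_sov(tagged):
    verbs = [(w, t) for w, t in tagged if t.startswith("VB")]
    rest = [(w, t) for w, t in tagged if not t.startswith("VB")]
    return rest + verbs  # subject and object keep their order; verb goes last

sentence = [("Rama", "NNP"), ("reads", "VBZ"), ("a", "DT"), ("book", "NN")]
reordered = [w for w, _ in reorder_svo_to_sov(sentence)]
# -> ['Rama', 'a', 'book', 'reads']
```

A full system would add the paper's further rules for adverbs, conjunctions and inflections on top of this basic movement.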
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S... (ijnlc)
Machine translation for Indian languages is an emerging research area. Transliteration is one of the modules designed while building a translation system; it maps source language text into the target language. Simple mapping decreases the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration, and show that the effectiveness of translation improves when transliteration is assisted by part-of-speech tagging and stemming. We demonstrate that much of the Gujarati content gets transliterated while being processed for translation to Hindi.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (ijnlc)
Morphological stemming is a critical step in natural language processing: it reduces alternative word forms to a common morphological root. Word segmentation for Myanmar, as for most Asian languages, is an important and extensively studied sequence labelling problem. Named entity detection in Asian languages has traditionally required a large amount of feature engineering to achieve high performance. Our new approach integrates these tasks so that each benefits the others. In recent years, end-to-end sequence labelling models based on deep learning have become widely used. This paper introduces a deep BiGRUCNN-CRF network that jointly learns word segmentation, stemming and named entity recognition, trained on manually annotated corpora. Whereas state-of-the-art named entity recognition systems rely heavily on handcrafted features, our joint model relies on two sources of information: character-level representations and syllable-level representations.
Brill's Rule-based Part of Speech Tagger for Kadazan (idescitation)
This paper presents a Part of Speech (POS) tagger for the Kadazan language built with
Brill's approach, also known as Transformation-Based Error-Driven Learning. Kadazan
was chosen because no POS tagger has yet been developed for it. This study was therefore
carried out to develop a POS tagger for Kadazan that can tag a Kadazan corpus
systematically, help reduce the ambiguity problem, and at the same time serve as a
language learning tool. The main objective of this study is to automate the tagging process
for Kadazan. Brill's approach is an enhanced version of the original rule-based approach:
it transforms tags according to a set of predefined rules, using those rules to turn wrong
tags into correct tags in the corpus. To achieve this goal, several objectives were set: to
create specific lexical and contextual rules for Kadazan by applying Brill's rule-based
approach, and to evaluate the effectiveness of the Kadazan POS tagger. The tagging
process is divided into four main phases. In the first phase, new untagged text is input into
the system. In the second phase, the text passes through the initial-state annotator, which
tags every word in the corpus with its most likely tag and produces a temporary corpus. In
the third phase, the temporary corpus is compared with the goal corpus to detect any
errors. In the last phase, the rules are applied to reduce those errors and fix the temporary
corpus. The tagger was trained on two Kadazan children's story books containing 2069
words, and evaluated by comparing its output with manual tagging. The Kadazan POS
tagger achieved around 93% accuracy, showing that Brill's tagging approach can be used
to identify tags for the Kadazan language.
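The two tagging phases described above (initial-state annotation, then contextual transformation rules) can be sketched in miniature as follows. The lexicon and the single rule are hypothetical English examples, not the Kadazan rules from the paper.

```python
# Brill-style transformation-based tagging in miniature.
# Phase 1: the initial-state annotator assigns each word its most likely tag.
# Phase 2: contextual rules transform wrong tags based on neighbouring tags.
# The lexicon and rule below are hypothetical illustrations.
lexicon = {"the": "DET", "dog": "NOUN", "can": "MODAL", "run": "VERB"}

# (from_tag, to_tag, previous_tag): change from_tag to to_tag
# when the preceding word carries previous_tag.
rules = [("MODAL", "NOUN", "DET")]  # e.g. "the can" -> noun reading

def brill_tag(words):
    tags = [lexicon.get(w, "NOUN") for w in words]  # initial annotation
    for frm, to, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return list(zip(words, tags))

# brill_tag(["the", "can"]) -> [("the", "DET"), ("can", "NOUN")]
```

In the real system the rules are learned automatically by comparing the temporary corpus against the goal corpus and keeping the transformations that remove the most errors.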
A Review on a Web-based Punjabi to English Machine Transliteration System (Editor IJCATR)
The paper presents the transliteration of noun phrases from Punjabi to English using a statistical machine translation approach. Transliteration maps the letters of the source script to letters of another language. Forward transliteration converts an original word or phrase in the source language into a word in the target language; backward transliteration is the reverse process, converting the transliterated word or phrase back into its original form. Transliteration is an important part of research in Natural Language Processing (NLP), the ability of a computer program to understand human language as it is spoken, and NLP is in turn an important component of Artificial Intelligence (AI), the branch of science concerned with helping machines solve complex problems in a human-like fashion. The transliteration system is developed using Statistical Machine Translation (SMT), a data-oriented statistical framework for translating text from one natural language to another based on knowledge learned from data.
Welcome to International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
Quality estimation of machine translation outputs through stemming (ijcsa)
Machine translation is a challenging problem for Indian languages. New machine
translators appear every day, but high-quality automatic translation is still a very distant goal,
and a correctly translated Hindi sentence is rarely produced. In this paper we focus on the
English-Hindi language pair and, in order to identify the correct MT output, present a ranking system
that employs machine learning techniques and morphological features. The ranking requires no human
intervention, and we have validated our results by comparing them with human rankings.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Segmentation and alignment of source and target words is a primary step in the statistical learning of a transliteration. Here we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training word pairs in terms of sub-syllable-like units instead of individual characters. This has been found useful for handling out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, and we asked whether the same holds for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but found instead that although our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed it when the Top-10 results were re-ranked using language modeling approaches. Our experiments were conducted for English to Telugu transliteration (the method applies equally well to most written Indic languages). For training, we applied syllable-like segmentation and alignment to a large training set, on which we built a statistical model by modifying the character-level maximum-entropy transliteration learning system of Kumaran and Kellner; for testing, we applied the same segmentation to each test English word, applied the model, and re-ranked the resulting top 10 Telugu words. We also describe the creation and selection of our dataset, since standard datasets are not available.
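A crude sketch of the syllable-like segmentation idea: split an English word into consonant-plus-vowel units rather than individual characters. The regex heuristic below is a simplifying assumption, not the authors' actual segmenter.

```python
import re

# Split a word into rough consonant-cluster + vowel-cluster units instead of
# single characters, approximating the sub-syllable-like segmentation idea.
# This regex heuristic is an assumption, not the authors' actual segmenter.
def syllable_like_units(word):
    # One unit = optional leading consonants, then vowels, then trailing
    # consonants that are not themselves followed by another vowel.
    units = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]+(?![aeiou]))?", word.lower())
    return units or [word]  # vowel-less words stay whole

# syllable_like_units("rama") -> ["ra", "ma"]
```

Aligning such units across the language pair gives the model larger, more phonetically meaningful correspondence units than single characters.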
In this paper we discuss the difficulties in processing Malayalam text for Statistical Machine Translation (SMT), especially verb forms. The agglutinative nature of Malayalam is the main obstacle in processing its text. We focus mainly on verbs and their contribution to the difficulty of processing, since the verb plays a crucial role in defining the sentence structure. We illustrate the issues with the existing Google translation system and with a MOSES system trained on a limited English-Malayalam parallel corpus. Our analysis is based on the English-Malayalam language pair.
Natural Language Toolkit (NLTK) is a generic platform for processing data in various natural
(human) languages, and it also provides resources for Indian languages such as Hindi, Bangla and
Marathi. In the proposed work, the repositories provided by NLTK are used to process Hindi
text and then to analyse Multi-Word Expressions (MWEs). MWEs are lexical items that can be
decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical
idiomaticity. The main focus of this paper is the processing and analysis of MWEs in Hindi text. The
corpus used for Hindi text processing is taken from the famous Hindi novel "KaramaBhumi" by Munshi
PremChand. The results are analysed using the Hindi corpus provided by the Resource Centre for Indian
Language Technology Solutions (CFILT) to justify the accuracy of the proposed work.
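The statistical side of MWE detection can be sketched as counting adjacent word pairs and scoring how often they co-occur relative to chance. The tiny English corpus and the unnormalized PMI-like ratio below are hypothetical stand-ins for the Hindi text and the measures used in the paper.

```python
from collections import Counter

# Count adjacent word pairs in a tokenized corpus and rank them as
# multi-word-expression candidates. The tiny corpus is a hypothetical
# stand-in for the Hindi novel text used in the paper.
tokens = "new delhi is far from new york but new delhi is home".split()

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

# Simple association score: bigram count divided by the product of
# unigram counts (an unnormalized PMI-like ratio).
def score(pair):
    a, b = pair
    return bigrams[pair] / (unigrams[a] * unigrams[b])

candidates = sorted(bigrams, key=score, reverse=True)
```

Pairs like "new delhi" that recur as a unit score higher than chance pairings of frequent words, which is the statistical idiomaticity the abstract refers to.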
Anaphora resolution in Hindi language using gazetteer method (ijcsa)
Anaphora resolution is one of the active research areas within natural language processing,
and resolving anaphoric references is one of its most challenging and complex tasks. This
paper focuses on pronominal anaphora resolution for Hindi. Among the various
methodologies for resolving anaphora, it presents a computational model for Hindi based
on the gazetteer method, which builds lists and then applies operations to classify the
elements in them. Of the many salient factors for resolving anaphora, the proposed model
uses two: animistic knowledge and recency. The animistic factor distinguishes living things
from non-living things, while recency captures the fact that referents mentioned in the
current sentence tend to receive higher weights than those in previous sentences. The paper
reports experiments conducted on short Hindi stories, news articles and biographical
content from Wikipedia, along with results and future directions for improving accuracy.
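The two-factor resolution described above might be sketched as follows; the gazetteer lists, pronoun sets and weights are illustrative assumptions rather than the paper's actual resources.

```python
# Resolve a pronoun by scoring candidate noun referents with two factors:
# animacy (from a gazetteer of living things) and recency (later mentions
# score higher). Lists and weights are illustrative assumptions.
ANIMATE = {"ram", "sita", "dog"}          # gazetteer of living entities
ANIMATE_PRONOUNS = {"he", "she", "who"}   # pronouns requiring animate referents

def resolve(pronoun, candidates):
    """candidates: nouns in order of mention (earlier first)."""
    best, best_score = None, float("-inf")
    for position, noun in enumerate(candidates):
        # Animacy agrees when both noun and pronoun are animate, or neither is.
        animacy_ok = (noun in ANIMATE) == (pronoun in ANIMATE_PRONOUNS)
        score = position + (10 if animacy_ok else 0)  # recency + animacy bonus
        if score > best_score:
            best, best_score = noun, score
    return best

# resolve("he", ["table", "ram"]) -> "ram"
```

A fuller model would decay recency across sentence boundaries, as in the weighting scheme the abstract describes.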
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... (kevig)
Neural machine translation is a new approach to machine translation that has shown effective results
for high-resource languages. Recently, attention-based neural machine translation with large-scale
parallel corpora has played an important role in achieving high translation performance. In this
research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based
neural machine translation models are introduced at the word-to-word, character-to-word, and
syllable-to-word levels. We run experiments with the proposed models to translate long sentences and
to address morphological problems, and we also use source-side monolingual data to mitigate the
low-resource problem. This work thus investigates how to improve a Myanmar-to-English neural
machine translation system. The experimental results show that the syllable-to-word neural machine
translation model obtains an improvement over the baseline systems.
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN... (csandit)
Sentiment Analysis (SA) and machine learning techniques are combined to understand the attitude a writer implies in a given text. SA is challenging in itself, and especially challenging for the Arabic language. In this paper, we enhance sentiment analysis for Arabic. Our approach begins with special preprocessing steps; we then adopt the sentiment keywords co-occurrence measure (SKCM), an algorithm for sentiment-based feature selection. This feature selection method is applied to three sentiment corpora using an SVM classifier. We compare our approach with some traditional methods followed by most SA works. The experimental results are very promising for enhancing SA accuracy.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach (IJERA Editor)
Language is an effective medium of communication that conveys the ideas and expressions of the human
mind. There are more than 5000 languages in the world, and knowing all of them is not a practical
solution to the language barrier in communication. In this multilingual world, with huge amounts of
information exchanged between regions and in different languages in digitized form, it has become
necessary to find an automated process for converting one language to another. Natural Language
Processing (NLP) is one of the hot areas of research, exploring how computers can be utilized to
understand and manipulate natural language text or speech. In the proposed system, a hybrid approach
for transliterating proper nouns from Punjabi to Hindi is developed, combining direct mapping, a
rule-based approach and Statistical Machine Translation (SMT). The proposed system was tested on
proper nouns from different domains, and its accuracy is very good.
Anaphora resolution is the process of finding referents in discourse; in computational linguistics it
is a complex and challenging task. This paper focuses on pronominal anaphora resolution, the subtask
in which pronouns are linked to noun referents. Incorporating anaphora resolution into applications
such as automatic summarization, opinion mining, machine translation and question answering systems
can increase their accuracy by around 10%. Related work exists in many languages; this paper
addresses anaphora resolution for the Punjabi language. A model is proposed for resolving anaphora,
and an experiment is conducted to measure the accuracy of the system. The model uses two factors:
recency and animistic knowledge. The recency factor follows the concept of the Lappin and Leass
approach, and animistic knowledge is introduced using the gazetteer method. The experiment is
conducted on a Punjabi story of more than 1000 words, and the results are presented along with
future directions.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most natural language processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. Distributional semantics, on the other hand, is a powerful co-occurrence model that captures the meaning of a word from its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It achieved an accuracy of 81.5%, at least a 9.4% improvement over other stemmers.
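Smoothed PMI over word-context co-occurrences, as used above, can be sketched from counts; the smoothing constant and the toy romanized counts are assumptions, not the paper's data.

```python
import math
from collections import Counter

# Smoothed Pointwise Mutual Information between a word and a context:
#   SPMI(w, c) = log( (count(w, c) + k) * N / (count(w) * count(c)) )
# with additive smoothing constant k. All counts below are toy assumptions.
pair_counts = Counter({("ktb", "book"): 8, ("ktb", "write"): 6, ("slm", "peace"): 5})
word_counts = Counter({"ktb": 14, "slm": 5})
ctx_counts = Counter({"book": 8, "write": 6, "peace": 5})
N = sum(pair_counts.values())
k = 0.5  # smoothing constant (an assumption)

def spmi(word, ctx):
    return math.log(
        (pair_counts[(word, ctx)] + k) * N / (word_counts[word] * ctx_counts[ctx])
    )

# Thanks to smoothing, an unseen pair still gets a finite (negative) score,
# so candidate roots can be compared even without observed co-occurrences.
```

For stemming disambiguation, each candidate root would be scored against the word's context and the root with the strongest association kept.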
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE (ijnlc)
Manipuri is a minority, morphologically rich language with genetic features similar to the Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology, and is monosyllabic; morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a research field of computer science that deals with processing large amounts of natural language corpora. NLP applications include E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), WordNet and Word Sense Disambiguation (WSD). In this paper, we present a study of the advancements in NLP applications for Manipuri, together with a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING (kevig)
This paper describes the use of Naive Bayes to assign function tags, and of a context-free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar comes from the fact that Myanmar has free phrase order and a complex
morphological system. Function tagging is a preprocessing step for parsing: using a functionally annotated corpus, we tag Myanmar sentences that carry correct segmentation, POS (part-of-speech) tagging and chunking information. We then propose Myanmar grammar rules and apply the CFG to derive the parse tree of each function-tagged sentence. Experiments
show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
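The Naive Bayes function-tagging step can be sketched as a small classifier over chunk features with add-one smoothing. The features, tag set and training examples below are hypothetical illustrations, not the paper's annotated corpus.

```python
import math
from collections import Counter, defaultdict

# Naive Bayes function tagging: pick the function tag t maximizing
# P(t) * prod_f P(f | t), with add-one smoothing on feature counts.
# The training data maps chunk features (POS, position) to function tags
# and is a hypothetical illustration.
train = [({"pos": "NN", "position": "first"}, "SUBJ"),
         ({"pos": "NN", "position": "mid"}, "OBJ"),
         ({"pos": "VB", "position": "last"}, "PRED"),
         ({"pos": "NN", "position": "first"}, "SUBJ")]

tag_counts = Counter(t for _, t in train)
feat_counts = defaultdict(Counter)
for feats, t in train:
    for f in feats.items():
        feat_counts[t][f] += 1

def classify(feats):
    def log_prob(t):
        lp = math.log(tag_counts[t] / len(train))          # prior P(t)
        for f in feats.items():                            # likelihoods P(f | t)
            lp += math.log((feat_counts[t][f] + 1) / (tag_counts[t] + len(train)))
        return lp
    return max(tag_counts, key=log_prob)

# classify({"pos": "NN", "position": "first"}) -> "SUBJ"
```

The predicted function tags would then feed the CFG rules that build the parse tree.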
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
Quality estimation of machine translation outputs through stemmingijcsa
Machine Translation is the challenging problem for Indian languages. Every day we can see some machine
translators being developed , but getting a high quality automatic translation is still a very distant dream .
The correct translated sentence for Hindi language is rarely found. In this paper, we are emphasizing on
English-Hindi language pair, so in order to preserve the correct MT output we present a ranking system,
which employs some machine learning techniques and morphological features. In ranking no human
intervention is required. We have also validated our results by comparing it with human ranking.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
Source and target word segmentation and alignment is a primary step in the statistical learning of a Transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful in the case of dealing with Out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked if this would be true for Indic languages which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English to Telugu transliteration (our method will apply equally well to most written Indic languages); our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum entropy based Transliteration learning system due to Kumaran and Kellner; our testing consisted of using the same segmentation of a test English word, followed by applying the model, and reranking the resulting top 10 Telugu words. We also report the dataset creation and selection since standard datasets are not available.
In this paper we discuss the difficulties in processing the Malayalam texts for Statistical Machine Translation (SMT), especially the verb forms. Mostly the agglutinative nature of Malayalam is the main issue with the processing of text. We mainly focus on the verbs and its contribution in adding the difficulty in processing. The verb plays a crucial role in defining the sentence structure. We illustrate the issues with the existing google translation system and the trained MOSES system using limited set of English- Malayalam parallel corpus. Our reference for analysis is English-Malayalam language pair.
Natural Language Toolkit (NLTK) is a generic platform to process the data of various natural (human)
languages and it provides various resources for Indian languages also like Hindi, Bangla, Marathi and so
on. In the proposed work, the repositories provided by NLTK are used to carry out the processing of Hindi
text and then further for analysis of Multi word Expressions (MWEs). MWEs are lexical items that can be
decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical
idiomaticity. The main focus of this paper is on processing and analysis of MWEs for Hindi text. The
corpus used for Hindi text processing is taken from the famous Hindi novel “KaramaBhumi by Munshi
PremChand”. The result analysis is done using the Hindi corpus provided by Resource Centre for Indian
Language Technology Solutions (CFILT). Results are analysed to justify the accuracy of the proposed
work.
Anaphora resolution in hindi language using gazetteer methodijcsa
Anaphora resolution is one of the active research areas within the realm of natural language processing.
Resolution of anaphoric reference is one of the most challenging and complex tasks to be handled. This
paper focuses entirely on pronominal anaphora resolution for the Hindi language. There are various
methodologies for resolving anaphora. This paper presents a computational model for anaphora resolution
in Hindi that is based on the gazetteer method: the creation of lists, followed by operations that
classify the elements present in each list. There are many salient factors for resolving anaphora.
The proposed model resolves anaphora by using two factors, animacy and recency. The animistic factor
indicates whether a referent is a living or non-living thing, whereas recency captures the fact that
referents mentioned in the current sentence tend to carry higher weights than those in previous
sentences. This paper describes experiments conducted on short Hindi stories, news articles and
biography content from Wikipedia, their results, and future directions to improve accuracy.
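As a rough illustration of the two factors, a resolver along these lines can be sketched in Python. The gazetteer entries, weighting scheme and function names below are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of two-factor pronominal anaphora resolution:
# recency (nearer sentences weigh more) and animacy (from a gazetteer list).
ANIMATE_GAZETTEER = {"raam", "siitaa", "ladkaa"}  # toy gazetteer of animate nouns

def score_candidates(candidates, pronoun_is_animate, current_sentence):
    """candidates: list of (noun, sentence_index) seen so far in the discourse."""
    scored = []
    for noun, sent_idx in candidates:
        # Recency: referents in the current sentence outrank earlier ones.
        recency = 1.0 / (1 + (current_sentence - sent_idx))
        # Animacy: keep only candidates whose animacy matches the pronoun's.
        animate = noun in ANIMATE_GAZETTEER
        if animate == pronoun_is_animate:
            scored.append((recency, noun))
    return max(scored)[1] if scored else None

# Usage: resolve an animate pronoun occurring in sentence 2.
cands = [("raam", 1), ("kitaab", 1), ("ladkaa", 2)]
print(score_candidates(cands, True, 2))  # -> ladkaa (animate and most recent)
```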
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...kevig
Neural machine translation is a new approach to machine translation that has shown effective results
for high-resource languages. Recently, attention-based neural machine translation with large-scale
parallel corpora has played an important role in achieving high-performance translation results. In this
research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural
machine translation models are introduced at word-to-word, character-to-word, and syllable-to-word
levels. We experiment with the proposed models on translating long sentences and on addressing
morphological problems. To mitigate the low-resource problem, source-side monolingual data are also
used. This work thus investigates improving Myanmar-to-English neural machine translation. The
experimental results show that the syllable-to-word level neural machine translation model obtains an
improvement over the baseline systems.
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN...csandit
Sentiment Analysis (SA) and machine learning techniques are combined to understand the attitude of a text's writer as implied in a particular text. Although SA is a challenging problem in itself, it is especially challenging for the Arabic language. In this paper, we enhance sentiment analysis for Arabic. Our approach begins with special pre-processing steps. We then adopt the sentiment keywords co-occurrence measure (SKCM) as an algorithm for extracting a sentiment-based feature selection method. This feature selection method was applied to three sentiment corpora using an SVM classifier. We compared our approach with some traditional methods followed by most SA works. The experimental results were very promising for enhancing SA accuracy.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid ApproachIJERA Editor
Language is an effective medium for communication that conveys the ideas and expressions of the human
mind. There are more than 5000 languages in the world. Knowing all these languages is not a solution to
the problems caused by the language barrier in communication. In this multilingual world, with a huge
amount of information exchanged between various regions and in different languages in digitized form,
it has become necessary to find an automated process to convert from one language to another. Natural
Language Processing (NLP) is one of the hot areas of research that explores how computers can be
utilized to understand and manipulate natural language text or speech. In the proposed system, a hybrid
approach to transliterate proper nouns from Punjabi to Hindi is developed. The hybrid approach in the
proposed system is a combination of direct mapping, a rule-based approach and the Statistical Machine
Translation (SMT) approach. The proposed system is tested on various proper nouns from different
domains, and its accuracy is very good.
Anaphora resolution is the process of finding referents in discourse. In computational linguistics,
anaphora resolution is a complex and challenging task. This paper focuses on pronominal anaphora
resolution, a subpart of anaphora resolution in which pronouns are resolved to noun referents.
Including anaphora resolution in applications such as automatic summarization, opinion mining, machine
translation and question answering systems increases their accuracy by 10%. Related work in this field
has been done in many languages. This paper focuses on resolving anaphora for the Punjabi language. A
model is proposed for resolving anaphora, and an experiment is conducted to measure the accuracy of the
system. The model uses two factors: recency and animistic knowledge. The recency factor works on the
concept of the Lappin and Leass approach, and the gazetteer method is used to introduce animistic
knowledge. The experiment is conducted on a Punjabi story containing more than 1000 words, and the
results are presented along with future directions.
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are being
utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is
an essential part of most Natural Language Processing tasks, especially for derivative languages such as
Arabic. However, stemming is faced with the problem of ambiguity, where two or more roots could be
extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its
effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, at least a 9.4% improvement
over other stemmers.
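One common way to smooth PMI is to raise context counts to a power alpha (typically 0.75) before normalising, which lifts the probability mass of rare contexts. A minimal sketch, assuming this variant (the abstract does not specify which smoothing SPMI uses here):

```python
import math
from collections import Counter

def spmi(pairs, alpha=0.75):
    """Smoothed PMI over (word, context) co-occurrence pairs.
    Context counts are raised to alpha before normalising."""
    pair_c = Counter(pairs)
    word_c = Counter(w for w, _ in pairs)
    ctx_c = Counter(c for _, c in pairs)
    total = len(pairs)
    ctx_norm = sum(v ** alpha for v in ctx_c.values())
    scores = {}
    for (w, c), n in pair_c.items():
        p_wc = n / total                          # joint probability
        p_w = word_c[w] / total                   # word probability
        p_c = (ctx_c[c] ** alpha) / ctx_norm      # smoothed context probability
        scores[(w, c)] = math.log(p_wc / (p_w * p_c))
    return scores

# Toy usage: the frequent pair ("a", "x") gets a positive association score.
scores = spmi([("a", "x"), ("a", "x"), ("b", "y")])
print(scores[("a", "x")])
```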
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
Manipuri is both a minority and a morphologically rich language with genetic features similar to Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language
Processing (NLP) is a useful research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part-of-Speech
(POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD) etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, together with a comparison table of the approaches and techniques adopted and the results obtained by each application, followed by a detailed discussion of each work.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags, and of a
context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function
tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply the context-free grammar (CFG) to derive the parse tree of function-tagged Myanmar sentences. Experiments
show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
FUEL SUBSIDY: THE DAY OF RAGE By Olusegun AdeniyiOlusegun Adeniyi
With the nation evidently on tenterhooks, the House of Representatives convened an emergency session on Sunday, the 8th of January 2012, and decided among other things, to probe the management of the fuel subsidy scheme so as to unearth the factors responsible for the rot and find out how to tackle the structural and procedural deficiencies that have contributed to our woes as a nation. At that same session, the House also appealed to President Goodluck Jonathan to suspend the fuel price hike until a more fortuitous time.
While those interested in muckraking will find enough materials with which to peddle their trade from the testimonies of all the principal stakeholders, serious minded people will find sufficient proof (in the report) of the waste and corruption that are inherent in an unbridled regime of subsidy that is beyond reforming.
Perhaps with that, we can begin an honest conversation about whether it is indeed in the interest of our country to continue with the regime of fuel subsidy that has made emergency billionaires of some rent-seekers in the private sector and their collaborators in government while making nonsense of our oil and gas sector.
A Review on a web based Punjabi to English Machine Transliteration SystemEditor IJCATR
The paper presents the transliteration of noun phrases from Punjabi to English using a statistical machine
translation approach. Transliteration maps the letters of a source script to the letters of another language. Forward transliteration converts an original
word or phrase in the source language into a word in the target language. Backward transliteration is the reverse process that converts
the transliterated word or phrase back into its original word or phrase. Transliteration is an important part of research in NLP. Natural
Language Processing (NLP) is the ability of a computer program to understand human speech as it is spoken, and is an important
component of AI. Artificial Intelligence is a branch of science which deals with helping machines find solutions to complex problems
in a human-like fashion. The transliteration system is developed using SMT. Statistical Machine Translation (SMT) is a data-
oriented statistical framework for translating text from one natural language to another based on the knowledge.
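The relative-frequency estimate that SMT toolkits such as MOSES use for their translation tables can be sketched as simple counting over aligned pairs. A minimal illustration; the Gurmukhi targets below are placeholders, not real training data:

```python
from collections import Counter, defaultdict

def relative_freq_table(aligned_pairs):
    """Estimate P(target | source) by relative frequency, the standard
    phrase-table estimate in statistical machine translation.
    aligned_pairs: list of (source_unit, target_unit) alignments."""
    pair_counts = Counter(aligned_pairs)
    src_counts = Counter(s for s, _ in aligned_pairs)
    table = defaultdict(dict)
    for (s, t), n in pair_counts.items():
        table[s][t] = n / src_counts[s]   # count(s, t) / count(s)
    return table

# Toy alignments: "sin" maps to two competing targets with different frequencies.
pairs = [("sin", "ਸਿਨ"), ("sin", "ਸਿਨ"), ("sin", "ਸੀਨ"), ("gh", "ਘ")]
table = relative_freq_table(pairs)
print(table["sin"])
```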
High-quality end-to-end speech translation models rely on large-scale speech-to-text training data,
which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we
propose a target-side data augmentation method for low-resource speech translation. In particular,
we first generate large-scale target-side paraphrases based on a paraphrase generation model
that incorporates several statistical machine translation (SMT) features and the commonly used
recurrent neural network (RNN). Then, a filtering model combining semantic similarity and
speech-word pair co-occurrence is proposed to select the highest-scoring source-paraphrase
pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian,
Slovenian and Swedish paraphrase generation show that the proposed method achieves significant
and consistent improvements over several strong baseline models on the PPDB (http://paraphrase.
org/) datasets. To introduce the paraphrase generation results into low-resource speech translation, we
propose two strategies: audio-text pair recombination and multi-reference training. Experimental
results show that speech translation models trained on new audio-text datasets that combine
the paraphrase generation results lead to substantial improvements over the baselines, especially for
low-resource languages.
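The filtering step can be illustrated as a weighted combination of the two scores, semantic similarity and co-occurrence, followed by keeping the top-scoring candidates. The weight and the candidate scores below are assumptions, not the paper's model:

```python
# Illustrative sketch of paraphrase-pair filtering: score each candidate by a
# weighted mix of semantic similarity and word-pair co-occurrence, keep the best.
def select_paraphrases(candidates, weight_sim=0.6, top_k=1):
    """candidates: list of (paraphrase, semantic_sim, cooccurrence) tuples,
    with both scores already normalised to [0, 1]."""
    scored = [
        (weight_sim * sim + (1 - weight_sim) * cooc, text)
        for text, sim, cooc in candidates
    ]
    scored.sort(reverse=True)                 # highest combined score first
    return [text for _, text in scored[:top_k]]

cands = [("a big cat", 0.9, 0.4), ("a large cat", 0.8, 0.9), ("cat", 0.5, 0.5)]
print(select_paraphrases(cands))  # -> ['a large cat']
```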
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)ijnlc
Extracting key phrases from documents is a common task in many applications. In general, a noun
phrase extractor consists of three modules: tokenization, part-of-speech tagging and noun phrase
identification. These are used as the three main steps in building the new system, ANPE. This paper
aims at extracting Arabic noun phrases from a corpus of documents, with the relevant criteria (recall
and precision) used as the evaluation measure. On the one hand, when using NPs rather than single
terms, the system yields more relevant documents among those retrieved; on the other hand, it gives
low precision because the number of retrieved documents decreases. Finally, the researchers conclude
and recommend improvements for more effective and efficient research in the future.
Part of speech tagging is one of the basic steps in natural language processing. Although it has been
investigated for many languages around the world, very little has been done for Setswana language.
Setswana language is written disjunctively and some words play multiple functions in a sentence. These
features make part of speech tagging more challenging. This paper presents a finite state method for
identifying one of the compound parts of speech, the relative. Results show an 82% identification rate
which is lower than for other languages. The results also show that the model can identify the start of a
relative 97% of the time but fails to identify where it stops 13% of the time. The model fails due to the
limitations of the morphological analyser and due to more complex sentences not accounted for in the
model.
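A finite-state detector of this kind can be sketched as a two-state machine: a scanning state waits for an opening marker, then an accepting state consumes tokens until a closing cue. The marker tokens below are hypothetical stand-ins, not actual Setswana morphology:

```python
# Minimal finite-state sketch of compound-POS span detection.
START_MARKERS = {"yo", "se"}   # assumed relative-opening tokens (illustrative)
END_MARKERS = {".", ","}       # assumed closing cues (illustrative)

def find_relative_span(tokens):
    state, start = "SCAN", None
    for i, tok in enumerate(tokens):
        if state == "SCAN" and tok in START_MARKERS:
            state, start = "IN_RELATIVE", i     # opening marker found
        elif state == "IN_RELATIVE" and tok in END_MARKERS:
            return (start, i)                   # full span identified
    # Mirrors the failure mode in the abstract: start found, end missed.
    return (start, None) if start is not None else None

print(find_relative_span(["monna", "yo", "o", "tsamayang", ","]))  # -> (1, 4)
```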
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades, speech-interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover different speech orders so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology. That is why the speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with the capability of automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a substantial reduction in corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
ANNOTATED GUIDELINES AND BUILDING REFERENCE CORPUS FOR MYANMAR-ENGLISH WORD A...ijnlc
A reference corpus for word alignment is an important resource for developing and evaluating word alignment methods. For the Myanmar-English language pair, there is no reference corpus for evaluating word alignment tasks. Therefore, we created guidelines for Myanmar-English word alignment annotation between the two languages through contrastive learning, and built a Myanmar-English reference corpus consisting of verified alignments from the Myanmar portion of the Asian Language Treebank (ALT). This reference corpus contains confidence labels, sure (S) and possible (P), for word alignments, which are used for evaluating word alignment tasks. We discuss the most common linking ambiguities in order to define consistent and systematic instructions for manual word alignment. We evaluated annotator agreement using our reference corpus in terms of alignment error rate (AER) for word alignment tasks, and discuss the word relationships in terms of BLEU scores.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat
services. As a result, a text containing various English and Bangla words and phrases has become
increasingly common. Existing transliteration tools display poor performance for such texts. This paper
proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English
words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of
dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it
significantly outperforms the benchmark transliteration tool.
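A dictionary-first pipeline with a rule-based fallback, in the spirit of the staged hybrid described above, might look like the following. The dictionary entry and character rules are illustrative placeholders, not THT's actual tables:

```python
# Hedged sketch of a staged hybrid transliterator: dictionary lookup first,
# then greedy longest-match character rules, then pass-through for unknowns.
DICTIONARY = {"dhonnobad": "ধন্যবাদ"}       # known phonetic-typed words (toy)
CHAR_RULES = [("sh", "শ"), ("bh", "ভ"),      # multi-character rules first,
              ("a", "া"), ("b", "ব")]        # so longest matches win

def transliterate(word):
    # Stage 1: exact dictionary lookup.
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Stage 2: greedy longest-match character rules.
    out, i = [], 0
    while i < len(word):
        for src, tgt in CHAR_RULES:
            if word.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            # Stage 3: pass unknown characters through unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)

print(transliterate("dhonnobad"))  # dictionary hit
print(transliterate("bhab"))       # rule-based fallback
```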
A SELF-SUPERVISED TIBETAN-CHINESE VOCABULARY ALIGNMENT METHODIJwest
Tibetan is a low-resource language. In order to alleviate the shortage of parallel Tibetan-Chinese corpora, this paper uses two monolingual corpora and a small number of seed dictionaries to learn alignments, combining a semi-supervised method based on seed dictionaries with a self-supervised adversarial training method through similarity calculation of word clusters in different embedding spaces, and puts forward an improved self-supervised adversarial learning method that uses only Tibetan and Chinese monolingual data. The experimental results are as follows: the semi-supervised method with seed dictionaries achieved a top-10 predicted-word accuracy of 66.5 (Tibetan-Chinese) and 74.8 (Chinese-Tibetan), while the improved self-supervised method reached an accuracy of 53.5 in both language directions.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical
source. Our method leverages a resource available in another language, such as SentiWordNet in
English, for the resource-limited Arabic language. The knowledge taken from the external resource is
injected into the feature model while the machine-learning-based classifier is trained. The first step
of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step calculates
polarity scores using machine translation techniques and the English SentiWordNet. The scores for each
text are added to the model as three features: objective, positive and negative. The last step of our
method involves training the ML classifier on that model to predict the sentiment of the Arabic text.
Our method increases performance compared with the BOW baseline model in most cases. In addition, it
appears to be a viable approach to sentiment analysis in Arabic text where available resources are
limited.
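The feature-injection step, appending the three lexicon-derived scores to each bag-of-words vector, can be sketched as follows. The tiny lexicon here is a made-up stand-in for the translated SentiWordNet scores:

```python
# Illustrative sketch of lexicon-based feature injection for sentiment analysis.
LEXICON = {  # word -> (objective, positive, negative), values in [0, 1] (toy)
    "good": (0.25, 0.75, 0.0),
    "bad": (0.25, 0.0, 0.75),
}

def add_polarity_features(bow_vector, tokens):
    """Append summed (objective, positive, negative) lexicon scores
    to an existing bag-of-words feature list."""
    obj = pos = neg = 0.0
    for tok in tokens:
        o, p, n = LEXICON.get(tok, (0.0, 0.0, 0.0))  # unknown words score 0
        obj, pos, neg = obj + o, pos + p, neg + n
    return bow_vector + [obj, pos, neg]

print(add_polarity_features([1, 0, 2], ["good", "movie", "good"]))
# -> [1, 0, 2, 0.5, 1.5, 0.0]
```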
Similar to RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (20)
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with Open AI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfMalak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI
International Journal on Natural Language Computing (IJNLC) Vol. 2, No. 2, April 2013
DOI: 10.5121/ijnlc.2013.2207

RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI

Deepti Bhalla 1, Nisheeth Joshi 2 and Iti Mathur 3
1,2,3 Apaji Institute, Banasthali University, Rajasthan, India
1 deeptibhalla0600@gmail.com, 2 nisheeth.joshi@rediffmail.com, 3 mathur_iti@rediffmail.com
ABSTRACT
Machine transliteration has emerged as an important research area within machine translation. Transliteration aims to preserve the phonological structure of words, and proper transliteration of named entities plays a significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule based approach. We have constructed rules for syllabification, the process of separating a word into its syllables. We calculate transliteration probabilities for named entities (proper names and locations). For words that do not fall under the category of named entities, separate probabilities are calculated using relative frequency through MOSES, a statistical machine translation toolkit. Using these probabilities we transliterate the input text from English to Punjabi.
KEYWORDS
Machine Translation, Machine Transliteration, Named Entity Recognition, Syllabification.
1. INTRODUCTION
Machine transliteration plays an important role in cross-lingual information retrieval. The term named entity is widely used in natural language processing. Named entity recognition has many applications in natural language processing, such as information extraction (extracting structured text from unstructured text), data classification and question answering. In this paper we focus on the transliteration of proper names and locations using a statistical approach. The process of transliterating a word from its native language into a foreign language is called forward transliteration, and the reverse process is called backward transliteration. This paper addresses the problem of forward transliteration, i.e. from English to Punjabi. For example, 'Silver' in English is transliterated into Punjabi as 'ਿਸਲਵਰ' rather than translated as 'ਚਾਂਦੀ'. Our main aim is to develop a transliteration system which can properly transliterate named entities, as well as words that are not named entities, from the source language to the target language. We have used an English-Punjabi parallel corpus of proper names and locations to perform our experiment.
The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 gives a brief analysis of the English and Punjabi scripts. Section 4 explains our methodology and experimental setup. Section 5 presents evaluation and results. Finally, Section 6 concludes the paper.
2. RELATED WORK
A lot of work has been carried out in the field of machine transliteration. Deep and Goyal [1] addressed the problem of transliterating Punjabi to English using a rule based approach. Their scheme uses a grapheme based method to model the transliteration problem; it demonstrated transliteration of common names from Punjabi to English and achieved an accuracy of 93.22%. Sharma et al. [2] presented English-Hindi transliteration using statistical machine translation in different notations. They showed that transliteration in WX-notation gives improved results over UTF-notation when training phrase based statistical machine translation on an English-Hindi corpus of Indian names. Deep et al. [3] proposed a hybrid approach for transliteration from Punjabi to English, addressing the problem with a combined statistical and rule based approach; the authors used letter-to-letter mapping as a baseline and sought improvements through statistical methods. Kaur and Josan [4] proposed a statistical approach to transliteration from English to Punjabi using MOSES, a statistical machine translation tool. They applied transliteration rules and reported the system's average accuracy and BLEU score, which came out to 63.31% and 0.4502 respectively. Padda et al. [5] presented a system for converting Punjabi text to IPA, describing the conversion steps and showing how each Punjabi letter maps directly to the corresponding IPA symbol representing the sounds of the spoken language. Josan et al. [6] presented a novel approach to improve Punjabi to Hindi transliteration by combining a basic character-to-character mapping approach with rule based and Soundex based enhancements; experimental results show that this approach effectively improves the word accuracy rate and average Levenshtein distance. Dhore et al. [7] addressed machine transliteration of named entities from Hindi (written in Devanagari script) to English, using Conditional Random Fields (CRF) as a statistical probability tool and n-grams as the feature set; they achieved an accuracy of 85.79%. Musa et al. [8] proposed a new syllabification algorithm for the Malay language based on syllable rule matching.
3. ANALYSIS OF ENGLISH AND PUNJABI SCRIPT
Punjabi is an ancient Indo-Aryan language mainly spoken in the Punjab region, and is the world's 14th most widely spoken language. It is written in the Gurmukhi script, which is related to Devanagari. Punjabi is one of the official languages of India and the official language of the state of Punjab; it is also spoken in neighbouring areas such as Haryana and Delhi. The Punjabi language has 19 vowels (10 independent vowels and 9 dependent vowels) and 41 consonants. English, on the other hand, is written in the Roman script and is one of the six official languages of the United Nations. It has 26 letters, of which 21 are consonants and 5 are vowels.
4. METHODOLOGY AND EXPERIMENTAL SETUP
In our work, we use an English-Punjabi parallel corpus of named entities. We emphasize transliteration of the input text with the help of rule matching, and we use a statistical approach for transliteration: we train and test our model using the MOSES toolkit. With the help of GIZA++ we calculate translation probabilities based on relative frequency. For constructing our rules we use a syllabification approach. A syllable is a combination of vowels and consonants in patterns such as V, VC, CV, VCC, CVC, CCVC and CVCC. Almost all languages have VC, CV or CVC structures, so we use vowels and consonants as the basic phonification units.
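These V/C skeletons can be computed mechanically. A minimal sketch in Python (assuming a simple five-vowel class for romanized input; the paper does not publish code, so this is only illustrative):

```python
def cv_pattern(syllable, vowels="aeiou"):
    """Map a romanized syllable to its vowel/consonant skeleton,
    e.g. 'run' -> 'CVC', 'an' -> 'VC'."""
    return "".join("V" if ch in vowels else "C" for ch in syllable.lower())

print(cv_pattern("run"))  # CVC
print(cv_pattern("an"))   # VC
```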
Some possible rules that we have constructed for syllabification are shown in Table 1.

Table 1: Possible combinations for syllables

Syllable structure | Example | Syllabified form (English) | Syllabified form (Punjabi)
V    | eka     | [e] [ka]        | [ਏ] [ਕਾ]
CV   | tarun   | [ta] [run]      | [ਤ] [ਰੁਣ]
VC   | angela  | [an] [ge] [la]  | [ਅਨ] [ਗੇ] [ਲਾ]
CVC  | sidhima | [si] [dhi] [ma] | [ਿਸ] [ਧੀ] [ਮਾ]
CCVC | Odisha  | [o] [di] [sha]  | [ਓ] [ਡੀ] [ਸ਼ਾ]
VCC  | obhika  | [o] [bhi] [ka]  | [ਓ] [ਭੀ] [ਕਾ]

V = Vowel, C = Consonant
A word P is syllabified into the form {p1, p2, p3, …, pn}, where p1, p2, …, pn are the individual syllables. We have syllabified our English input using a rule based approach, for which we used the following algorithm.
Algorithm for syllable extraction
1. Enter input string in English.
2. Identify Vowels and Consonants.
3. Identify Vowel-Consonant combination and consider it as one syllable.
4. Identify Consonants followed by Vowels and consider them as separate syllables.
5. Identify Vowels followed by two continuous Consonants as separate syllable.
6. Consider Vowel surrounded by two Consonants as separate syllable.
7. Transliterate each syllable into Punjabi.
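The core grouping steps above can be approximated in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it greedily closes a syllable at each vowel and attaches trailing consonants to the previous syllable, and it does not separately apply the VC/VCC closing rules (steps 5-6):

```python
VOWELS = set("aeiou")

def syllabify(word):
    """Greedy CV syllabifier: accumulate consonants until a vowel
    closes the syllable; trailing consonants join the last syllable.
    E.g. 'tarun' -> ['ta', 'run'], 'sidhima' -> ['si', 'dhi', 'ma']."""
    syllables, current = [], ""
    for ch in word.lower():
        current += ch
        if ch in VOWELS:          # a vowel closes the current syllable
            syllables.append(current)
            current = ""
    if current:                   # trailing consonants (e.g. 'n' in 'tarun')
        if syllables:
            syllables[-1] += current
        else:
            syllables.append(current)
    return syllables

print(syllabify("tarun"))   # ['ta', 'run']
print(syllabify("eka"))     # ['e', 'ka']
```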
To convert the English input into equivalent Punjabi output we first recognize the named entities in the input sentence. For this we used the Stanford NER tool [10], which identifies named entities in English text. The text entered by the user is first analyzed and then pre-processed. After pre-processing, if the selected input is a proper name or location it is passed to the syllabification module, through which the syllables are extracted. After selection of the equivalent probability, the syllables are combined to form the Punjabi word. If the selected input is not a named entity, it is still passed to the syllabification module and is transliterated with the help of probability matching; separate probabilities are calculated for words that are not named entities.
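This routing can be sketched as follows. The probability tables and the syllabify() helper here are hypothetical stand-ins for the MOSES/GIZA++ trained models, used only to show the control flow:

```python
def transliterate_word(word, is_named_entity, ne_probs, general_probs, syllabify):
    """Route a word through the pipeline: syllabify it, look each
    syllable up in the appropriate probability table (named-entity
    vs. general), and join the best Punjabi syllables."""
    table = ne_probs if is_named_entity else general_probs
    pieces = []
    for syl in syllabify(word):
        punjabi, _prob = table[syl]   # best-scoring mapping for this syllable
        pieces.append(punjabi)
    return "".join(pieces)

# Toy table reproducing the paper's 'Teena' -> 'ਟੀਨਾ' example:
ne_probs = {"tee": ("ਟੀ", 0.9), "na": ("ਨਾ", 0.8)}
word = transliterate_word("teena", True, ne_probs, {}, lambda w: ["tee", "na"])
print(word)  # ਟੀਨਾ
```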
Table 2: Corpus

           English Words   Punjabi Words   English Sentences
Training   25,500          25,500          -
Testing    6080            -               1500
Table 2 shows the English-Punjabi parallel corpus that we used for performing our experiment. We trained our language model using the IRSTLM toolkit [9]. Our transliteration system follows the steps represented in Figure 1.
Example:
Source sentence: Teena is going to Haryana
Target sentence: ਟੀਨਾ ਇਜ ਗੋਇੰਗ ਟੂ ਹਿਰਆਣਾ
First the named entities are extracted: Teena (proper name) and Haryana (location). After pre-processing, these words are checked with the named entity recognizer to identify whether they are named entities. If a word is a named entity of type proper name or location, it is passed to the syllabification module. In this case the name Teena is syllabified as Tee na and the location name Haryana is syllabified as Har ya na. After syllabification, these words are looked up against their corresponding English-Punjabi probabilities and the equivalent Punjabi syllables are selected. Words other than named entities are syllabified and matched against the separate probabilities; for example, going is syllabified as go ing. Finally these syllables are joined to generate the final output sentence.
Through language model training we have calculated N-gram probabilities, which are based on relative frequency. For calculating the N-gram probability we use the following formula [11]:

P(wn | wn-1) = count(wn-1, wn) / count(wn-1)        (1)

That is, the probability of a word wn given the preceding word wn-1 is the count of the bigram (wn-1, wn) divided by the count of wn-1.
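Relative-frequency estimation as in equation (1) can be illustrated in a few lines of Python. This is a toy sketch over a syllable sequence, not the IRSTLM training itself:

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates:
    P(wn | wn-1) = count(wn-1, wn) / count(wn-1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = bigram_probs(["ka", "mal", "ka", "ran"])
print(probs[("ka", "mal")])  # 0.5  (count('ka','mal') = 1, count('ka') = 2)
```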
For example:
for the name Kunal, the per-syllable probabilities could be as follows:

Prob(ਕੁ | ku) = 0.8797654
Prob(ਣਾਲ | nal) = 0.7333333

Final Score = Prob(ਕੁ | ku) × Prob(ਣਾਲ | nal) = 0.8797654 × 0.7333333 = 0.64516127
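The final score is simply the product of the per-syllable probabilities; as a quick check in Python:

```python
from math import prod

def transliteration_score(syllable_probs):
    """Score a candidate transliteration as the product of its
    per-syllable probabilities, as in the Kunal example above."""
    return prod(syllable_probs)

score = transliteration_score([0.8797654, 0.7333333])
print(round(score, 6))  # 0.645161
```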
Figure 1: System architecture
After selecting the equivalent Punjabi syllables on the basis of these probabilities, the syllables are joined together and the final output is obtained.
5. EVALUATION AND RESULTS
We used a test corpus of 1500 sentences containing 15982 words. Of these, 45% (7192 words) were named entities, i.e. proper names, locations, organizations, and date & time expressions. Among the named entities, 88% (6328 words) were of type person or location, and of these our system correctly transliterated 6109. The remaining 55% (8790 words) were not named entities, and of these our system correctly transliterated 7987. This gives an accuracy of 88.19%, calculated using the formula:

Accuracy = (Number of correctly transliterated words / Total number of words) × 100        (2)
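Applying equation (2) to the counts reported above (6109 + 7987 correct out of 15982 words) reproduces the stated accuracy; a quick check in Python:

```python
def accuracy(correct, total):
    """Equation (2): percentage of correctly transliterated words."""
    return correct / total * 100

print(round(accuracy(6109 + 7987, 15982), 2))  # 88.2 (reported as 88.19%)
```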
Some of the names correctly transliterated by our system are as follows:
Table 3: Correct output
spardha ਸਪਰਧਾ
sanbosh ਸੰਬੋਸ਼
kamaldeep ਕਮਲਦੀਪ
haryana ਹਿਰਆਣਾ
shelly ਸ਼ੇਲੀ
Some of the names which are not correctly transliterated by our system are as follows:
Table 4: Incorrect output
falkanvai ਫਲਕੰਵਆ
gowryharan ਗੋਰੀਹਰਣ
seerajuddeen ਸੀਰਾਜੁਿਦਨ
kolkandkar ਕੋਲਕੰ ਡਕਰ
6. CONCLUSION
In this paper we performed our experiment using a statistical machine translation tool. We showed how syllables can be extracted from the input text, and how named entities (proper names and locations) as well as the remaining words can be transliterated from the source language to the target language through probability calculation. Such a transliteration system can benefit machine translation and cross-lingual information retrieval. Through this approach we attained an accuracy of 88.19%. In future work we will address the remaining named entity types that were not included in this experiment.
7. REFERENCES
[1] Kamal Deep and Vishal Goyal, (2011) "Development of a Punjabi to English transliteration system", International Journal of Computer Science and Communication, Vol. 2, No. 2, pp. 521-526.
[2] Shubhangi Sharma, Neha Bora and Mitali Halder, (2012) "English-Hindi Transliteration using Statistical Machine Translation in different Notation", International Conference on Computing and Control Engineering (ICCCE 2012).
[3] Kamal Deep and Vishal Goyal, (2011) "Hybrid Approach for Punjabi to English Transliteration System", International Journal of Computer Applications (0975-8887), Vol. 28, No. 1.
[4] Jasleen Kaur and Gurpreet Singh Josan, (2011) "Statistical Approach to Transliteration from English to Punjabi", International Journal on Computer Science and Engineering (IJCSE), Vol. 3, Issue 4, p. 1518.
[5] Sheilly Padda, Rupinderdeep Kaur and Nidhi, (2012) "Punjabi Phonetic: Punjabi Text to IPA Conversion", International Journal of Emerging Technology and Advanced Engineering, ISSN 2250-2459, Vol. 2, Issue 10.
[6] Gurpreet Singh Josan and Gurpreet Singh Lehal, (2010) "A Punjabi to Hindi Machine Transliteration System", Computational Linguistics and Chinese Language Processing, Vol. 15, No. 2, pp. 77-102.
[7] Manikrao L. Dhore, Shantanu K. Dixit and Tushar D. Sonwalkar, (2012) "Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields", International Journal of Computer Applications, Vol. 48, p. 31.
[8] Hafiz Musa, Rabith A. Kadir, Azreen Azman and M. Taufik Abadullah, (2011) "Syllabification algorithm based on syllable rules matching for Malay language", Proceedings of the 10th WSEAS International Conference on Applied Computer and Applied Computational Science, World Scientific and Engineering Academy and Society (WSEAS).
[9] IRSTLM toolkit, available for download at http://www.statmt.org
[10] Jenny Rose Finkel, Trond Grenager and Christopher Manning, (2005) "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
[11] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Authors

Deepti Bhalla is pursuing her M.Tech in Computer Science from Banasthali University, Rajasthan, and is working as a Research Assistant on the English-Indian Languages Machine Translation System project sponsored by the TDIL Programme, DEITY. Her interest lies in Machine Translation, specifically for the English-Punjabi language pair. She has developed various tools for Punjabi language processing. Her current research interests include Natural Language Processing and Machine Translation.

Nisheeth Joshi is a researcher working in the area of Machine Translation. He has been primarily working on the design and development of evaluation metrics for Indian languages. Besides this, he is also actively involved in the development of MT engines for English to Indian languages. He is one of the experts empanelled with the TDIL Programme, Department of Electronics and Information Technology (DEITY), Govt. of India, a premier organization which oversees language technology funding and research in India. He has several publications in various journals and conferences and also serves on the programme committees and editorial boards of several conferences and journals.

Iti Mathur is an Assistant Professor at Banasthali University. Her primary areas of research are computational semantics and ontological engineering. Besides this, she is also involved in the development of MT engines for English to Indian languages. She is one of the experts empanelled with the TDIL Programme, Department of Electronics and Information Technology (DEITY), Govt. of India, a premier organization which oversees language technology funding and research in India. She has several publications in various journals and conferences and also serves on the programme committees and editorial boards of several conferences and journals.