Named Entity Recognition (NER) is essential to natural language processing research for the Myanmar language. In this work, Myanmar NER is treated as a sequence tagging problem, and the effectiveness of deep neural networks on Myanmar NER is investigated. Experiments apply deep neural network architectures to syllable-level Myanmar text. The first manually annotated NER corpus for the Myanmar language is also constructed and proposed. In developing this in-house NER corpus, sentences from online news websites were used, along with sentences from the ALT-Parallel-Corpus, part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for Myanmar. The experimental results show that neural sequence models can produce promising results compared with a baseline CRF model; among the neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to assess the effectiveness of neural network approaches to Myanmar text processing and to promote further research on this understudied language.
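The syllable-level sequence-tagging framing described above can be sketched concretely: given an already-segmented syllable sequence and gold entity spans, emit the BIO tags that a BiLSTM-CRF would be trained to predict. The `bio_tags` helper and the romanized placeholder syllables below are illustrative inventions, not the paper's actual data, tagset, or code.

```python
def bio_tags(syllables, entity_spans):
    """Convert token-level entity spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(syllables)
    for start, end, label in entity_spans:
        tags[start] = "B-" + label          # first syllable of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation syllables
    return tags

# Romanized placeholders standing in for Myanmar syllables:
# a PER entity covers syllables 0-3, a LOC entity covers syllable 6.
syllables = ["aung", "san", "suu", "kyi", "went", "to", "yangon"]
tags = bio_tags(syllables, [(0, 4, "PER"), (6, 7, "LOC")])
```

Framing NER this way reduces corpus construction to marking spans; the tagger then learns to recover the BIO sequence from raw syllables.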
Myanmar named entity corpus and its use in syllable-based neural named entity... (IJECEIAES)
Myanmar is a low-resource language, which is one of the main reasons Myanmar natural language processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for Myanmar was developed and proposed to support the evaluation of named entity extraction. At present, the corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar named entity recognition. Experimental results from 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than a baseline CRF model. This work also aims to assess the effectiveness of neural network approaches to text processing for Myanmar and to promote future research on this understudied language.
Phrase Structure Identification and Classification of Sentences using Deep Le... (ijtsrd)
Phrase structure is the arrangement of words in a specific order based on the constraints of a given language. This arrangement follows phrase structure rules, which correspond to the productions of a context-free grammar. Phrase structure can be identified by breaking a natural language sentence into its constituents, which may be lexical or phrasal categories; this is done by parsing the sentences, i.e., syntactic analysis. The proposed system addresses this problem with a deep learning strategy: instead of a rule-based technique, supervised sequence labelling is performed using IOB labels. This sequence classification problem has been trained and modeled using an RNN with LSTM units. The proposed work shows considerable results and can be applied in many NLP applications. Hashi Haris | Misha Ravi, "Phrase Structure Identification and Classification of Sentences using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23841.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23841/phrase-structure-identification-and-classification-of-sentences-using-deep-learning/hashi-haris
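The IOB labelling that the abstract mentions pairs each token with a tag, and the phrase structure is recovered by decoding those tags back into chunks. A minimal decoder for that step is sketched below; the function name and the toy English sentence are illustrative, not taken from the paper.

```python
def iob_to_chunks(tokens, tags):
    """Recover (label, start, end) phrase spans from IOB tags (end exclusive)."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:            # close the chunk in progress
                chunks.append((label, start, i))
            start, label = i, tag[2:]        # open a new chunk
        elif tag == "O":
            if label is not None:
                chunks.append((label, start, i))
            start, label = None, None
    if label is not None:                    # close a chunk ending at the last token
        chunks.append((label, start, len(tags)))
    return chunks

tokens = ["The", "quick", "fox", "jumps", "over", "the", "dog"]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
chunks = iob_to_chunks(tokens, tags)
```

This decoding is what turns the sequence classifier's per-token output into the phrasal constituents the task ultimately needs.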
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le... (ijtsrd)
Natural language generation (NLG) is one of the major fields of natural language processing (NLP); NLG produces natural language from a machine representation. Generating suggestions for a sentence, especially in Indian languages, is difficult: one of the major reasons is that these languages are morphologically rich and their word order is roughly the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, we can generate a set of possible corrections for the erroneous part of a sentence. To effectively generate sentences with meaning equivalent to the original sentence, the deep learning (DL) approach is to train a model on this task, which requires thousands of example inputs and outputs. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormously, so it is essential to extract events from it intelligently. Advances in natural language processing (NLP) enable information extraction that locates and classifies content in news data according to predefined categories such as person name, place name, organization name, date, time, etc. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level performance on specific documents such as newspapers by effectively identifying entities. The purpose of this research is to introduce an NER system for Bengali news data that identifies mentions of specified things in running text based on regular expressions and Bengali grammar. In doing so, I have designed and evaluated part-of-speech (POS) tags to recognize proper nouns, and I explain a Hidden Markov Model (HMM) based approach for developing an NER system from Bengali news data.
A prior case study of natural language processing on different domain (IJECEIAES)
In the present digital world, computers do not understand humans' ordinary language, which is a great barrier between humans and digital systems. Hence, researchers have developed technology that delivers information to users from digital machines: natural language processing (NLP), a branch of AI with significant implications for the ways computers and humans can interact. NLP has become an essential technology for bridging the communication gap between humans and digital data. This study presents the necessity of NLP in the current computing world along with different approaches and their applications, and highlights the key challenges in developing new NLP models.
Sentiment Analysis in Myanmar Language Using Convolutional LSTM Neural Network (kevig)
In recent years, there has been increasing use of social media among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has become popular as well. Moreover, most people look for review pages about a product before deciding whether to buy it. Extracting useful reviews about products of interest is important but time-consuming, and sentiment analysis is one of the key processes for doing so. In this paper, a convolutional LSTM neural network architecture is proposed for sentiment classification of cosmetic reviews written in the Myanmar language. The paper also builds a cosmetic review dataset for deep learning and a sentiment lexicon for the Myanmar language.
Chunking means splitting a sentence into tokens and then grouping them in a meaningful way. For high-performance chunking systems, transformer models have set the state-of-the-art benchmarks. Chunking as a task requires a large-scale, high-quality annotated corpus in which each token carries a tag, much as in named entity recognition tasks; these tags are then used with pointer frameworks to find the final chunk. For a specific domain, manually annotating a large, high-quality training set is highly costly in time and resources, and when the domain is specific and diverse, cold-starting becomes even harder because of the large number of manually annotated queries needed to cover all aspects. To overcome this, we applied a grammar-based text generation mechanism: instead of annotating sentences, we annotate grammar templates. We defined various templates corresponding to different grammar rules and created sentences from these templates and rules, with symbol and terminal values chosen from the domain data catalog. This let us create a large number of annotated queries, which were used to train an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation was useful for identifying domain-based chunks in input query sentences without any manual annotation, achieving a token classification F1 score of 96.97% on out-of-template queries.
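The template-expansion idea above can be sketched in a few lines: each template mixes fixed words with tagged slots, and filling the slots from a domain catalog yields sentences that arrive pre-annotated. The templates, catalog entries, and tag names below are hypothetical, chosen only to illustrate the mechanism, not the paper's actual grammar or data.

```python
import itertools

# Hypothetical templates: plain strings are literal tokens tagged "O",
# one-element tuples are slots whose fillers inherit the slot's tag.
templates = [
    ["show", "me", ("BRAND",), ("CATEGORY",)],   # e.g. "show me acme shoes"
    [("CATEGORY",), "under", ("PRICE",)],        # e.g. "shoes under 50"
]
catalog = {"BRAND": ["acme"], "CATEGORY": ["shoes", "bags"], "PRICE": ["50"]}

def expand(template):
    """Yield (tokens, tags) pairs for every catalog combination of the slots."""
    slots = [part[0] for part in template if isinstance(part, tuple)]
    for values in itertools.product(*(catalog[s] for s in slots)):
        fillers = iter(values)
        tokens, tags = [], []
        for part in template:
            if isinstance(part, tuple):
                tokens.append(next(fillers))
                tags.append(part[0])
            else:
                tokens.append(part)
                tags.append("O")
        yield tokens, tags

generated = [pair for t in templates for pair in expand(t)]
```

A handful of templates and a modest catalog multiply into a large annotated training set, which is the cold-start advantage the abstract describes.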
Natural Language Processing Theory, Applications and Difficulties (ijtsrd)
The promise of a powerful computing device that helps people in productivity as well as in recreation can only be realized with proper human-machine communication. Automatic recognition and understanding of spoken language is the first step toward natural human-machine interaction. Research in this field has produced remarkable results, leading to many exciting expectations and new challenges; the field is known as natural language processing. In this paper, natural language generation and natural language understanding are discussed, along with difficulties in NLU, applications, and a comparison with structured programming languages. Mrs. Anjali Gharat | Mrs. Helina Tandel | Mr. Ketan Bagade, "Natural Language Processing Theory, Applications and Difficulties", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-6, October 2019, URL: https://www.ijtsrd.com/papers/ijtsrd28092.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/28092/natural-language-processing-theory-applications-and-difficulties/mrs-anjali-gharat
Source side pre-ordering using recurrent neural networks for English-Myanmar ... (IJECEIAES)
Word reordering remains one of the challenging problems in machine translation when translating between language pairs with different word orders, e.g., English and Myanmar. Without reordering between these languages, a source sentence may be translated directly in a similar word order, and the translation may not be meaningful. Myanmar is a subject-object-verb (SOV) language, so effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order the words of a source sentence into the target language's word order. This neural pre-ordering model is automatically derived from parallel word-aligned data, with syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary, possibly non-local permutations over the sentence and can be integrated into English-Myanmar machine translation. We used the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements comparable to a baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.
Named Entity Recognition using Hidden Markov Model (HMM) (kevig)
Named entity recognition (NER) is a subtask of natural language processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, time, etc. In this paper we describe in detail a Hidden Markov Model (HMM) based machine learning approach to identifying named entities. The main idea behind using an HMM for building an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic in nature, and one can choose them according to one's interest. The corpus used by our NER system is also not domain specific.
Extractive Summarization with Very Deep Pretrained Language Model (gerogepatton)
Recent development of generative pretrained language models has proven very successful on a wide range of NLP tasks, such as text classification, question answering, textual entailment, and so on. In this work, we present a two-phase encoder-decoder architecture based on Bidirectional Encoder Representations from Transformers (BERT) for the extractive summarization task. We evaluated our model with both automatic metrics and human annotators, and demonstrated that the architecture achieves results comparable to the state of the art on a large-scale corpus, CNN/Daily Mail. To the best of our knowledge, this is the first work that applies a BERT-based architecture to a text summarization task and achieves results comparable to the state of the art.
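Extractive summarization, as used above, reduces to scoring sentences and keeping the best ones in document order. The sketch below illustrates only that selection step; a crude word-frequency scorer (with no punctuation stripping) stands in for the BERT encoder, and the document and function name are invented for illustration.

```python
from collections import Counter

def extract_summary(sentences, k=1):
    """Keep the top-k sentences by average word frequency, in document order."""
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)                       # corpus-level word counts

    def score(s):
        toks = [w.lower() for w in s.split()]
        return sum(freq[w] for w in toks) / len(toks)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])                   # restore document order
    return [sentences[i] for i in keep]

doc = ["The cat sat on the mat.",
       "Quantum flux capacitors are unrelated.",
       "The cat likes the mat."]
summary = extract_summary(doc, k=1)
```

In a BERT-based system the `score` function would come from the pretrained encoder; the rank-then-select scaffolding around it stays the same.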
An Improved Approach for Word Ambiguity Removal (Waqas Tariq)
Word ambiguity removal is the task of removing ambiguity from a word, i.e., identifying the correct sense of a word in an ambiguous sentence. This paper describes a model that uses a part-of-speech tagger and three categories for word sense disambiguation (WSD). Improving the interaction between users and computers is essential for human-computer interaction, and for this purpose supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the best-suited domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA... (kevig)
Many natural language processing (NLP) applications involve named entity recognition (NER) as an important task, since it improves the overall performance of those applications. In this paper deep learning techniques are used to perform the NER task on Hindi text, as Hindi NER has received much less attention than English NER; this is a barrier for resource-scarce languages, for which many resources are not readily available. Researchers have used various techniques, such as rule-based, machine learning based, and hybrid approaches, to solve this problem, and deep learning based algorithms are now being developed at large scale as an innovative approach for advanced NER models that yield the best results. In this paper we devise a novel architecture based on residual networks for a bidirectional Long Short-Term Memory (BiLSTM) model with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, whose NER tags are defined in the annotated corpora used. Developing an NER system for Indian languages is a comparatively difficult task. We have run experiments comparing NER results with normal embedding and fastText embedding layers, analysing the performance of the word embeddings with different batch sizes for training the deep learning models, and we present state-of-the-art results with this approach in terms of F1 score.
Duration for Classification and Regression Tree for Marathi Text-to-Speech Syn... (IJERA Editor)
This research paper reports preliminary results of data-driven modeling of segmental phoneme duration for Marathi. Classification and regression tree (CART) based data-driven duration modeling for segmental duration prediction is presented. A number of features are considered, and their usefulness and relative contribution to segmental duration prediction are assessed. Objective evaluation of the duration model is performed using root mean squared prediction error and the correlation between actual and predicted durations.
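The two objective metrics named above, root mean squared prediction error and the correlation between actual and predicted durations, are straightforward to compute. The sketch below implements both; the duration values are made-up milliseconds, used only to exercise the formulas.

```python
import math

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def pearson(actual, predicted):
    """Pearson correlation between actual and predicted durations."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Segment durations in milliseconds (illustrative numbers only).
actual = [80.0, 120.0, 100.0, 60.0]
predicted = [90.0, 110.0, 100.0, 70.0]
```

A good duration model drives RMSE toward zero and the correlation toward one; reporting both guards against a model that tracks the trend but is systematically offset, or vice versa.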
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE (Journal For Research)
Natural language processing (NLP) techniques are among the most used techniques in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and everything is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was put through the Turing test to check similarity between data, but was limited to small data sets; later, various algorithms were developed along with concepts from AI (artificial intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of these techniques on different parameters.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech, which requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also cover the different sound sequences so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu; phrases consisting of several words (tokens) in English map onto a single word in Telugu. Telugu is phonetic in nature in addition to being morphologically rich, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a much reduced corpus size, and it adapts Telugu words into the database dynamically, resulting in growth of the corpus.
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH... (ijnlc)
Word sense disambiguation (WSD) identifies the meaning of a word in a given context, a tricky task in natural language processing that is needed in applications such as machine translation, information extraction and retrieval, and open- or closed-domain question answering, because of their dependence on semantics. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have also been developed, with limited success, because creating the training corpus, a sense-tagged corpus, is difficult. This paper presents a hybrid approach to resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
A survey on phrase structure learning methods for text classification (ijnlc)
Text classification is the task of automatically classifying text into one of a set of predefined categories. The problem has been widely studied in different communities, such as natural language processing, data mining, and information retrieval. Text classification is an important constituent of many information management tasks, such as topic identification, spam filtering, email routing, language identification, genre classification, and readability assessment. The performance of text classification improves notably when phrase patterns are used: phrase patterns help capture non-local behaviour and thus improve the text classification task. Phrase structure extraction is the first step toward phrase pattern identification. In this survey, a detailed study of phrase structure learning methods has been carried out. This will enable future work in several NLP tasks that use syntactic information from phrase structure, such as grammar checking, question answering, information extraction, machine translation, and text classification. The paper also provides different levels of classification and a detailed comparison of the phrase structure learning methods.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and this information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching word images in Devanagari script: shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Improving accuracy of part-of-speech (POS) tagging using hidden markov model ... (IJECEIAES)
In natural language processing (NLP), word segmentation and part-of-speech (POS) tagging are fundamental tasks. POS information is also necessary in NLP preprocessing for applications such as machine translation (MT) and information retrieval (IR). Currently, many research efforts develop word segmentation and POS tagging separately, with different methods, to obtain high performance and accuracy. For the Myanmar language, there are also separate word segmenters and POS taggers based on statistical approaches such as neural networks (NNs) and Hidden Markov Models (HMMs). But because of the Myanmar language's complex morphological structure, the out-of-vocabulary (OOV) problem persists. To avoid errors and improve segmentation by utilizing POS data, segmentation and labeling should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove ambiguity in sentences due to language structure. This paper focuses on developing word segmentation and POS tagging for the Myanmar language, and presents a comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free-phrase-order and a complex
morphological system. Function tagging is a pre-processing step for parsing. In the task of function
tagging, we use the functionally annotated corpus and tag Myanmar sentences with correct segmentation,
POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply
context free grammar (CFG) to find out the parse tree of function tagged Myanmar sentences. Experiments
show that our analysis achieves a good result with parsing of simple sentences and three types of complex
sentences.
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKijnlc
In recent years, there has been increasing use of social media among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has also become popular. Moreover, most people look for review pages about a product before deciding whether to buy it. Extracting useful reviews about products of interest is important and time-consuming. Sentiment analysis is one of the key processes for extracting useful reviews of products. In this paper, a Convolutional LSTM neural network architecture is proposed to analyse sentiment classification of cosmetic reviews written in the Myanmar language. The paper also intends to build a cosmetic-review dataset for deep learning and a sentiment lexicon for the Myanmar language.
Chunking means splitting sentences into tokens and then grouping them in a meaningful way. For high-performance chunking systems, transformer models have proved to be the state-of-the-art benchmarks. Chunking requires a large-scale, high-quality annotated corpus in which each token carries a tag, similar to Named Entity Recognition tasks; these tags are then used in conjunction with pointer frameworks to find the final chunk. For a specific domain problem, manually annotating a large, high-quality training set becomes highly costly in time and resources. When the domain is specific and diverse, cold-starting becomes even more difficult because of the large number of manually annotated queries expected to cover all aspects. To overcome this, we applied a grammar-based text generation mechanism: instead of annotating sentences, we annotate grammar templates. We defined various templates corresponding to different grammar rules, and created sentences from these templates by choosing symbol or terminal values from the domain data catalog. This let us create a large number of annotated queries, which were used to train an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation was useful for solving domain-based chunking of input query sentences without any manual annotation, achieving a token-classification F1 score of 96.97% on out-of-template queries.
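The template-expansion mechanism described above can be sketched roughly as follows; the templates, catalog entries, and chunk tags here are invented for illustration and are not the authors' actual domain data:

```python
import itertools

# Hypothetical grammar templates: slot symbols in angle brackets are filled
# from a domain catalog, and each filled slot inherits the slot's chunk tag.
templates = [
    ("show me <product> in <color>", {"<product>": "PRODUCT", "<color>": "COLOR"}),
    ("find <color> <product>", {"<product>": "PRODUCT", "<color>": "COLOR"}),
]
catalog = {"<product>": ["shoes", "jackets"], "<color>": ["red", "blue"]}

def expand(template, tags):
    """Yield (tokens, labels) for every combination of catalog values."""
    slots = [tok for tok in template.split() if tok in catalog]
    for values in itertools.product(*(catalog[s] for s in slots)):
        filled = iter(values)
        tokens, labels = [], []
        for tok in template.split():
            if tok in catalog:
                tokens.append(next(filled))   # terminal value from the catalog
                labels.append(tags[tok])      # slot's chunk tag
            else:
                tokens.append(tok)
                labels.append("O")            # outside any chunk
        yield tokens, labels

# Two templates, each with two slots of two values each -> 8 annotated queries.
annotated = [pair for tmpl, tags in templates for pair in expand(tmpl, tags)]
```

Each yielded pair is a token sequence with aligned chunk labels, ready to feed a sequence tagger, which is how a large annotated training set can be produced without manual labeling.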
Natural Language Processing Theory, Applications and Difficultiesijtsrd
The promise of a powerful computing device to help people in productivity as well as recreation can only be realized with proper human-machine communication. Automatic recognition and understanding of spoken language is the first step toward natural human-machine interaction. Research in this field has produced remarkable results, leading to many exciting expectations and new challenges; this field is known as Natural Language Processing. In this paper, natural language generation and natural language understanding are discussed, along with difficulties in NLU, applications, and a comparison with structured programming languages. Mrs. Anjali Gharat, Mrs. Helina Tandel, and Mr. Ketan Bagade, "Natural Language Processing Theory, Applications and Difficulties", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 3, Issue 6, October 2019. URL: https://www.ijtsrd.com/papers/ijtsrd28092.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/28092/natural-language-processing-theory-applications-and-difficulties/mrs-anjali-gharat
Source side pre-ordering using recurrent neural networks for English-Myanmar ...IJECEIAES
Word reordering has remained one of the challenging problems for machine translation when translating between language pairs with different word orders, e.g. English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with similar word order and the translation may not be meaningful. Myanmar is a subject-object-verb (SOV) language, and effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order words of the source sentence into the target language's word order. This neural pre-ordering model is automatically derived from parallel word-aligned data with syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary permutations that may be non-local in the sentence and can be combined into English-Myanmar machine translation. We exploited the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements comparable to a baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.
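As a deliberately simplified illustration of pre-ordering (a rule-based toy, not the paper's recurrent neural model), the sketch below permutes role-labelled tokens of an SVO clause into SOV order; the role labels are assumed inputs, e.g. from a dependency parse:

```python
# Target position of each grammatical role in SOV (Myanmar-like) order;
# roles not listed keep a middle position, and the stable sort preserves
# their original relative order.
SOV_ORDER = {"subj": 0, "obj": 1, "verb": 2}

def preorder(tagged_tokens):
    """Reorder (token, role) pairs into subject-object-verb order."""
    return [tok for tok, role in
            sorted(tagged_tokens, key=lambda p: SOV_ORDER.get(p[1], 1))]

clause = [("she", "subj"), ("reads", "verb"), ("books", "obj")]
reordered = preorder(clause)  # SVO input becomes SOV-like order
```

A neural pre-ordering model replaces the fixed `SOV_ORDER` table with permutations learned from word-aligned parallel data, but the output contract is the same: a permutation of the source tokens.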
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe a Hidden Markov Model (HMM) based machine learning approach in detail to identify named entities. The main idea behind using an HMM for building an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic in nature and can be chosen according to the user's interest. The corpus used by our NER system is also not domain specific.
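A minimal sketch of the HMM decoding such a system performs, using invented toy probabilities and tag states rather than a trained model, might look like:

```python
import math

# Toy HMM: states, start, transition, and emission probabilities are all
# hypothetical illustration values, not estimated from a corpus.
states = ["O", "PER", "LOC"]
start = {"O": 0.8, "PER": 0.1, "LOC": 0.1}
trans = {"O": {"O": 0.8, "PER": 0.1, "LOC": 0.1},
         "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
         "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2}}
emit = {"O": {"visited": 0.5, "the": 0.5},
        "PER": {"john": 1.0},
        "LOC": {"paris": 1.0}}

def viterbi(words):
    """Return the most probable tag sequence (log-space Viterbi decoding)."""
    def e(s, w):                      # emission with crude smoothing
        return emit[s].get(w, 1e-6)
    v = [{s: (math.log(start[s] * e(s, words[0])), [s]) for s in states}]
    for w in words[1:]:
        layer = {}
        for s in states:
            score, path = max(
                (v[-1][p][0] + math.log(trans[p][s] * e(s, w)), v[-1][p][1])
                for p in states)
            layer[s] = (score, path + [s])
        v.append(layer)
    return max(v[-1].values(), key=lambda t: t[0])[1]

tags = viterbi(["john", "visited", "paris"])
```

Language independence comes from the fact that nothing above is tied to a particular script or grammar: only the state set and the probability tables change per language or domain.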
Extractive Summarization with Very Deep Pretrained Language Modelgerogepatton
Recent development of generative pretrained language models has proven very successful on a wide range of NLP tasks, such as text classification, question answering, and textual entailment. In this work, we present a two-phase encoder-decoder architecture based on Bidirectional Encoder Representations from Transformers (BERT) for the extractive summarization task. We evaluated our model with both automatic metrics and human annotators, and demonstrated that the architecture achieves state-of-the-art-comparable results on a large-scale corpus, CNN/Daily Mail. To the best of our knowledge, this is the first work that applies a BERT-based architecture to a text summarization task and achieves state-of-the-art-comparable results.
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is the task of removing ambiguity from a word, i.e., identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a part-of-speech tagger and three categories for word sense disambiguation (WSD). Human-computer interaction greatly needs improved interactions between users and computers; for this, supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the best-suited domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of NLP applications. In this paper deep learning techniques are used to perform the NER task on Hindi text, since compared to English NER, Hindi NER has not been sufficiently explored. This is a barrier for resource-scarce languages, as many resources are not readily available. Many researchers use techniques such as rule-based, machine-learning-based, and hybrid approaches to solve this problem. Deep-learning-based algorithms are now being developed at large scale as an innovative approach for advanced NER models that yield the best results. In this paper we devise a novel architecture based on a residual network, preferably Bidirectional Long Short-Term Memory (BiLSTM) with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We have performed various experiments comparing the results of NER with normal embedding and fastText embedding layers, and analysing the performance of word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with this approach in terms of F1 score.
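The fastText-style subword embeddings the paper builds on represent a word as the average of its character n-gram vectors, so even out-of-vocabulary names still get a vector. A minimal sketch follows; the n-gram vectors are hypothetical toy values, and real fastText uses learned, hashed n-gram buckets:

```python
# Break a word into boundary-padded character trigrams, fastText style.
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Compose a word vector as the mean of its known n-gram vectors; words
# with no known n-grams fall back to a zero vector.
def word_vector(word, ngram_vecs, dim=4):
    grams = [g for g in char_ngrams(word) if g in ngram_vecs]
    if not grams:
        return [0.0] * dim
    summed = [sum(col) for col in zip(*(ngram_vecs[g] for g in grams))]
    return [x / len(grams) for x in summed]
```

This subword composition is what lets the embedding layer hand a meaningful input vector to the BiLSTM even for unseen Hindi names.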
Duration for Classification and Regression Tree for Marathi Text-to-Speech Syn...IJERA Editor
This research paper reports preliminary results of data-driven modeling of segmental phoneme duration for
Marathi. Classification and Regression Tree based data driven duration modeling for segmental duration
prediction is presented. A number of features are considered and their usefulness and relative contribution for
segmental duration prediction is assessed. Objective evaluation of the duration model, by root mean squared
prediction error and correlation between actual and predicted durations, is performed.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are among the most used techniques in the field of computer applications, and the field has become vast and advanced. Language is the means of communication or interaction among humans, and in the present scenario, when everything is dependent on machines or computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics, was passed through the Turing test to check similarity between data, but was limited to small data sets. Later, various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques on different parameters.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large corpus of speech. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcription of the speech utterances. The transcriptions must also include different speech orders so that the recognizer can build models for all the sounds present. But for the Telugu language, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or millions of word forms. A significant part of grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu; phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. The Telugu language is phonetic in nature in addition to being rich in morphology. That is why speech technology developed for English cannot be applied directly to the Telugu language. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a much reduced corpus size. It also adapts Telugu words to the database dynamically, resulting in growth of the corpus.
A HYBRID APPROACH TO WORD SENSE DISAMBIGUATION WITH AND WITH...ijnlc
Word Sense Disambiguation is a classification of the meaning of a word in a precise context, which is a tricky task to perform in Natural Language Processing; it is used in applications like machine translation, information extraction and retrieval, and automatic or closed-domain question answering systems, by reason of its semantic perceptiveness. Researchers have tried unsupervised and knowledge-based learning approaches, however such approaches have not proved very helpful. Various supervised learning algorithms have been made, but in vain, as the attempt of creating the training corpus, which is a tagged sense-marked corpus, is tricky. This paper presents a hybrid approach for resolving ambiguity in a sentence which is based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
A survey on phrase structure learning methods for text classificationijnlc
Text classification is a task of automatic classification of text into one of the predefined categories. The
problem of text classification has been widely studied in different communities like natural language
processing, data mining and information retrieval. Text classification is an important constituent in many
information management tasks like topic identification, spam filtering, email routing, language
identification, genre classification, readability assessment etc. The performance of text classification
improves notably when phrase patterns are used. The use of phrase patterns helps in capturing non-local
behaviours and thus helps in the improvement of text classification task. Phrase structure extraction is the
first step toward phrase pattern identification. In this survey, a detailed study of phrase structure
learning methods has been carried out. This will enable future work in several NLP tasks that use
syntactic information from phrase structure like grammar checkers, question answering, information
extraction, machine translation, text classification. The paper also provides different levels of classification
and detailed comparison of the phrase structure learning methods.
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...kevig
Neural machine translation is a new approach to machine translation that has shown effective results
for high-resource languages. Recently, attention-based neural machine translation with a large-scale
parallel corpus has played an important role in achieving high-performance translation results. In this
research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural
machine translation models are introduced at word-to-word, character-to-word, and
syllable-to-word levels. We conduct experiments with the proposed models to translate long sentences and to
address morphological problems. To reduce the low-resource problem, source-side monolingual data are
also used. Thus, this work investigates improving the Myanmar-to-English neural machine translation system.
The experimental results show that the syllable-to-word level neural machine translation model obtains an
improvement over the baseline systems.
Artificial Neural Networks have proved their efficiency in a large number of research domains. In this paper, we have applied Artificial Neural Networks to Arabic text for language modeling, text generation, and missing-text prediction. On the one hand, we have adapted Recurrent Neural Network architectures to model the Arabic language in order to generate correct Arabic sequences. On the other hand, Convolutional Neural Networks have been parameterized, based on some specific features of Arabic, to predict missing text in Arabic documents. We have demonstrated the power of our adapted models in generating and predicting correct Arabic text compared to the standard model. The models were trained and tested on well-known free Arabic datasets. Results have been promising, with sufficient accuracy.
NLP Techniques for Text Generation.docxKevinSims18
Natural Language Processing (NLP) techniques are a subset of artificial intelligence (AI) that deals with the interactions between computers and human language. Text generation is an important application of NLP that involves the automatic creation of human-like text. This blog post will explore some of the NLP techniques used for text generation.
Eat it, Review it: A New Approach for Review Predictionvivatechijri
Deep Learning has achieved significant improvement in various machine learning tasks. Nowadays,
Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) have been growing in popularity for
text sequence tasks, i.e. word prediction. The ability to abstract information from images or text is being widely
adopted by organizations around the world. A basic task in deep learning is classification, be it image or text.
Current trending techniques such as RNN and CNN have proven that such techniques open the door for data analysis.
Emerging technologies such as Region CNN and Recurrent CNN have been under consideration for such analysis,
and Recurrent CNN is under active development. The proposed system uses a Recurrent Neural
Network for review prediction, with LSTM used alongside the RNN to predict long sentences. This system
focuses on context-based review prediction and provides full-length sentences. This will help users write
proper reviews by understanding their context.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance on capturing word similarities with existing benchmark datasets of word-pair similarities. The paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by the different word embedding methods.
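An intrinsic evaluation of this kind boils down to comparing model similarities with human judgments; a minimal sketch (cosine similarity plus Spearman rank correlation, assuming no tied ranks) could be:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation between two score lists (no ties assumed)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Correlate human word-pair similarity judgments with embedding cosines
# (toy numbers for illustration).
human = [9.2, 7.1, 1.3]
model = [0.91, 0.75, 0.10]
agreement = spearman(human, model)
```

A higher rank correlation on a benchmark word-pair dataset indicates an embedding method whose geometry better matches human similarity judgments.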
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...ijnlc
Researchers of many nations have developed automatic speech recognition (ASR) to show their national progress in information and communication technology for their languages. This work intends to improve ASR performance for the Myanmar language by changing different Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. CNN has the ability to reduce spectral variations and model spectral correlations that exist in the signal, due to its locality and pooling operations. Therefore, the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used as training data and the ASR performance was evaluated on two open test sets: web news and recorded data. As the Myanmar language is a syllable-timed language, a syllable-based ASR was built and compared with a word-based ASR. As a result, it achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
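The WER and SER figures above are conventionally computed as the edit distance between reference and hypothesis divided by reference length; a minimal word-level sketch (SER is the same computation over syllables):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

A WER of 16.7% thus means that, on average, roughly one in six reference words requires a substitution, insertion, or deletion to match the recognizer output.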
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI...kevig
Researchers of many nations have developed automatic speech recognition (ASR) to show their national
improvement in information and communication technology for their languages. This work intends to
improve the ASR performance for Myanmar language by changing different Convolutional Neural Network
(CNN) hyperparameters such as number of feature maps and pooling size. CNN has the abilities of
reducing in spectral variations and modeling spectral correlations that exist in the signal due to the locality
and pooling operation. Therefore, the impact of the hyperparameters on CNN accuracy in ASR tasks is
investigated. A 42-hr-data set is used as training data and the ASR performance was evaluated on two open
test sets: web news and recorded data. As Myanmar language is a syllable-timed language, ASR based on
syllable was built and compared with ASR based on word. As the result, it gained 16.7% word error rate
(WER) and 11.5% syllable error rate (SER) on TestSet1. And it also achieved 21.83% WER and 15.76%
SER on TestSet2.
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Nischal Lal Shrestha
This study investigates the feasibility of identifying gender via personal names written in the Devanagari
script. While extensive research exists for languages like English, Russian, and Brazilian, little attention
has been given to low-resource language like Nepali. This study addresses this gap, aiming to enhance
gender classification in Nepali names. We compare traditional machine learning algorithms along
with advanced contextualized transformer architectures like variants of BERT (Bidirectional Encoder
Representations from Transformers) specifically fine-tuned for detecting gender among individuals with
Nepali names. Our experiments aim at exploring how effectively these techniques capture linguistic
nuances unique to the Nepali language and assess their overall performance in terms of accuracy,
precision, recall, and F1 score metrics. Both experiments reveal superior performance by the BERT
variants DeBERTa and DistilBERT, demonstrating the merits of advanced NLP techniques for gender
classification in Nepali. The study also show that the additional encoding of name endings (last
character of a name) improve the performance of traditional ML algorithms by 10%. Through this
work, we contribute to gender classification methodologies tailored specifically for the linguistic nuances
of the Nepali language.
Arabic named entity recognition using deep learning approachIJECEIAES
Most of the Arabic Named Entity Recognition (NER) systems depend massively on external resources and handmade feature engineering to achieve state-of-the-art results. To overcome such limitations, we proposed, in this paper, to use deep learning approach to tackle the Arabic NER task. We introduced a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF) and experimented with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model gets two sources of information about words as input: pre-trained word embeddings and character-based representations and eliminated the need for any task-specific knowledge or feature engineering. We obtained state-of-the-art result on the standard ANERcorp corpus with an F1 score of 90.6%.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch
of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis,
natural language understanding, Information Extraction, Information retrieval, question answering etc.
The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based
approach of machine learning in detail to identify the named entities. The main idea behind the use of
HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can
use it according to their interest. The corpus used by our NER system is also not domain specific.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific.
Similar to SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE (20)
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSEDuvanRamosGarzon1
AIRCRAFT GENERAL
The Single Aisle is the most advanced family aircraft in service today, with fly-by-wire flight controls.
The A318, A319, A320 and A321 are twin-engine subsonic medium range aircraft.
The family offers a choice of engines
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Courier management system project report.pdfKamal Acharya
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the student’s details, driver’s details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Quality defects in TMT Bars, Possible causes and Potential Solutions.PrashantGoswami42
Maintaining high-quality standards in the production of TMT bars is crucial for ensuring structural integrity in construction. Addressing common defects through careful monitoring, standardized processes, and advanced technology can significantly improve the quality of TMT bars. Continuous training and adherence to quality control measures will also play a pivotal role in minimizing these defects.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Event Management System Vb Net Project Report.pdfKamal Acharya
In present era, the scopes of information technology growing with a very fast .We do not see any are untouched from this industry. The scope of information technology has become wider includes: Business and industry. Household Business, Communication, Education, Entertainment, Science, Medicine, Engineering, Distance Learning, Weather Forecasting. Carrier Searching and so on.
My project named “Event Management System” is software that store and maintained all events coordinated in college. It also helpful to print related reports. My project will help to record the events coordinated by faculties with their Name, Event subject, date & details in an efficient & effective ways.
In my system we have to make a system by which a user can record all events coordinated by a particular faculty. In our proposed system some more featured are added which differs it from the existing system such as security.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologist’s survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Automobile Management System Project Report.pdfKamal Acharya
The proposed project is developed to manage the automobile in the automobile dealer company. The main module in this project is login, automobile management, customer management, sales, complaints and reports. The first module is the login. The automobile showroom owner should login to the project for usage. The username and password are verified and if it is correct, next form opens. If the username and password are not correct, it shows the error message.
When a customer search for a automobile, if the automobile is available, they will be taken to a page that shows the details of the automobile including automobile name, automobile ID, quantity, price etc. “Automobile Management System” is useful for maintaining automobiles, customers effectively and hence helps for establishing good relation between customer and automobile organization. It contains various customized modules for effectively maintaining automobiles and stock information accurately and safely.
When the automobile is sold to the customer, stock will be reduced automatically. When a new purchase is made, stock will be increased automatically. While selecting automobiles for sale, the proposed software will automatically check for total number of available stock of that particular item, if the total stock of that particular item is less than 5, software will notify the user to purchase the particular item.
Also when the user tries to sale items which are not in stock, the system will prompt the user that the stock is not enough. Customers of this system can search for a automobile; can purchase a automobile easily by selecting fast. On the other hand the stock of automobiles can be maintained perfectly by the automobile shop manager overcoming the drawbacks of existing system.
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE
International Journal on Natural Language Computing (IJNLC) Vol.8, No.1, February 2019
DOI: 10.5121/ijnlc.2019.8101
SYLLABLE-BASED NEURAL NAMED ENTITY
RECOGNITION FOR MYANMAR LANGUAGE
Hsu Myat Mo and Khin Mar Soe
Natural Language Processing Lab., University of Computer Studies,
Yangon, Myanmar
ABSTRACT
Named Entity Recognition (NER) for Myanmar language is essential to Myanmar natural language
processing research. In this work, NER for Myanmar language is treated as a sequence tagging
problem, and the effectiveness of deep neural networks on NER for Myanmar language is
investigated. Experiments are performed by applying deep neural network architectures to
syllable-level Myanmar text. The very first manually annotated NER corpus for Myanmar language is
also constructed and presented. In developing our in-house NER corpus, sentences from online news
websites as well as sentences from the ALT-Parallel-Corpus are used. This ALT corpus is one part
of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first
evaluation of neural network models on the NER task for Myanmar language. The experimental results
show that these neural sequence models can produce promising results compared to the baseline CRF
model. Among the neural architectures, the bidirectional LSTM network with a CRF layer on top
gives the highest F-score. This work also aims to discover the effectiveness of neural network
approaches to Myanmar textual processing and to promote further research on this understudied language.
KEYWORDS
Bidirectional LSTM_CRF, Myanmar Language, Named Entity Recognition, Neural architectures, Syllable-based
1. INTRODUCTION
Named Entity Recognition is the process of automatically identifying and labeling
different named entities (NEs) in text according to predefined sets of NE categories
(e.g., person, organization, location, time, and so on). NER has been a challenging problem in
Myanmar language processing. Currently, Myanmar NLP is at an initial stage and the available
lexical resources are very limited. At the same time, the language has complex and rich
morphology as well as ambiguity, which can lead to wrong word segmentation. Current Myanmar-
English machine translation systems cannot recognize names written in Myanmar language
properly. Moreover, the most important motivation behind this work is that there is currently no
available NER tool that can extract named entities in Myanmar script. The previously proposed
methods for Myanmar NER adopted a dictionary look-up approach, rule-based techniques, and a
combination of a rule-based and statistical N-gram-based method together with a name database.
Previously proposed methods for NER can be roughly classified into dictionary-based, rule-based,
traditional statistical sequence labeling, and hybrid approaches [1]. Those approaches
require linguistic knowledge and feature engineering.
Neural network models have the advantage of minimizing the effort spent on feature engineering,
since deep layers of neural networks can discover features relevant to the task. A general neural
network architecture for sequence labeling tasks was developed and proposed in [2]. Following this
work, Recurrent Neural Networks (RNN-based networks) have been designed for modeling sequential
data and have proved to be quite efficient for NER as a sequence tagging task. As a special
kind of RNN, the Long Short-Term Memory (LSTM) neural network has been verified to be
powerful in modeling sequential data. Moreover, the bidirectional LSTM network, derived from the
LSTM network, introduces two independent layers to accumulate contextual information from past
and future histories. The bidirectional LSTM network has the advantage of memorizing information
over long spans in both directions, which brings great improvements in linguistic computation.
In [3], the authors combined a bidirectional LSTM network and a Conditional Random Field (CRF)
to form a BiLSTM_CRF network. This kind of model can use past input features, sentence-level
tag information, and future input features. With the rapid development of deep learning, recent
research has revealed that these kinds of networks significantly outperform statistical algorithms.
In this work, NER is considered as a sequence labeling task and experiments are carried out on
syllable-level data. In order to explore the effectiveness of deep learning on Myanmar language,
experiments are conducted using a bidirectional LSTM network with a CRF inference layer added
on top. We will refer to that model as BiLSTM_CRF in the following sections. In our experiments,
we first encode syllable-level information by applying various neural networks, such as the
Convolutional Neural Network (CNN) and the bidirectional LSTM network, to characters, and then
concatenate the character- and syllable-level representations. The combined representation is fed
into a bidirectional LSTM network to learn the context information of each syllable. On top of the
bidirectional LSTM, a CRF inference layer is jointly added to decode the optimal labels.
Besides, we also performed the neural experiments with a softmax layer on top; that kind of
model will be referred to as BiLSTM in the following sections. Experiments are carried out with
different parameters and network settings. Moreover, we investigate which network
(CNN or bidirectional LSTM) is more powerful in learning character representations.
Experiments with these neural networks, without additional feature engineering, achieved
promising results compared to the baseline CRF model. This study is the very first work to apply
neural networks to Myanmar NER.
2. RELATED WORKS
As far as being aware and up to our knowledge, no work has been published for using deep neural
networks on NER for Myanmar language. Previous effort on Myanmar NER had been done by
rule-based and statistical research works. A method for Myanmar Named Entity Identification
using a hybrid method was proposed in [4]. Their method was a combination of rule-based and
statistical N-grams based method and name database was used as well. In [5], the authors
proposed a Myanmar Named Identification algorithm that defines the names by using some of the
POS information, NE identification rules and clues words in the left and/or the right contexts of
NEs that carry information for NE identification. Their limitation is that input sentence must be
with specified POS tags. As a weakness, sematic implication of proper names is defeasible.
Moreover, those approaches require linguistic knowledge and feature engineering. CRF-based
NER for Myanmar language can be seen in [6]. In this paper, CRF, one of the statistical
approaches had been proposed for Myanmar NER task. Moreover, the authors tried to explore on
various combination of features for this approach.
Although statistical approaches such as CRFs have been widely applied to NER tasks, those
approaches heavily rely on feature engineering. The authors of [7] introduced two neural
architectures for NER that use no features. One was based on bidirectional LSTMs and CRF, and
the other was constructed using a transition-based approach. Likewise, a neural network
architecture that automatically detects word- and character-level features, using a hybrid
bidirectional LSTM and CNN architecture to eliminate the need for most feature engineering, was
proposed in [8].
Over the past few years, various deep neural networks have been applied to NER in different
languages, e.g., Italian neural NER [9], Mongolian neural NER [10], Japanese neural NER [11]
and Russian NER [12]. Likewise, deep neural architectures have also been applied to
NER in different domains, e.g., neural NER for medical entities in Twitter [13], NER for
Twitter messages [14], biomedical neural NER [15] and disease NER [16]. These works revealed
that neural networks have great capability for NER tasks and significantly outperform
statistical algorithms.
3. MYANMAR LANGUAGE
Myanmar language is the official language of the Republic of the Union of Myanmar and has
more than one thousand years' history. According to the documents, the Myanmar script was
descended from the Brahmi script of ancient South India. It belongs to the Sino-Tibetan language
family. Myanmar scripts are written in sequence from left to right in which white space may
sometimes be inserted between phrases but regular white space is not used between words or
between syllables (see Figure. 1 for the example). Words in Myanmar language are composed of
single or multiple syllables. Moreover, it can be said that a Myanmar syllable is a composition of
multiple characters. The Myanmar scripts usually have 75 characters in total and those characters
can be classified into 12 groups. For more details, you can check the paper [17].
Figure 1. Example of Myanmar writing
Words in Myanmar language are compositions of one or more syllables, and a syllable may in turn
contain one or more characters. For example, the word ‘နိုင်ငံ’ is composed of two syllables,
‘နိုင်’ and ‘ငံ’, and the syllable ‘နိုင်’ includes five Unicode characters, i.e., ‘န’, ‘ိ’, ‘ု’, ‘င’ and ‘်’
(see Figure 2). In this paper, "character" refers to the Unicode character rather than
non-standard font characters.
Figure 2. An example of a Myanmar word formation
Myanmar text also suffers from complex font problems: most Myanmar users are more
familiar with fonts that are not standard Unicode fonts. To represent the Myanmar syllable
structure in a definite way, we take the Unicode level as the character level in this work.
The formation of a Myanmar syllable is quite definite and simple. Although the constituents
can appear in different sequences, a Myanmar syllable usually consists of one initial consonant
followed by zero or more medials, zero or more vowels, and optional dependent various signs.
Independent vowels, independent various signs and digits can act as stand-alone syllables.
Moreover, a single consonant can even stand alone as a syllable. According to the Unicode
standard, vowels are stored after the consonant. The detailed syllable segmentation structure for
Myanmar language is explained in [17] and [18].
For example, the Myanmar sentence “ရန်ကုန်တွင်မိုးမရွာပါ။” contains seven syllables.
The character “။” is the sentence-ending punctuation mark. White space is inserted between the
syllables in Figure 3. The meaning of the sentence is “It doesn’t rain in Yangon”; Yangon is a
city of the Republic of the Union of Myanmar. The proposed syllable-based sequence labeling model
for Myanmar NER is trained and tested on such syllable-level input data.
Figure 3. Syllable-based segmented sentence
Myanmar language is very complex compared to English and other European languages. In addition,
language resources for Myanmar NLP research have not been well prepared until now. As an
agglutinative language, Myanmar has complex morphological structures, which can cause models to
suffer from data sparsity. Furthermore, models that treat words as the basic unit for
constructing distributed representations struggle with such morphologically rich words.
Out-of-vocabulary words must also be dealt with, and word segmentation is itself a problem for
Myanmar language: as described above, words in Myanmar are not always separated by spaces, so
word segmentation is necessary, and segmentation errors will affect NER performance. In Myanmar
language, the syllable is the smallest linguistic unit that can carry information about a word.
For these reasons, the syllable is treated as the basic unit for label tagging in all our NER
experiments.
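As an illustration of syllable-level segmentation, the rule that a new syllable begins at a base consonant, unless that consonant is killed by a following asat (U+103A) or stacked via a preceding virama (U+1039), can be sketched in a few lines. This is a simplified sketch for illustration, not the segmenter used in this work; full rule sets such as those in [17] and [18] also cover independent vowels, digits and punctuation.

```python
import re

# Simplified Myanmar syllable segmenter (illustrative sketch only).
# A break is inserted before a base consonant (U+1000-U+1021) unless it is
# killed by a following asat (U+103A) or stacked via a preceding virama
# (U+1039). Stand-alone vowels, digits and signs are not handled here.
_BREAK = re.compile(r'(?<!\u1039)([\u1000-\u1021])(?!\u103A)')

def segment_syllables(text: str) -> list:
    """Mark syllable-initial consonants with a space, then split."""
    return _BREAK.sub(r' \1', text.replace(' ', '')).split()
```

For example, `segment_syllables("နိုင်ငံ")` yields the two syllables `["နိုင်", "ငံ"]` from the word-formation example above.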
3.1. CHALLENGES OF NER FOR MYANMAR LANGUAGE
The task of automatically identifying names in Myanmar text is complex compared to other
languages for many reasons. One reason is the lack of resources such as an annotated corpus,
name lists, gazetteers or a name dictionary; that is, Myanmar is a resource-constrained
language. Besides, Myanmar has distinct characteristics, such as the absence of capitalization,
which is the main indicator of proper names in languages like English. Moreover, its relatively
free writing order makes NER a complex process. Some proper names of foreign persons and
locations are loanwords or transliterated words, so there are wide variations in some Myanmar
terms. Myanmar names also take all morphological inflections, which can lead to ambiguity, and
this ambiguity of NEs may cause problems in classifying named entities into predefined types.
In short, automatically identifying names in Myanmar text remains challenging.
4. METHODOLOGY
4.1. CONVOLUTIONAL NEURAL NETWORK (CNN)
The Convolutional Neural Network (CNN) is popular for dealing with images and has achieved
excellent results in computer vision. Lately, it has been applied to NLP research tasks, where it
has been found to outperform traditional models such as bag-of-words and n-gram models.
4.2. RECURRENT NEURAL NETWORK (RNN)
The Recurrent Neural Network (RNN) is powerful for learning sequential information. An RNN
performs the same operation repeatedly over a stream of data. The RNN comes with the assumption
that the output depends on the previous computations, whereas a traditional neural network
assumes that all inputs (and outputs) are independent of each other. In theory, RNNs can make use
of information in arbitrarily long sequences, but in practice they fail to do so due to the
gradient vanishing/exploding problems [19].
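The vanishing-gradient effect is easy to demonstrate numerically: backpropagating through a plain tanh RNN multiplies the error signal by the recurrent Jacobian diag(1 − h_t²)·W at every step, so with modest recurrent weights the gradient norm shrinks geometrically. A toy illustration (the network size and weights here are arbitrary, chosen for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 50
W = 0.1 * rng.standard_normal((n, n))   # small recurrent weight matrix

# Forward pass of a plain tanh RNN with no external input
h = rng.standard_normal(n)
states = []
for _ in range(steps):
    h = np.tanh(W @ h)
    states.append(h)

# Backward pass: grad_{t-1} = W^T diag(1 - h_t^2) grad_t
grad = np.ones(n)
norms = []
for h_t in reversed(states):
    grad = W.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))
# norms shrinks geometrically: the earliest steps receive almost no signal
```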
4.3. LONG SHORT-TERM MEMORY (LSTM)
LSTMs, a special kind of RNN, are designed to address the gradient vanishing problems of
RNNs; therefore, they are capable of learning long-term dependencies. They were introduced by
[20]. The key to the LSTM is the cell state, which is controlled by three multiplicative gates.
The formulas to update each gate and the cell state for input x_t are defined as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i) (1)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f) (2)
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c) (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (4)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o) (5)
h_t = o_t ⊙ tanh(c_t) (6)
where σ denotes the sigmoid function; ⊙ is the element-wise multiplication operator; the W terms
denote weight matrices; the b terms are bias vectors; i, f and o denote the input, forget and output
gates respectively; c is the cell activation vector; and h represents the hidden state.
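A single update step following equations (1)-(6) can be traced in a few lines of NumPy. The parameter names and sizes below are illustrative assumptions, not taken from this work:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following Eqs. (1)-(6); p holds weights and biases."""
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["bi"])       # Eq. (1)
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["bf"])       # Eq. (2)
    c_hat = np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])   # Eq. (3)
    c = f * c_prev + i * c_hat                                    # Eq. (4)
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["bo"])       # Eq. (5)
    h = o * np.tanh(c)                                            # Eq. (6)
    return h, c

rng = np.random.default_rng(1)
dx, dh = 4, 3                      # toy input and hidden sizes
p = {f"Wx{g}": rng.standard_normal((dh, dx)) for g in "ifco"}
p.update({f"Wh{g}": rng.standard_normal((dh, dh)) for g in "ifco"})
p.update({f"b{g}": np.zeros(dh) for g in "ifco"})
h, c = lstm_step(rng.standard_normal(dx), np.zeros(dh), np.zeros(dh), p)
```

Because the output gate multiplies tanh(c_t), every component of h_t stays inside (−1, 1).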
4.4. BIDIRECTIONAL LSTM
The BiLSTM network, a modification of the LSTM, is designed to capture sequential information with access to both past and future contexts. It consists of two LSTMs, a forward and a backward LSTM, propagating in opposite directions. The forward LSTM reads the stream of input data and computes the forward hidden states, while the backward LSTM reads the sequence in reverse order and creates the backward hidden states. The idea is to create two separate hidden states for each position so that information from both directions of the sequence is memorized. The final output is formed by concatenating the two hidden states. This ability to memorize information over long spans in both directions can bring great improvements in linguistic computation.
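The two-pass scheme can be sketched with a generic step function standing in for the LSTM cell (illustrative only):

```python
def run_direction(xs, step, h0=0.0):
    """Run a recurrent step function over xs, collecting hidden states."""
    h, states = h0, []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional_states(xs, step):
    forward = run_direction(xs, step)                  # left-to-right pass
    backward = run_direction(list(reversed(xs)), step) # right-to-left pass
    backward.reverse()                                 # realign with input order
    # Concatenate per position: each output sees both past and future context.
    return list(zip(forward, backward))
```

With a toy accumulator step such as `lambda x, h: h + x`, the first position already "knows" the sum of the whole sequence through its backward state, which is exactly the access to future context that the BiLSTM provides.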
4.5. CONDITIONAL RANDOM FIELDS (CRF)
CRF is a statistical probabilistic model for structured prediction. It provides a probabilistic framework for labeling and segmenting sequential data based on the conditional approach. Let y be a tag label sequence and x be an input sequence of words. A conditional model labels the observed input sequence x by selecting the label sequence y that maximizes the conditional probability p(y|x), which is computed as:
p(y|x) = \frac{\exp(\mathrm{Score}(x, y))}{\sum_{y'} \exp(\mathrm{Score}(x, y'))}  (7)

where Score is determined by defining log-potentials \psi_i such that:

\mathrm{Score}(x, y) = \sum_i \log \psi_i(y_{i-1}, y_i, x)  (8)

Here there are two kinds of potentials, emission and transition:

\log \psi_i(y_{i-1}, y_i, x) = T(y_{i-1}, y_i) + E(y_i, x_i)  (9)

During training, the log probability \log p(y|x) of the correct tag sequence is maximized.
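A brute-force illustration of these conditional probabilities over a tiny tag set follows (the emission and transition values are hypothetical; real implementations compute the normalizer with the forward algorithm rather than enumerating all sequences):

```python
import itertools
import math

def score(tags, emissions, transitions):
    # Score(x, y): emission log-potentials per position plus transition
    # log-potentials between adjacent tags, as in Eqs. (8)-(9).
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return s

def crf_prob(tags, emissions, transitions, tagset):
    # Eq. (7): normalize exp(Score) over every possible tag sequence.
    n = len(emissions)
    z = sum(math.exp(score(list(y), emissions, transitions))
            for y in itertools.product(tagset, repeat=n))
    return math.exp(score(tags, emissions, transitions)) / z
```

Because the transition term couples adjacent tags, the model can learn, for example, that an E- tag must follow a B- or I- tag, which per-token classifiers cannot enforce.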
5. OUR WORK
Experiments are performed by comparing different neural models on syllable-level rather than word-level text, and the results are compared with a baseline statistical CRF model. In our neural training, we investigate various neural network architectures that automatically learn syllable- and character-level features using bidirectional LSTM, CNN and GRU components, eliminating the need for most feature engineering. Both CNN and bidirectional LSTM networks were investigated for modeling character-level information.
5.1. TRAINING SETUP
Experiments were carried out as syllable-based sequence labeling, in which the LSTM-based neural networks were trained. To convert the NER problem into a sequence tagging problem, a label is assigned to each token (syllable) to indicate the named entities in the training and test data. We used the Myanmar syllable segmentation tool "sylbreak" of [21] to obtain the syllable data representation for syllable-based labeling.
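To give a flavour of rule-based Myanmar syllable segmentation, the sketch below inserts a break before each Myanmar consonant that is neither stacked (preceded by the virama U+1039) nor killed (followed by the asat U+103A). This is a heavily simplified approximation for illustration only; the actual sylbreak tool [21] implements a more complete rule set covering digits, dependent signs and non-Myanmar text:

```python
import re

# Break before a Myanmar consonant (U+1000-U+1021) unless it is part of a
# stacked pair (preceded by virama U+1039) or carries the asat U+103A.
# Simplified approximation; NOT the full sylbreak rule set.
_BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])(?!\u103A)")

def segment_syllables(text):
    return _BREAK.sub(r" \1", text).split()
```

For instance, this splits "ကရင်" into the two syllables "က" and "ရင်", since the final consonant carries the asat and therefore closes the preceding syllable.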
5.2. DATA PREPARATION
As mentioned before, annotated corpora for Myanmar are limited and scattered. For our experiments and for further research on Myanmar NER, we developed a manually annotated Myanmar NER corpus; no other available Myanmar NE corpus contains as much data as ours. To build the corpus, Myanmar news sentences were collected from official online websites and manually annotated with the defined NE tags. We also took Myanmar sentences from the ALT-Parallel-Corpus, one part of the Asian Language Treebank (ALT) project, because its sentences are translated from international news and therefore contain many transliterated names. Currently we have over 60K sentences in total, separated into three sets: 58,043 sentences for training, 1,200 for development and 1,200 for testing. We defined six types of NE tags for manual annotation: PNAME, LOC, ORG, RACE, TIME and NUM.
The PNAME tag indicates person names, including nicknames and aliases, while the LOC tag is used for location entities. Location entities include politically or geographically defined places (cities, provinces, countries, international regions, bodies of water, mountains, etc.) as well as man-made structures such as airports, highways, streets, factories and monuments. Names of organizations (government and non-government organizations, corporations, institutions, agencies, companies and other groups of people defined by an established organizational structure) are annotated with the ORG tag. In the Myanmar language,
some location names and names of national races share the same spelling in the written script; for example, the location name "ကရင်" (Kayin State) and the national race "ကရင်" (Kayin). For this reason, the RACE tag is defined to indicate names of national races. TIME is used for dates, months and years, and the NUM tag indicates number formats in sentences.
Table 1 lists the entity distribution in our in-house NE corpus.

Table 1. Corpus data statistics (number of entities).

NE tag    Train     Dev      Test
PNAME     34,262    622      517
LOC       60,910    1,365    1,211
ORG       19,084    375      281
RACE      5,359     200      161
TIME      28,385    556      530
NUM       19,505    363      433
5.2.1. TAGGING SCHEME
To convert the NER problem into a sequence labeling problem, a label is assigned to each token (syllable) to indicate the NEs in the sentence. The IOBES (Inside, Outside, Begin, End, Single) tagging scheme is used for all experiments.
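For example, assuming the annotations start from a BIO-style encoding, they can be converted to IOBES with a small helper (hypothetical illustration, not the paper's code):

```python
def to_iobes(labels):
    """Convert BIO labels to IOBES: single-token entities become S-,
    and the final token of a multi-token entity becomes E-."""
    out = []
    for i, lab in enumerate(labels):
        if lab == "O":
            out.append(lab)
            continue
        prefix, tag = lab.split("-", 1)
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        continues = nxt == "I-" + tag  # entity keeps going on the next token
        if prefix == "B":
            out.append(("B-" if continues else "S-") + tag)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + tag)
    return out
```

Marking entity boundaries explicitly with E- and S- tags gives the tagger more information per label, which often helps span-based evaluation.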
5.3. EXPERIMENTS WITH CRF
For the syllable-based CRF trainings, an open-source toolkit for linear-chain CRFs [22] was used. Experiments were conducted by tuning various parameters with different features. First, we used only the tokens and their neighbouring contexts as features with a window size of 5; the best F-score of 89.05% was obtained with a cut-off threshold of 3 and hyper-parameter c of 2.5. We then added a small named-entity dictionary and a clue-word list as additional features, which raised the F-score by about 2.4 points, from 89.05% to 91.47% (Table 2). This shows that CRF works best when feature engineering is carefully prepared.
Table 2. Results of the baseline CRF models.

CRF model                                        Precision   Recall   F-measure
Syllable-based                                   89.75       88.36    89.05
Syllable-based + clue-word feature + name list   90.92       92.02    91.47
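For reference, the token-window features described above could be expressed as a CRF++ feature template roughly like the following (a hedged sketch; the exact template file used in the experiments is not given in the paper):

```
# Unigram features: the current syllable and its neighbours
# within a +/-2 window (window size 5)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]

# Bigram feature over adjacent output tags
B
```

Each `%x[row,col]` macro expands to the token at the given relative row of the training column; the `B` line adds the tag-bigram feature that makes the model a linear-chain CRF.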
5.4. NEURAL MODEL TRAINING
Given a Myanmar sentence, the syllable is treated as the basic training unit for label tagging. We first learn an input representation for each token; the learned representation is then fed into a BiLSTM network for sequence learning. On top of the network, a CRF inference layer determines the tag with the maximum probability for each syllable (see Figure 4).
For the implementation of the neural network model training, we utilized the PyTorch-based framework of [23], which provides flexible choices of feature inputs and output structures. Experiments were run on a Tesla K80 GPU; training time varied with the parameter settings of each experiment.
For the input embeddings, we experimented with pre-trained 100-dimensional embeddings at the syllable level and at the character level, respectively. The data used to train the syllable and character embeddings comprise 200K sentences in total. The input embedding has two parts: a character representation and a syllable representation. Character representations were learned from the training data with both a CNN and a bidirectional LSTM network. Given a training sequence, syllables were taken as the basic training unit, projected into a d-dimensional space and initialized as dense vectors.
For optimization, both stochastic gradient descent (SGD) and the Adam algorithm were tried. SGD was run with an initial learning rate of 0.015, momentum of 0.1 and a learning rate decay of 0.05. For Adam, the initial learning rate was set to 0.0015. Both used a batch size of 30.
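Learning-rate decay of this kind is commonly implemented as lr_t = lr_0 / (1 + decay * t) per epoch; the exact functional form used by the training framework is an assumption here, sketched with the values from the text:

```python
def decayed_lr(initial_lr, decay_rate, epoch):
    # Assumed schedule: lr_t = lr_0 / (1 + decay_rate * t).
    # With lr_0 = 0.015 and decay_rate = 0.05 as stated in the text,
    # the rate drops to two thirds of its initial value by epoch 10.
    return initial_lr / (1.0 + decay_rate * epoch)
```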
Figure 4. The architecture of our network
Early stopping was used based on performance on the validation set. A dropout of 0.5 was applied to both the embedding and output layers to mitigate overfitting during training. The hidden dimension was set to 200 in all experiments. For the syllable-based BiLSTM_CRF that used a CNN for the character representation, the best accuracy appeared at epoch 82; for the syllable-based BiLSTM_CRF that used a bidirectional LSTM for the character representation, the best accuracy appeared at epoch 77.
5.5. EXPERIMENTAL RESULTS AND ANALYSIS
Table 3 lists the best experimental results of the different model structures. For both BiLSTM and BiLSTM_CRF on syllable-level NER, we relied on an LSTM or CNN model to learn character features first and then concatenated them with the syllable embeddings as the input to the BiLSTM or BiLSTM_CRF model. The character features learned by the LSTM were not as good as those learned by the CNN in our experiments. When a GRU was applied at the syllable level, the F-score dropped continuously as the epochs increased, which was contrary to our expectation, so those results are not listed in Table 3. As Table 3 shows, the SGD optimizer performed slightly worse than Adam. Random embeddings also did not perform better than pre-trained word2vec embeddings, so the results in Table 3 are from pre-trained embeddings. For the syllable-based experiments, we can see that BiLSTM_CRF performs slightly better than BiLSTM.
As for the syllable-based system, since syllables as inputs carry more information than individual characters, it is reasonable that it performs better than the character-based system. On syllable-level data, using a CNN to extract character features as additional inputs produces better results than using no character features or extracting them with an LSTM (the best performance is highlighted in Table 3).
Comparing the results in Table 3 with those in Table 2, we can see that the neural models perform better than the statistical CRF models: syllable-based neural models without additional feature engineering outperform the CRF models, and only the CRF model with additional features approaches the neural results. Although these experiments give promising results, the amount of data used for neural network training is small compared to NER corpora for other languages, and more data would normally help neural networks learn better. Moreover, because of time and data limits, the hyper-parameters used in the experiments may not be optimal; tuning them requires modeling experience as well as a large amount of trial and error.
Table 3. Results comparison between different neural models (Precision/Recall/F-measure).

Model                              Dev                  Test
BiLSTM (SGD)                       89.40/88.48/88.94    89.99/89.97/89.98
BiLSTM (Adam)                      89.32/88.47/88.89    90.71/91.64/91.17
BiLSTM_CRF (Adam) with CharCNN     91.06/90.45/90.75    94.66/94.99/94.82
BiLSTM_CRF (Adam) with CharLSTM    90.17/89.68/89.92    94.71/94.83/94.77
BiLSTM_CRF (SGD) with CharCNN      89.57/90.37/89.97    93.01/94.19/93.61
BiLSTM_CRF (SGD) with CharLSTM     84.82/85.65/85.65    90.24/91.83/91.03
6. CONCLUSION
In this paper, we have explored the effectiveness of neural networks on Myanmar NER and conducted a systematic comparison between neural approaches and a traditional CRF approach on our manually annotated NE corpus. The experimental results revealed that the performance of neural networks on Myanmar NER is quite promising, especially since the neural models used no handcrafted features or additional resources. Although our NE corpus is not very large, the neural network models already outperform the CRF models for Myanmar NER, and we believe that with more data and more experiments neural networks can learn to produce even better results. The experiments also show that the CRF models reached accuracy similar to the syllable-based neural models only after additional name-list and clue-word features were added.
This exploration is the first work to apply neural networks to Myanmar NER. It showed that a BiLSTM_CRF network on syllable-level data, combined with a CNN to extract character features, can facilitate Myanmar NER. With more data and more experiments, better results can be expected, and we will keep exploring neural networks for other Myanmar NLP tasks such as POS tagging and word segmentation.
ACKNOWLEDGEMENTS
The ASEAN IVO (http://www.nict.go.jp/en/asean_ivo/index.html) project "Open Collaboration for Developing and Using Asian Language Treebank" was involved in the production of the contents of this publication and was financially supported by NICT (http://www.nict.go.jp/en/index.html).
REFERENCES
[1] Alireza Mansouri, Lilly Suriani Affendey, Ali Mamat, (2008), “Named Entity Recognition
Approaches”, Proceedings of IJCSNS International Journal of Computer Science and Network
Security 8(2), 339–344.
[2] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P.
Kuksa, (2011) “Natural language processing (almost) from scratch”, CoRR, abs/1103.0398.
[3] Zhiheng Huang, Wei Xu, and Kai Yu, (2015), “Bidirectional LSTM-CRF models for sequence
tagging”, CoRR, abs/1508.01991.
[4] Thi Thi Swe, Hla Hla Htay, (2010), “A Hybrid Methods for Myanmar Named Entity Identification and Transliteration into English”, http://www.lrec-conf.org/proceedings/lrec2010/workshops/W16.pdf.
[5] Thida Myint, Aye Thida, (2014), “Named Entity Recognition and Transliteration in Myanmar Text”,
PhD Research, University of Computer Studies, Mandalay.
[6] Mo H.M., Nwet K.T., Soe K.M. (2017) “CRF-Based Named Entity Recognition for Myanmar
Language”. In: Pan JS., Lin JW., Wang CH., Jiang X. (eds) Genetic and Evolutionary Computing.
ICGEC 2016. Advances in Intelligent Systems and Computing, vol 536. Springer, Cham.
[7] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami and Chris Dyer, (2016), “Neural Architectures for Named Entity Recognition”, Proceedings of NAACL-HLT 2016, pages 260–270, Association for Computational Linguistics.
[8] Jason P.C. Chiu and Eric Nichols, (2016), “Named Entity Recognition with Bidirectional LSTM-
CNNs”, Transactions of the Association for Computational Linguistics, vol. 4, pp. 357–370.
[9] Daniele Bonadiman, Aliaksei Severyn and Alessandro Moschitti, (2015), “Deep Neural Networks for
Named Entity Recognition in Italian”.
[10] Weihua Wang, Feilong Bao and Guanglai Gao, (2016), “Mongolian Named Entity Recognition with
Bidirectional Recurrent Neural Networks”, IEEE 28th International Conference on Tools with
Artificial Intelligence.
[11] Shotaro Misawa, Motoki Taniguchi, Yasuhide Miura and Tomoko Ohkuma, (2017), “Character-based
Bidirectional LSTM-CRF with words and characters for Japanese Named Entity Recognition”,
Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 97–102,
Association for Computational Linguistics.
[12] Anh L.T, Arkhipov M.Y., and Burtsev M.S., (2017), “Application of a Hybrid Bi-LSTM-CRF model
to the task of Russian Named Entity Recognition”.
[13] Antonio Jimeno Yepes and Andrew MacKinlay, (2016), “NER for Medical Entities in Twitter using
Sequence Neural Networks”, Proceedings of Australasian Language Technology Association
Workshop, pages 138−142.
[14] Nut Limsopatham and Nigel Collier, (2016), “Bidirectional LSTM for Named Entity Recognition in
Twitter Messages”, Proceedings of the 2nd Workshop on Noisy User-generated Text, pages 145–152.
[15] Lishuang Li, Like Jin, Zhenchao Jiang, Dingxin Song and Degen Huang, (2015), “Biomedical Named Entity Recognition Based on Extended Recurrent Neural Networks”, IEEE International Conference on Bioinformatics and Biomedicine.
[16] Zhehuan Zhao, Zhihao Yang, Ling Luo, Yin Zhang, Lei Wang, Hongfei Lin and Jian Wang, (2015), “ML-CNN: a novel deep learning based disease named entity recognition architecture”, IEEE International Conference on Bioinformatics and Biomedicine.
[17] Zin Maung Maung and Yoshiki Mikami, (2008), “A rule-based syllable segmentation of Myanmar text”, In IJCNLP, Workshop on NLP for Less Privileged Languages, pages 51–58.
[18] Tin Htay Hlaing and Yoshiki Mikami, (2014), “Automatic syllable segmentation of myanmar texts
using finite state transducer”, ICTer, 6(2).
[19] Yoshua Bengio, Patrice Simard, and Paolo Frasconi, (1994), “Learning long-term dependencies with
gradient descent is difficult”, IEEE transactions on neural networks, 5(2):157-166.
[20] Sepp Hochreiter and Jürgen Schmidhuber, (1997), “Long short-term memory”, Neural computation, 9(8):1735–1780.
[21] Ye Kyaw Thu, (2017), “Syllable segmentation tool for Myanmar language (Burmese)”, https://github.com/ye-kyaw-thu/sylbreak.
[22] Taku Kudo, (2005), “CRF++: Yet another CRF toolkit”, software available at http://crfpp.sourceforge.net.
[23] Jie Yang, (2017), https://github.com/jiesutd/PyTorchSeqLabel
[24] Nils Reimers and Iryna Gurevych, (2017), “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”, EMNLP.
[25] Xuezhe Ma and Eduard Hovy, (2016), “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”, ACL 2016, Berlin, Germany.
AUTHORS
Hsu Myat Mo received her B.C.Sc (Hons) in 2010 and her M.C.Sc (Credit) in 2012. Currently she is doing her Ph.D. research on Myanmar NER in the Natural Language Processing Lab at the University of Computer Studies, Yangon.
Dr. Khin Mar Soe received her Ph.D. (IT) in 2005. Currently she is a Professor and the Head of the Natural Language Processing Lab at the University of Computer Studies, Yangon. She has supervised Master's theses and Ph.D. research on natural language processing. Moreover, she participated in the ASEAN MT project, a machine translation project for Southeast Asian languages.