N-gram language models are popular and widely used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques for dealing with the data sparsity problem when building a language model, and Kneser-Ney is among the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results and outperformed previous methods. In this work, we propose an improved method using a Kneser-Ney smoothed n-gram language model for grammar checking and perform a comparative performance analysis of the Kneser-Ney and Witten-Bell smoothing techniques for this task. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram language models for checking the grammatical correctness of Bangla sentences.
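To make the smoothing idea concrete, here is a minimal, self-contained sketch (not the paper's actual implementation) of an interpolated Kneser-Ney bigram model in Python; the discount d = 0.75 and the toy corpus in the usage example are illustrative assumptions:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(corpus, d=0.75):
    """Build an interpolated Kneser-Ney bigram model from tokenised sentences."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])            # history counts
        bigrams.update(zip(toks, toks[1:]))
    left_contexts = defaultdict(set)   # distinct histories each word follows
    followers = defaultdict(set)       # distinct words following each history
    for (v, w) in bigrams:
        left_contexts[w].add(v)
        followers[v].add(w)
    total_types = len(bigrams)

    def prob(w, v):
        # Continuation probability: fraction of bigram types ending in w.
        p_cont = len(left_contexts[w]) / total_types
        if unigrams[v] == 0:
            return p_cont                      # unseen history: back off
        discounted = max(bigrams[(v, w)] - d, 0) / unigrams[v]
        lam = d * len(followers[v]) / unigrams[v]   # redistributed mass
        return discounted + lam * p_cont
    return prob

p = kneser_ney_bigram([["the", "cat"], ["the", "dog"]])
```

The continuation probability rewards words that follow many distinct histories, which is what distinguishes Kneser-Ney from purely count-based smoothing such as Witten-Bell.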
AUTOMATED WORD PREDICTION IN BANGLA LANGUAGE USING STOCHASTIC LANGUAGE MODELS (ijfcstjournal)
Word completion and word prediction are two important typing aids that benefit users who type on a keyboard or similar device, and they can have a profound impact on typing for people with disabilities. Our work addresses word prediction for Bangla sentences using stochastic, i.e. N-gram, language models such as unigram, bigram, trigram, deleted interpolation, and backoff models, auto-completing a sentence by predicting the correct next word; this saves typing time and keystrokes and also reduces misspellings. We use a large Bangla corpus covering different word types to predict the correct word with as much accuracy as possible, and we have found promising results. We hope that our work will serve as a baseline for automated Bangla typing.
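A minimal sketch of the backoff idea mentioned above (illustrative only; this is the simple "stupid backoff" variant with an assumed discount factor, and the deleted-interpolation model evaluated in the paper is not shown):

```python
from collections import Counter

def build_predictor(sentences, alpha=0.4):
    """Next-word predictor that backs off from trigram to bigram to unigram."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    vocab = [w for w in uni if w != "<s>"]

    def score(w, ctx):
        v1, v2 = ctx
        if tri[(v1, v2, w)] > 0:                   # trigram evidence
            return tri[(v1, v2, w)] / bi[(v1, v2)]
        if bi[(v2, w)] > 0:                        # back off to bigram
            return alpha * bi[(v2, w)] / uni[v2]
        return alpha * alpha * uni[w] / sum(uni.values())  # unigram floor

    def predict(context):
        return max(vocab, key=lambda w: score(w, tuple(context[-2:])))
    return predict
```

For example, after observing "i like tea" twice and "i like coffee" once, the predictor completes "i like" with "tea".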
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Source and target word segmentation and alignment is a primary step in the statistical learning of a transliteration system. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, we asked whether this would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method should apply equally well to most written Indic languages): our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
Two Level Disambiguation Model for Query Translation (IJECEIAES)
Selecting the most suitable translation among all the candidates returned by a bilingual dictionary has always been quite a challenging task in cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation of a user query, but algorithms using such statistics have certain shortcomings, which are the focus of this paper. We propose a novel method for ambiguity resolution, named the "two-level disambiguation model". At the first level of disambiguation, the model properly weighs the importance of the translation alternatives of query terms obtained from the dictionary; the importance factor measures the probability of a translation candidate being selected as the final translation of a query term, which removes the problem of taking a binary decision on translation candidates. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translations of all query terms simultaneously, taking the weights of the translation alternatives into account. This is contrary to previous research, which selects the translation for each word of the source-language query independently. Experimental results on English-Hindi cross-language information retrieval show that the proposed two-level disambiguation model achieved 79.53% and 83.50% of monolingual translation performance, a 21.11% and 17.36% improvement over greedy disambiguation strategies, in terms of MAP for short and long queries respectively.
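A toy sketch of the second-level idea, scoring all candidate translation combinations jointly by pairwise co-occurrence rather than picking per-term translations greedily (the dictionary, co-occurrence counts, and cohesion score here are illustrative assumptions, not the paper's actual weighting scheme):

```python
from itertools import product

def disambiguate(query_terms, dictionary, cooc):
    """Pick the translation combination maximising pairwise co-occurrence.

    dictionary: source term -> list of candidate translations
    cooc: frozenset({t1, t2}) -> co-occurrence count of two translations
    """
    candidates = [dictionary[t] for t in query_terms]

    def cohesion(combo):
        # Sum co-occurrence over every pair of chosen translations.
        return sum(cooc.get(frozenset({a, b}), 0)
                   for i, a in enumerate(combo) for b in combo[i + 1:])

    return max(product(*candidates), key=cohesion)
```

For a query like "bank money", the combination whose members co-occur most strongly wins, resolving "bank" toward its financial sense.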
Brill's Rule-based Part of Speech Tagger for Kadazan (idescitation)
This paper presents a part-of-speech (POS) tagger for the Kadazan language built using Brill's approach, also known as transformation-based error-driven learning. Kadazan was chosen because no POS tagger has yet been developed for this language. This study was therefore carried out to develop a POS tagger for Kadazan that can tag a Kadazan corpus systematically, help reduce the ambiguity problem, and at the same time serve as a language learning tool. The main objective of this study is to automate the tagging process for Kadazan. Brill's approach is an enhanced version of the original rule-based approach in which tags are transformed according to a set of predefined rules: the rules transform wrong tags in the corpus into correct ones. To achieve this goal, several objectives were set: to create specific lexical and contextual rules for Kadazan by applying Brill's rule-based approach, and to evaluate the effectiveness of the resulting Kadazan POS tagger. The tagging process is divided into four main phases. In the first phase, a new untagged text is fed into the system. In the second phase, the input text passes through the initial-state annotator, which tags every word in the corpus with its most likely tag and produces a temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus to detect any errors. In the last phase, the rules are applied to correct the detected errors and fix the temporary corpus. The tagger was trained on two Kadazan children's story books containing 2069 words, and evaluation was done by comparing the tagger's output with manual tagging. The Kadazan POS tagger achieved around 93% accuracy. This study shows how Brill's tagging approach can be used to identify tags for the Kadazan language.
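The initial-state annotation and rule-application phases described above can be sketched as follows (the lexicon, default tag, and example rule are toy assumptions, not the actual Kadazan resources):

```python
def initial_annotate(words, lexicon, default="NOUN"):
    """Initial-state annotator: assign each word its most likely (lexicon) tag,
    falling back to a default tag for unknown words."""
    return [lexicon.get(w, default) for w in words]

def apply_brill_rules(tags, words, rules):
    """Apply contextual transformation rules in order.

    Each rule is (from_tag, to_tag, condition) where
    condition(i, words, tags) -> bool decides if position i is transformed.
    """
    tags = list(tags)
    for from_tag, to_tag, cond in rules:
        for i in range(len(tags)):
            if tags[i] == from_tag and cond(i, words, tags):
                tags[i] = to_tag
    return tags

# Toy example: "can" is initially tagged NOUN, then a contextual rule
# retags NOUN as VERB when the previous tag is TO.
words = ["to", "can"]
tags = initial_annotate(words, {"to": "TO"})
rules = [("NOUN", "VERB", lambda i, w, t: i > 0 and t[i - 1] == "TO")]
```

In real Brill tagging the rules themselves are learned by comparing the temporary corpus against the goal corpus and keeping the transformation that removes the most errors; only the application step is shown here.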
ENHANCING THE PERFORMANCE OF SENTIMENT ANALYSIS SUPERVISED LEARNING USING SEN... (csandit)
Sentiment analysis (SA) and machine learning techniques are combined to understand the attitude of a text's writer as implied in the text. Although SA is a challenging problem in itself, it is especially challenging for the Arabic language. In this paper, we enhance sentiment analysis for Arabic. Our approach begins with special pre-processing steps; we then adopt the sentiment keywords co-occurrence measure (SKCM) as a sentiment-based feature selection method. This feature selection method was applied to three sentiment corpora using an SVM classifier. We compared our approach with some traditional methods followed by most SA works, and the experimental results were very promising for enhancing SA accuracy.
Phonetic Recognition In Words For Persian Text To Speech Systems (paperpublications3)
Abstract: Interest in text-to-speech synthesis has increased around the world. Text-to-speech systems have been developed for many popular languages such as English, Spanish, and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research on Persian is still in its infancy. The Persian language has many difficulties and exceptions that increase the complexity of text-to-speech systems: for example, short vowels are absent from written text, and homograph words exist. In this paper we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy in words, semantic relations, and grammatical rules for finding the proper phonetics.
Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title: Phonetic Recognition In Words For Persian Text To Speech Systems
Author: Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
Segmentation Words for Speech Synthesis in Persian Language Based On Silence (paperpublications3)
Abstract: In speech synthesis for text-to-speech systems, words are usually broken into parts, and the recorded sound of each part is used to play the word. This paper uses the silences in a word's pronunciation to obtain better speech quality. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silences in intonation, dividing words at silent regions and then assigning the equivalent sound of each part, so that joining the parts is reliable and the resulting speech is smoother. The paper concerns the Persian language, but the method is extendable to other languages. The method has been tested with a MOS test, and its intelligibility, naturalness, and fluidity are better.
Keywords: TTS, SBS, Syllable, Diphone.
An ongoing project on natural language processing (using Python and the NLTK toolkit) which focuses on extracting sentiment from a question and its title on www.stackoverflow.com and determining the polarity. Based on these findings, it is verified whether the rules and guidelines imposed by the SO community on its users are strictly followed.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning (ijtsrd)
Natural language generation (NLG) is one of the major fields of natural language processing (NLP); NLG can generate natural language from a machine representation. Generating suggestions for a sentence is particularly difficult for Indian languages, one major reason being that they are morphologically rich and their word order is roughly the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, we can generate a set of possible corrections for the erroneous part of a sentence. To effectively generate a set of sentences with the same meaning as the original sentence, the deep learning (DL) approach is to train a model on this task; e.g. we need thousands of examples of inputs and outputs with which to train a model. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019. URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
Compression-Based Parts-of-Speech Tagger for The Arabic Language (CSCJournals)
This paper explores the use of compression-based models to train a part-of-speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial-Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger: the first models were trained using silver-standard data from two different Arabic POS taggers, and the second model utilised the BAAC corpus, a 50K-term manually annotated MSA corpus, on which the PPM tagger achieved an accuracy of 93.07%. The tag-based models were also used to evaluate the performance of the new tagger by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the use of silver-standard models led to a reduction in tag-based compression quality by an average of 0.43%, whereas the use of the gold-standard model increased tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
A scoring rubric for automatic short answer grading system (TELKOMNIKA Journal)
During the past decades, research on automatic grading has become an interesting topic. These studies focus on how machines can help humans assess students' learning outcomes. Automatic grading enables teachers to assess students' answers more objectively, consistently, and quickly. The essay model has two different types: the long essay and the short answer. Almost all previous research developed automatic essay grading (AEG) rather than automatic short answer grading (ASAG). This study aims to assess the similarity of short answers to the questions and reference answers in Indonesian without any semantic language tools. The pre-processing steps consist of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring sentence similarity with string-based similarity methods and a keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers, and 224 student answers. The experimental results show that the proposed approach achieves a Pearson correlation between 0.65419 and 0.66383, with a Mean Absolute Error (MAE) between 0.94994 and 1.24295. The proposed approach also raises the correlation value and decreases the error value for each method.
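A minimal sketch of such a scoring rubric, combining one possible string-based similarity (token-level Jaccard) with keyword matching; the weights, maximum score, and example answers are illustrative assumptions, not the paper's actual rubric:

```python
def score_answer(student, reference, keywords,
                 w_sim=0.5, w_kw=0.5, max_score=5.0):
    """Scoring rubric sketch: string similarity plus keyword matching."""
    s_toks = set(student.lower().split())
    r_toks = set(reference.lower().split())
    union = s_toks | r_toks
    jaccard = len(s_toks & r_toks) / len(union) if union else 0.0
    kw = [k.lower() for k in keywords]
    kw_ratio = sum(k in s_toks for k in kw) / len(kw) if kw else 0.0
    # Weighted combination of the two evidence sources, scaled to the rubric.
    return max_score * (w_sim * jaccard + w_kw * kw_ratio)
```

In the paper's pipeline, the inputs would first pass through case folding, tokenization, stemming, and stopword removal; those steps are omitted here for brevity.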
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE (ijnlc)
Named entity recognition (NER) for the Myanmar language is essential to Myanmar natural language processing research. In this work, NER for Myanmar is treated as a sequence tagging problem, and the effectiveness of deep neural networks on NER for Myanmar has been investigated. Experiments were performed by applying deep neural network architectures to syllable-level Myanmar contexts. The very first manually annotated NER corpus for the Myanmar language is also constructed and proposed. In developing our in-house NER corpus, sentences from online news websites and sentences from the ALT-Parallel-Corpus were used; this ALT corpus is part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for Myanmar. The experimental results show that these neural sequence models produce promising results compared to the baseline CRF model; among the neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to discover the effectiveness of neural network approaches for Myanmar text processing and to promote further research on this understudied language.
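Both the CRF baseline and the CRF layer on top of the bidirectional LSTM decode the best tag sequence with the Viterbi algorithm; here is a pure-Python sketch over additive scores (the states, transition/emission scores, and observations below are toy assumptions, not trained Myanmar-model parameters):

```python
def viterbi(obs, states, start, trans, emit):
    """Max-sum Viterbi decoding, the decoder behind CRF-style taggers.

    start[s], trans[p][s], emit[s][o] are additive (e.g. log) scores;
    unseen emissions get a large negative score.
    """
    V = [{s: start[s] + emit[s].get(obs[0], -1e9) for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + trans[p][s])
            col[s] = prev[best] + trans[best][s] + emit[s].get(o, -1e9)
            ptr[s] = best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):           # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

In the neural setting the emission scores come from the BiLSTM outputs and the transition scores from the CRF layer; only the decoding step is shown here.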
Intent Classifier with Facebook fastText
Facebook Developer Circle, Malang
22 February 2017
This is the slide deck for a Facebook Developer Circle meetup, intended for beginners.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short, important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and for other Indian languages, but much less effort has been made for Hindi. Here, we discuss various techniques covering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
Word embedding, vector space model, language modelling, neural language model, Word2Vec, GloVe, fastText, ELMo, BERT, DistilBERT, RoBERTa, sBERT, Transformer, attention
A performance of SVM with modified Lesk approach for word sense disambiguation... (eSAT Journals)
Abstract: WSD is a technique used to find the correct meaning of a given word in any human language. Every human language has the problem of word ambiguity. Finding the correct meaning of an ambiguous word is easy for humans, but for a machine it is a great challenge. A number of works have been done on WSD, but not enough for the Hindi language. My objective is to train the system so that it can easily find the correct meaning of any ambiguous word in Hindi. For this purpose I use an existing technique, the modified Lesk approach, and feed its output to an SVM to get better results, showing that the SVM performs better than the modified Lesk approach alone. In this paper I take nine ambiguous Hindi words and three different databases to show the results. Keywords: Support Vector Machine, NLP, Word Sense Disambiguation, Modified Lesk approach, Comparison
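A minimal sketch of the (simplified) Lesk idea underlying the approach: choose the sense whose gloss overlaps most with the surrounding context. The glosses and context below are toy English stand-ins for the paper's Hindi resources, and the paper's modified Lesk and SVM stages are not reproduced:

```python
def simplified_lesk(context, sense_glosses):
    """Return the sense whose gloss shares the most words with the context.

    context: list of context words around the ambiguous word
    sense_glosses: sense label -> gloss string
    """
    context_words = set(w.lower() for w in context)

    def overlap(sense):
        gloss_words = set(sense_glosses[sense].lower().split())
        return len(gloss_words & context_words)

    return max(sense_glosses, key=overlap)
```

In the paper's pipeline, features derived from such overlap scores would then be passed to the SVM classifier for the final sense decision.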
ADVERSARIAL GRAMMATICAL ERROR GENERATION: APPLICATION TO PERSIAN LANGUAGE (kevig)
Grammatical error correction (GEC) greatly benefits from large quantities of high-quality training data.
However, the preparation of a large amount of labelled training data is time-consuming and prone to
human errors. These issues have become major obstacles in training GEC systems. Recently, the
performance of English GEC systems has drastically been enhanced by the application of deep neural
networks that generate a large amount of synthetic data from limited samples. While GEC has extensively
been studied in languages such as English and Chinese, no attempts have been made to generate synthetic
data for improving Persian GEC systems. Given the substantial grammatical and semantic differences of the Persian language, in this paper we propose a new deep learning framework to generate a sufficiently large set of grammatically incorrect synthetic sentences for training Persian GEC systems. A modified version
of sequence generative adversarial net with policy gradient is developed, in which the size of the model is
scaled down and the hyperparameters are tuned. The generator is trained in an adversarial framework on
a limited dataset of 8000 samples. Our proposed adversarial framework achieved bilingual evaluation
understudy (BLEU) scores of 64.5% on BLEU-2, 44.2% on BLEU-3, and 21.4% on BLEU-4, and
outperformed the conventional supervised-trained long short-term memory using maximum likelihood
estimation and recently proposed sequence labeler using neural machine translation augmentation. This
shows promise toward improving the performance of GEC systems by generating a large amount of
training data.
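The BLEU-n scores reported above can be computed, for a single reference, roughly as follows (a simplified sketch with uniform n-gram weights and the standard brevity penalty; real evaluations typically aggregate clipped counts at the corpus level):

```python
import math
from collections import Counter

def bleu(candidate, reference, n=2):
    """Sentence-level BLEU-n against a single reference (simplified)."""
    logs = []
    for k in range(1, n + 1):
        cand = Counter(zip(*[candidate[i:] for i in range(k)]))
        ref = Counter(zip(*[reference[i:] for i in range(k)]))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0
        logs.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(logs) / n)
```

A candidate identical to its reference scores 1.0; a shorter candidate with perfect n-gram precision is still penalised by the brevity penalty.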
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets of word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and the similarities obtained by different word embedding methods.
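The intrinsic evaluation described above reduces to computing cosine similarities between embedding pairs and correlating them with human judgements; here is a self-contained sketch with toy vectors (Pearson correlation shown; Spearman rank correlation is also common on these benchmarks):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def embedding_correlation(embeddings, pairs, gold):
    """Correlate model cosine similarities with human similarity judgements."""
    sims = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
    return pearson(sims, gold)
```

A higher correlation against the benchmark's human scores indicates embeddings that better capture word similarity.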
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
Abstract: In speech synthesis in text to speech systems, the words usually break to different parts and use from recorded sound of each part for play words. This paper use silent in word's pronunciation for better quality of speech. Most algorithms divide words to syllable and some of them divide words to phoneme, but This paper benefit from silent in intonation and divide words at silent region and then set equivalent sound of each parts whereupon joining the parts is trusty and speech quality being more smooth . this paper concern Persian language but extendable to another language. This method has been tested with MOS test and intelligibility, naturalness and fluidity are better.
Keywords: TTS, SBS, Syllable, Diphone.
An on-going project on Natural Language Processing (using Python and the NLTK toolkit), which focuses on extracting the sentiment from a question and its title on www.stackoverflow.com and determining the polarity. Based on the above findings, it is verified whether the rules and guidelines imposed by the SO community on its users are strictly followed or not.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP); NLG generates natural language from a machine representation. Generating suggestions for a sentence, especially in Indian languages, is difficult. One major reason is that they are morphologically rich and their word order is the reverse of English. Using a deep learning approach with Long Short-Term Memory (LSTM) layers, we can generate a possible set of corrections for the erroneous part of a sentence. To effectively generate a set of sentences with a meaning equivalent to the original sentence, the Deep Learning (DL) approach is to train a model on this task, e.g. with thousands of example inputs and outputs. Veena S Nair | Amina Beevi A "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
Compression-Based Parts-of-Speech Tagger for The Arabic LanguageCSCJournals
This paper explores the use of Compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction-by-Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger, the first models were trained using a silver-standard data from two different POS Arabic taggers, and the second model utilised the BAAC corpus, which is a 50K term manually annotated MSA corpus, where the PPM tagger achieved an accuracy of 93.07%. Also, the tag-based models were utilised to evaluate the performance of the new tagger by first tagging different Classical Arabic corpora and Modern Standard Arabic corpora then compressing the text using tag-based compression models. The results show that the use of silver-standard models has led to a reduction in the quality of the tag-based compression by an average of 0.43%, whereas the use of the gold-standard model has increased the tag-based compression quality by an average of 4.61% when used to tag Modern Standard Arabic text.
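The compression-based idea above can be illustrated with a much simpler stand-in for PPM: train one character-level bigram model per tag and assign the tag whose model encodes the word in the fewest bits. The tags and training words below are toy English examples, not Arabic data from the BAAC corpus, and the bigram model is a drastic simplification of PPM's adaptive context mixing.

```python
import math
from collections import Counter, defaultdict

def train(tagged_words):
    # one character-bigram counter per tag, over '^word$' with boundary markers
    models = defaultdict(Counter)
    for word, tag in tagged_words:
        s = "^" + word + "$"
        models[tag].update(zip(s, s[1:]))
    return models

def code_length(word, counts):
    # code length in bits under a Laplace-smoothed character-bigram model
    s = "^" + word + "$"
    total = sum(counts.values())
    return sum(-math.log2((counts[bg] + 1) / (total + 256)) for bg in zip(s, s[1:]))

def best_tag(word, models):
    # the tag whose model "compresses" the word best wins
    return min(models, key=lambda t: code_length(word, models[t]))

models = train([("running", "VERB"), ("walking", "VERB"),
                ("station", "NOUN"), ("nation", "NOUN")])
```

Here `best_tag("jumping", models)` prefers VERB because the `in`/`ng` bigrams were seen under that tag, mirroring how a compression model rewards familiar character contexts.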
A scoring rubric for automatic short answer grading systemTELKOMNIKA JOURNAL
During the past decades, research on automatic grading has become an interesting issue. These studies focus on how machines can help humans assess students' learning outcomes. Automatic grading enables teachers to assess students' answers more objectively, consistently, and quickly. The essay model in particular has two different types, i.e. long essay and short answer. Most previous research developed automatic essay grading (AEG) rather than automatic short answer grading (ASAG). This study aims to assess the similarity of short answers to the questions and reference answers in Indonesian without any semantic language tool. The pre-processing steps consist of case folding, tokenization, stemming, and stopword removal. The proposed approach is a scoring rubric obtained by measuring sentence similarity using string-based similarity methods and a keyword matching process. The dataset used in this study consists of 7 questions, 34 alternative reference answers and 224 student answers. The experimental results show that the proposed approach achieves a Pearson's correlation value between 0.65419 and 0.66383, with a Mean Absolute Error (MAE) between 0.94994 and 1.24295. The proposed approach also raises the correlation value and decreases the error value for each method.
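A scoring rubric combining string-based similarity with keyword matching, as described above, can be sketched as follows. The Jaccard token overlap, the 0.6/0.4 weights, and the example rubric are illustrative assumptions, not the paper's exact formulas.

```python
def tokens(text):
    return text.lower().split()

def jaccard(a, b):
    # string-based similarity: token-set overlap between two answers
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def grade(student, references, keywords, max_score=4):
    # similarity part: best overlap against any alternative reference answer
    sim = max(jaccard(student, ref) for ref in references)
    # keyword part: fraction of rubric keywords found in the answer
    kw = sum(1 for k in keywords if k in tokens(student)) / len(keywords)
    # weighted rubric score; the 0.6/0.4 weights are illustrative
    return round(max_score * (0.6 * sim + 0.4 * kw), 2)

refs = ["a stack is a last in first out data structure"]
score = grade("a stack is a last in first out structure", refs, ["stack", "lifo"])
```

In a real system the pre-processing steps from the abstract (case folding, tokenization, stemming, stopword removal) would replace the naive `tokens` function.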
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGEijnlc
Named Entity Recognition (NER) for the Myanmar language is essential to Myanmar natural language processing research. In this work, NER for Myanmar is treated as a sequence tagging problem and the effectiveness of deep neural networks on the task is investigated. Experiments apply deep neural network architectures to syllable-level Myanmar contexts. The very first manually annotated NER corpus for the Myanmar language is also constructed and proposed. In developing our in-house NER corpus, sentences from an online news website as well as sentences from the ALT-Parallel-Corpus are used. This ALT corpus is one part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for the Myanmar language. The experimental results show that these neural sequence models produce promising results compared to the baseline CRF model. Among the neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to discover the effectiveness of neural network approaches to Myanmar text processing and to promote further research on this understudied language.
Intent Classifier with Facebook fastText
Facebook Developer Circle, Malang
22 February 2017
This is a slide deck for a Facebook Developer Circle meetup, aimed at beginners.
A Survey of Various Methods for Text SummarizationIJERD Editor
Document summarization means retrieving short and important text from a source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been devoted to Hindi. Here, we discuss various techniques covering features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
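One of the classic extractive techniques such surveys cover is word-frequency sentence scoring. The sketch below is a generic illustration of that idea, not any specific method from the survey: each sentence is scored by the summed corpus frequency of its words, and the top-scoring sentences form the summary.

```python
from collections import Counter

def summarize(text, n=1):
    # naive extractive summariser: score each sentence by summed word frequency
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w.lower()] for w in s.split()),
                    reverse=True)
    return ". ".join(ranked[:n]) + "."

doc = "The cat sat. The cat sat on the mat. Dogs bark."
```

Longer sentences are favoured by this raw sum; real systems normalise by sentence length and remove stopwords first to counter exactly the redundancy and ambiguity issues the survey discusses.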
Word embedding, Vector space model, language modelling, Neural language model, Word2Vec, GloVe, Fasttext, ELMo, BERT, distilBERT, roBERTa, sBERT, Transformer, Attention
A performance of svm with modified lesk approach for word sense disambiguatio...eSAT Journals
Abstract: WSD is a technique used to find the correct meaning of a given word in any human language. Every human language has the problem of word ambiguity. Finding the correct meaning of an ambiguous word is easy for humans but a great challenge for machines. A number of works have been done on WSD, but not enough for the Hindi language. Our objective is to train the system so that it can easily find the correct meaning of any ambiguous Hindi word. For this purpose we use an existing technique, the modified Lesk approach, and feed its output to an SVM to get a better result, showing that the SVM performs better than the modified Lesk approach alone. In this paper we take nine ambiguous Hindi words and three different databases to show the results. Keywords: Support Vector Machine, NLP, Word Sense Disambiguation, Modified Lesk approach, Comparison
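The Lesk family of algorithms that the paper builds on picks the sense whose dictionary gloss shares the most words with the ambiguous word's context. The sketch below is the simplified Lesk step only (the SVM stage is omitted), with an invented English example standing in for the paper's Hindi setting.

```python
def simplified_lesk(context_words, senses):
    # senses maps a sense label to its dictionary gloss; pick the sense whose
    # gloss shares the most words with the context (simplified Lesk)
    ctx = {w.lower() for w in context_words}
    return max(senses, key=lambda s: len(ctx & set(senses[s].lower().split())))

senses = {
    "financial": "an institution that accepts money deposits and makes loans",
    "river": "the sloping land beside a body of water",
}
sense = simplified_lesk(["money", "deposits", "loan"], senses)
```

In the paper's pipeline, overlap counts like these would become features fed to the SVM rather than the final decision.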
ADVERSARIAL GRAMMATICAL ERROR GENERATION: APPLICATION TO PERSIAN LANGUAGEkevig
Grammatical error correction (GEC) greatly benefits from large quantities of high-quality training data.
However, the preparation of a large amount of labelled training data is time-consuming and prone to
human errors. These issues have become major obstacles in training GEC systems. Recently, the
performance of English GEC systems has drastically been enhanced by the application of deep neural
networks that generate a large amount of synthetic data from limited samples. While GEC has extensively
been studied in languages such as English and Chinese, no attempts have been made to generate synthetic
data for improving Persian GEC systems. Given the substantial grammatical and semantic differences of
the Persian language, in this paper, we propose a new deep learning framework to create large enough
synthetic sentences that are grammatically incorrect for training Persian GEC systems. A modified version
of sequence generative adversarial net with policy gradient is developed, in which the size of the model is
scaled down and the hyperparameters are tuned. The generator is trained in an adversarial framework on
a limited dataset of 8000 samples. Our proposed adversarial framework achieved bilingual evaluation
understudy (BLEU) scores of 64.5% on BLEU-2, 44.2% on BLEU-3, and 21.4% on BLEU-4, and
outperformed the conventional supervised-trained long short-term memory using maximum likelihood
estimation and recently proposed sequence labeler using neural machine translation augmentation. This
shows promise toward improving the performance of GEC systems by generating a large amount of
training data.
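The BLEU-2/3/4 scores reported above rest on clipped (modified) n-gram precision. The sketch below computes that one component; full BLEU additionally combines the precisions geometrically and applies a brevity penalty. The example sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # clipped n-gram precision, the core quantity behind BLEU-n:
    # each candidate n-gram is credited at most as often as it appears
    # in the reference, which penalises degenerate repetition
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

p2 = modified_precision("the cat sat on the mat".split(),
                        "the cat sat on the mat".split(), 2)
p1 = modified_precision("the the the".split(), "the cat".split(), 1)
```

The second call shows the clipping at work: "the the the" earns only 1/3 unigram precision because "the" occurs once in the reference.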
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of the application. In this paper, deep learning techniques are used to perform NER on Hindi text, as Hindi NER is much less developed than English NER. This is a barrier for resource-scarce languages, where many resources are not readily available. Researchers have used various techniques such as rule-based, machine-learning-based and hybrid approaches to solve this problem. Deep learning algorithms are nowadays being developed at large scale as an innovative approach for advanced NER models that give the best results. In this paper we devise a novel architecture based on a residual network, preferably a Bidirectional Long Short-Term Memory (BiLSTM) with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We have run various experiments to compare the results of NER with normal embedding and fastText embedding layers, and to analyse the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with the said approach in terms of F1 score.
Genetic Approach For Arabic Part Of Speech Taggingkevig
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
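A genetic approach to tagging can be sketched as follows: individuals are candidate tag sequences, fitness rewards likely word/tag pairs and tag transitions, and selection, crossover and mutation evolve the population. The lexicon, transition scores, and GA operators below are a toy English illustration, not the paper's Arabic data or exact design.

```python
import random

random.seed(0)

TAGS = ["DET", "NOUN", "VERB"]
# toy lexical and tag-transition scores (illustrative, not from a real corpus)
LEX = {("the", "DET"): 1.0, ("dog", "NOUN"): 0.9, ("dog", "VERB"): 0.1,
       ("barks", "VERB"): 0.8, ("barks", "NOUN"): 0.2}
TRANS = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.8, ("DET", "VERB"): 0.1,
         ("NOUN", "NOUN"): 0.2, ("VERB", "NOUN"): 0.3}

def fitness(words, tags):
    # reward likely word/tag pairs and likely tag transitions
    score = sum(LEX.get((w, t), 0.01) for w, t in zip(words, tags))
    return score + sum(TRANS.get(p, 0.01) for p in zip(tags, tags[1:]))

def ga_tag(words, pop_size=40, generations=60, mutation_rate=0.2):
    pop = [[random.choice(TAGS) for _ in words] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: fitness(words, ind), reverse=True)
        survivors = pop[:pop_size // 2]            # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(words))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:    # point mutation
                child[random.randrange(len(words))] = random.choice(TAGS)
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda ind: fitness(words, ind))

best = ga_tag(["the", "dog", "barks"])
```

On this tiny example the GA converges to the sequence DET NOUN VERB, whose fitness dominates all alternatives; in the paper's setting the fitness would come from probabilities estimated on an annotated Arabic corpus.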
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
The article presents Sentiment Analysis (SA) and Tense Classification using a Skip-gram model for word-to-vector encoding of the Nepali language. The SA experiment for positive-negative classification is carried out in two ways. In the first experiment the vector representation of each sentence is generated using the Skip-gram model followed by Multi-Layer Perceptron (MLP) classification, and an F1 score of 0.6486 is achieved for positive-negative classification with an overall accuracy of 68%. In the second experiment the verb chunks are extracted using a Nepali parser and the same experiment is carried out on the verb chunks; an F1 score of 0.6779 is observed for positive-negative classification with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentence-based sentiment analysis. The paper also proposes using the Skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation of each verb chunk is generated using the Skip-gram model followed by MLP classification, and the verb chunks give a very low overall accuracy of 53%. The fourth experiment, tense classification using full sentences, results in improved efficiency with an overall accuracy of 89%. Past tenses are identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunk-based tense classification.
Improving the role of language model in statistical machine translation (Indo...IJECEIAES
Statistical machine translation (SMT) has been widely used by researchers and practitioners in recent years. SMT quality is determined by several important factors, two of which are the language model and the translation model. Research on improving the translation model has been quite extensive, but the problem of optimizing the language model for machine translation has not received much attention. In machine translators, language models usually use the trigram model as standard. In this paper, we conduct experiments with four strategies to analyse the role of the language model in an Indonesian-Javanese translation machine, and show improvement over the baseline system with the standard language model. The results of this research indicate that the use of 3-gram language models is highly recommended in SMT.
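An n-gram language model of the kind analysed above scores a word given its two predecessors; a common variant linearly interpolates unigram, bigram and trigram estimates. The two-sentence Indonesian corpus and the interpolation weights below are illustrative assumptions, not the paper's data.

```python
from collections import Counter

def train_lm(corpus):
    # unigram/bigram/trigram counts over sentences with boundary markers
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
        tri.update(zip(toks, toks[1:], toks[2:]))
    return uni, bi, tri

def prob(w, u, v, counts, lambdas=(0.1, 0.3, 0.6)):
    # linearly interpolated trigram probability P(w | u, v)
    uni, bi, tri = counts
    p1 = uni[w] / sum(uni.values())
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p1 + l2 * p2 + l3 * p3

counts = train_lm(["aku makan nasi", "aku minum teh"])
p_seen = prob("nasi", "aku", "makan", counts)
p_unseen = prob("teh", "aku", "makan", counts)
```

The unigram term keeps unseen trigrams from getting zero probability, which is the role smoothing plays in the SMT decoder's language-model component.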
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
Applying natural language processing algorithms is currently popular in legal applications, for instance document classification of legal documents, contract review and machine translation. All of the above machine learning algorithms need to encode the words in a document as vectors. The word embedding model is a modern distributed word representation approach and the most common unsupervised word encoding method; it feeds other algorithms that subsequently perform the downstream tasks of natural language processing. The most common and practical approach for evaluating the accuracy of a word embedding model uses a benchmark set with linguistic rules, or the relationships between words, to perform analogical reasoning via algebraic calculation. This paper proposes establishing a 1,256-question Legal Analogical Reasoning Questions Set (LARQS) from the 2,388 Chinese Codex corpus using five kinds of legal relations, which are then used to evaluate the accuracy of Chinese word embedding models. Moreover, we discovered that legal relations may be ubiquitous in the word embedding model.
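The "analogy reasoning via algebraic calculation" mentioned above is the familiar vector-offset trick: answer "a is to b as c is to ?" by computing b - a + c and taking the nearest neighbour. The words and 4-dimensional vectors below are invented for illustration; real LARQS entries are Chinese legal terms, and their crime-to-offender offset would be learned rather than built in.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, emb):
    # answer "a is to b as c is to ?" via the offset b - a + c,
    # then take the nearest neighbour, excluding the three query words
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

# toy "legal" embeddings with a crime -> offender offset built in
emb = {
    "theft":    [1, 0, 1, 0],
    "thief":    [1, 0, 0, 1],
    "murder":   [0, 1, 1, 0],
    "murderer": [0, 1, 0, 1],
    "contract": [0, 0, 1, 1],
}
answer = analogy("theft", "thief", "murder", emb)
```

A benchmark like LARQS scores an embedding by the fraction of such analogy questions for which the nearest neighbour is the expected word.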
A Context-based Numeral Reading Technique for Text to Speech Systems IJECEIAES
This paper presents a novel technique for context-based numeral reading in Indian-language text-to-speech systems. The model uses a set of rules to determine the context of the numeral pronunciation and is integrated with the waveform concatenation technique to produce speech from the input text in Indian languages. For this purpose, the three Indian languages Odia, Hindi and Bengali are considered. To analyse the performance of the proposed technique, a set of experiments is performed considering different contexts of numeral pronunciation, and the results are compared with an existing syllable-based technique. The results obtained from the different experiments show the effectiveness of the proposed technique in producing intelligible speech from the entered text utterances compared to the existing technique, even with much less storage and execution time.
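Context-dependent numeral rules of the kind described above can be sketched as follows. English number words stand in for the Odia/Hindi/Bengali pronunciations, and the three contexts (phone, year, cardinal) are illustrative assumptions rather than the paper's full rule set.

```python
DIGITS = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()

def number_words(n):
    # cardinal reading for 0..99, enough for the demo
    if n < 10:
        return DIGITS[n]
    if n < 20:
        return TEENS[n - 10]
    t, r = divmod(n, 10)
    return TENS[t] if r == 0 else TENS[t] + " " + DIGITS[r]

def read_numeral(token, context):
    # context decides the pronunciation: phone numbers digit by digit,
    # four-digit years in pairs, everything else as a cardinal
    if context == "phone":
        return " ".join(DIGITS[int(d)] for d in token)
    if context == "year" and len(token) == 4:
        return number_words(int(token[:2])) + " " + number_words(int(token[2:]))
    return number_words(int(token))
```

In the full system, the chosen word sequence would then be rendered by concatenating the corresponding recorded waveforms.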
Machine Learning Techniques with Ontology for Subjective Answer Evaluationijnlc
Computerized Evaluation of English Essays is performed using Machine learning techniques like Latent
Semantic Analysis (LSA), Generalized LSA, Bilingual Evaluation Understudy and Maximum Entropy.
Ontology, a concept map of domain knowledge, can enhance the performance of these techniques. Use of
Ontology makes the evaluation process holistic as presence of keywords, synonyms, the right word
combination and coverage of concepts can be checked. In this paper, the above mentioned techniques are
implemented both with and without Ontology and tested on common input data consisting of technical
answers of Computer Science. Domain Ontology of Computer Graphics is designed and developed. The
software used for implementation includes Java Programming Language and tools such as MATLAB,
Protégé, etc. Ten questions from Computer Graphics with sixty answers for each question are used for
testing. The results are analyzed and it is concluded that the results are more accurate with use of
Ontology.
A prior case study of natural language processing on different domain IJECEIAES
In the present state of the digital world, computer machines do not understand humans' ordinary language. This is a great barrier between humans and digital systems. Hence, researchers have sought advanced technology that provides information to users from the digital machine. Natural language processing (NLP) is a branch of AI that has significant implications for the ways computer machines and humans can interact. NLP has become an essential technology in bridging the communication gap between humans and digital data. Thus, this study presents the necessity of NLP in the current computing world along with different approaches and their applications. It also highlights the key challenges in the development of new NLP models.
Developing automatic parts-of-speech (POS) tagging for any new language is considered a necessary step before further computational linguistics methodology beyond tagging, like chunking and parsing, can be fully applied to the language. Many POS disambiguation technologies have been developed for this type of research, and several factors influence the choice among them; approaches can be either corpus-based or non-corpus-based. In this paper, we present a review of POS tagging technologies.
MULTILINGUAL SPEECH TO TEXT USING DEEP LEARNING BASED ON MFCC FEATURESmlaij
The proposed methodology presented in the paper deals with solving the problem of multilingual speech
recognition. Current text and speech recognition and translation methods have a very low accuracy in
translating sentences which contain a mixture of two or more different languages. The paper proposes a
novel approach to tackling this problem and highlights some of the drawbacks of current recognition and
translation methods.
Similar to An exploratory research on grammar checking of Bangla sentences using statistical language models (20)
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
Fossil fuel consumption increased quickly, contributing to climate change that is evident in unusual flooding, droughts, and global warming. Over the past ten years, women's involvement in society has grown dramatically, and they have succeeded in playing a noticeable role in reducing climate change. A bibliometric analysis of data from the last ten years has been carried out to examine the role of women in addressing climate change. The analysis's findings are discussed in relation to the sustainable development goals (SDGs), particularly SDG 7 and SDG 13. The results consider contributions made by women in various sectors while taking geographic dispersion into account. The bibliometric analysis delves into topics including women's leadership in environmental groups, their involvement in policymaking, their contributions to sustainable development projects, and the influence of gender diversity on attempts to mitigate climate change. The study's results highlight how women have influenced policies and actions related to climate change, point out areas of research deficiency, and give recommendations on how to increase the role of women in addressing climate change and achieving sustainability. To achieve more successful results, this initiative aims to highlight the significance of gender equality and encourage inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from grid-connected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
2. Int J Elec & Comp Eng ISSN: 2088-8708
An exploratory research on grammar checking of Bangla… (Md. Riazur Rahman)
3245
is data sparseness, in which case LMs fail to estimate accurate probabilities due to limited training data.
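The failure mode can be seen in a few lines: with unsmoothed maximum-likelihood estimates, a single unseen 𝑛-gram drives the probability of an entire sentence to zero. The following is a minimal sketch over a hypothetical toy corpus, standing in for limited training data; it is illustrative only, not the setup of any cited work.

```python
from collections import Counter
from math import prod

def mle_sentence_prob(sentence, uni, bi):
    """Unsmoothed MLE sentence probability as a product of bigram
    probabilities; one unseen bigram zeroes the whole product."""
    toks = ["<s>"] + sentence + ["</s>"]
    return prod(
        bi[(h, w)] / uni[h] if uni[h] else 0.0
        for h, w in zip(toks, toks[1:])
    )

# Hypothetical toy training corpus (two sentences)
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = Counter(), Counter()
for sent in train:
    toks = ["<s>"] + sent + ["</s>"]
    uni.update(toks)
    bi.update(zip(toks, toks[1:]))

print(mle_sentence_prob(["the", "cat", "sat"], uni, bi))  # 0.5: all bigrams seen
print(mle_sentence_prob(["the", "cat", "ran"], uni, bi))  # 0.0: "cat ran" unseen
```

A perfectly plausible sentence is assigned probability zero merely because one of its bigrams never occurred in training, which is exactly the problem smoothing addresses.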
Smoothing [13] is a technique that resolves this problem by adjusting the maximum likelihood estimator
to compensate for data sparseness. In practice, LMs are usually implemented in conjunction with
smoothing techniques for better performance. There are many smoothing techniques available out of which
Witten-Bell (WB) [14] and Kneser-Ney (KN) [15] are by far the two most effective and widely used
smoothing techniques.
A number of good works have been done in Bangla across different NLP problem domains, e.g.
autocomplete [16], autocorrection of spelling [17] and word prediction [18]. Furthermore, there has been much
development in grammar checking research in many different languages. Nevertheless, although Bangla is one
of the ten most spoken languages in the world [19], there has been little development in Bangla language
processing, especially in grammar checking. Though some efforts have been made, there is still plenty of room
for improvement. In [20] the authors presented an n-gram LM to design a Bangla grammar checker, where
the 𝑛-gram probability distributions of parts-of-speech (POS) tags of words are used as feature. A sentence is
detected as grammatically correct if the product of all the 𝑛-grams in the sentence is greater than zero
otherwise incorrect. Due to this, their method suffers from the data sparsity problem, which severely
degrades the performance of the system. Moreover, they used a very small corpus of only 5000 words to
build the 𝑛-gram model and tested the model on a test set of simple sentences. The authors in [21] presented
another 𝑛-gram based statistical technique for grammar checking. Rather than using probability of POS tags
of words this time 𝑛-gram probability distribution of words is used to train and test the system. To deal with
sparsity problem of 𝑛-gram models, they used WB smoothing with their 𝑛-gram model. They trained their
statistical 𝑛-gram model with a small experimental corpus of 1 million words with a test set of 1000 correct
and 1000 incorrect sentences. However, their approach did not clarify how the threshold between correct and
incorrect sentences is determined, which makes the method impractical. Moreover, in our previous work [22],
a statistical method was proposed which used an n-gram based LM combined with WB smoothing and a backoff
technique to determine the grammatical correctness of simple Bangla sentences, with promising results.
Nevertheless, there is still room for improvement, and further analysis is required to find
an enhanced, robust and well performing statistical grammar checking system for Bangla.
The issues and facts mentioned above motivated this work, in which a comprehensive comparative
study on the performance of WB and KN smoothing based LMs for the purpose of grammar checking of
Bangla sentences has been performed to find the best possible LM, settings and methods for the development
of a more accurate and robust grammar checker for Bangla. The presented technique was trained on a large
Bangla corpus of 20 million words collected from various online newspapers. An improved strategy is
proposed to determine an appropriate threshold to distinguish between grammatical and ungrammatical
sentences. The threshold was finalized in two stages, by performing cross validation on the training set and
then testing on a separate validation set, to ensure optimality. The proposed method was tested on
an updated, realistic and challenging test set of 15000 correct and 15000 incorrect sentences consisting of all
kinds of simple and complex sentences of varying lengths. The rest of the paper is organized as follows:
section 2 presents some theoretical background on n-gram based sentence probability calculation, section 3
describes the methodology used for developing the system, section 4 presents the experimental results and
section 5 concludes the paper.
2. STATISTICAL LANGUAGE MODELING
N-gram LMs are among the most widely used statistical methods for solving various NLP problems.
2.1. N-gram language models
A language model (LM) is a probability distribution over all possible sentences or strings in
a language. Let’s assume that S denotes a sentence consisting of a specified sequence of words such that
S = w1 w2 w3… wk. An n-gram LM considers the word sequence or sentence to be a Markov process [23].
Its probability is calculated as,
P(S) = ∏_{i=1}^{k} P(w_i | w_{i-n+1} … w_{i-1})    (1)
where 𝑛 refers to the order of the Markov process. When 𝑛 = 3 we call it a trigram LM which is estimated
using information about the co-occurrence of 3-tuples of words. The probability of 𝑃(𝑤𝑖|𝑤𝑖−𝑛+1 … 𝑤𝑖−1) can
be calculated as,
P(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / Σ_w C(w_{i-n+1} … w_{i-1} w)    (2)
3. ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 10, No. 3, June 2020 : 3244 - 3252
3246
where C(w_{i-n+1} … w_i) is the count of occurrences of the word sequence w_{i-n+1} … w_i and
Σ_w C(w_{i-n+1} … w_{i-1} w) is the sum of the counts of all the n-grams that start with w_{i-n+1} … w_{i-1}.
For example, let us consider the following Bangla sentence,
কাদের একটি আম খেদেদে [english] Kader ate a mango
(Kader ekti aam kheychey)
The probability of this sentence can be calculated using bigram LM with (1) as,
P(কাদের একটি আম খেদেদে) = P(কাদের|<s>) * P(একটি|কাদের) * P(আম|একটি) * P(খেদেদে|আম) * P(</s>|খেদেদে)
For the same English sentence,
P(Kader ate a mango) = P(Kader|<s>) * P(ate|Kader) * P(a|ate) * P(mango|a) * P(</s>|mango)
In practice, to calculate the probability of a sentence a start token <s> and an end token </s> are used to
indicate the start and end of the sentence respectively.
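The chain of bigram probabilities above can be sketched in Python. This is a minimal maximum-likelihood sketch under the simplest assumptions (whitespace tokenization, no smoothing); the corpus and function names are illustrative, not from the paper.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi

def bigram_sentence_prob(sentence, uni, bi):
    """Sentence probability under equation (1) with n = 2 (no smoothing)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if uni[w1] == 0:   # unseen history word: probability collapses to zero
            return 0.0
        prob *= bi[(w1, w2)] / uni[w1]
    return prob
```

On a one-sentence corpus ["Kader ate a mango"], the sentence itself scores 1.0, while any sentence containing an unseen bigram scores 0, which is exactly the sparsity problem discussed in the next subsection.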
2.2. Data sparsity problem
For any 𝑛-gram that appeared an adequate number of times, we might have a good estimate of its
probability. But because any corpus is limited, some perfectly acceptable word sequences are bound to be
missing from it. That means, there will be many cases in which correct 𝑛-gram sequences will be assigned
zero probability. For example, suppose that in the training set the bigram একটি(ekti) আম(aam) occurs 5 times,
while the equally correct bigram একটি(ekti) আদেল(apple) has zero occurrences. Now suppose
we have the following sentence in the test set,
কাদের একটি আদেল খেদেদে [english] Kader ate an apple
(Kader ekti apple kheychey)
Since the bigram একটি(ekti) আদেল(apple) has zero count in the training corpus, in the bigram
model the probability will be zero as P(আদেল(apple)|একটি(ekti)) = 0. Consequently, the probability of
the sentence will be, P(কাদের একটি আদেল খেদেদে) = 0. This probability will be zero since according
to (1) the sentence probability is calculated by multiplying the constituent 𝑛-gram probabilities and if one of
them is zero then the total probability will be zero. Therefore, these zero-frequency n-gram sequences that do
not occur in the training data but appear in the test set pose a great problem for simple n-gram models in
accurately estimating sentence probabilities.
2.3. Smoothing
Smoothing techniques are used to keep an LM from assigning zero probability to unseen word
sequences, and they have become an indispensable part of any LM. In this work, we utilized the two most
widely used smoothing algorithms for language modelling, namely Witten-Bell (WB) smoothing and
Kneser-Ney (KN) smoothing. Smoothing techniques are often implemented in conjunction with two useful
strategies that take advantage of the lower order n-grams when the higher order n-grams yield zero or low
probabilities: backoff [24] and interpolation [25].
2.4. Witten-Bell smoothing
Witten-Bell (WB) smoothing uses the counts of word sequences occurring once to estimate
the counts of zero-frequency word sequences. Originally, the WB smoothing algorithm was formulated as
an instance of linear interpolation taking advantage of lower order n-gram counts.
P_WB-interp(w_i | w_{i-n+1} … w_{i-1}) = λ(w_{i-n+1} … w_{i-1}) P(w_i | w_{i-n+1} … w_{i-1})
+ [1 − λ(w_{i-n+1} … w_{i-1})] P_WB-interp(w_i | w_{i-n+2} … w_{i-1})    (3)
Here, 1 − λ(w_{i-n+1} … w_{i-1}) is the total probability mass that is discounted to all the zero-count
n-grams and λ(w_{i-n+1} … w_{i-1}) is the leftover probability mass for all non-zero count n-grams. With
a little adjustment, WB smoothing can be implemented as an instance of a backoff language model.
The backoff version of WB smoothing can be written as:
P_WB-bo(w_i | w_{i-n+1} … w_{i-1}) =
    { λ(w_{i-n+1} … w_{i-1}) P(w_i | w_{i-n+1} … w_{i-1}),               if C(w_{i-n+1} … w_i) > 0
    { [1 − λ(w_{i-n+1} … w_{i-1})] P_WB-bo(w_i | w_{i-n+2} … w_{i-1}),   otherwise    (4)
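As an illustration, equation (3) can be sketched for the bigram case in Python. This is a minimal sketch under the common WB choice λ(h) = c(h) / (c(h) + T(h)), where T(h) is the number of distinct word types observed after history h, interpolating with a simple unigram distribution as the lower-order model; all names are illustrative.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(sentences):
    """Train an interpolated Witten-Bell bigram model (sketch)."""
    uni, bi = Counter(), Counter()
    followers = defaultdict(set)   # distinct word types seen after each history
    total = 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        total += len(tokens)
        uni.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            bi[(w1, w2)] += 1
            followers[w1].add(w2)

    def prob(w, h):
        """lambda(h) * P_ML(w|h) + (1 - lambda(h)) * P_uni(w), equation (3) for n = 2."""
        p_uni = uni[w] / total     # unigram estimate as the lower-order model
        t = len(followers[h])      # T(h): distinct continuations of h
        c = uni[h]                 # count of the history h
        if c + t == 0:             # completely unseen history: back off fully
            return p_uni
        lam = c / (c + t)
        p_ml = bi[(h, w)] / c
        return lam * p_ml + (1 - lam) * p_uni

    return prob
```

Even a bigram that never occurred in training, such as P(c | b) on the two-sentence corpus ["a b", "a c"], now receives a small non-zero probability from the unigram term.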
2.5. Kneser-Ney smoothing
In Kneser-Ney (KN) smoothing, the lower-order distribution that one combines with the higher-order
distribution is built on the intuition that the probability of a word should be calculated based on the number
of different words it follows, rather than proportionally to its number of occurrences. In their original
definition, Kneser and Ney defined KN smoothing as a backoff language model combining lower order
models with the higher order model using the backoff strategy:
P_KN-bo(w_i | w_{i-n+1} … w_{i-1}) =
    { max{C(w_{i-n+1} … w_i) − d, 0} / C(w_{i-n+1} … w_{i-1}),           if C(w_{i-n+1} … w_i) > 0
    { λ(w_i | w_{i-n+1} … w_{i-1}) P_KN-bo(w_i | w_{i-n+2} … w_{i-1}),   otherwise    (5)
where λ(w_i | w_{i-n+1} … w_{i-1}) represents the backoff weight assigned to the lower order n-grams, which
determines the impact of the lower order value on the result. The discount d represents the amount of counts
discounted from each higher order n-gram, and can be estimated from the total number of n-grams occurring
exactly once (n1) and exactly twice (n2) as d = n1 / (n1 + 2 n2). The probability for the lower order n-grams
can be calculated as

P_KN-bo(w_i | w_{i-n+2} … w_{i-1}) = T1+(w_{i-n+2} … w_i) / T1+(w_{i-n+2} … w_{i-1})    (6)

where T1+(w_{i-n+2} … w_i) = |{w_{i-n+1} : C(w_{i-n+1} … w_i) > 0}| and T1+(w_{i-n+2} … w_{i-1}) =
Σ_{w_i} T1+(w_{i-n+2} … w_i). With a little modification, the interpolated version of KN can be defined as
follows:
P_KN-interp(w_i | w_{i-n+1} … w_{i-1}) = max{C(w_{i-n+1} … w_i) − d, 0} / C(w_{i-n+1} … w_{i-1})
+ λ(w_i | w_{i-n+1} … w_{i-1}) P_KN-interp(w_i | w_{i-n+2} … w_{i-1})    (7)
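A minimal bigram sketch of interpolated KN, covering equations (6) and (7), might look as follows. It assumes a fixed discount d = 0.75 instead of the n1/(n1 + 2 n2) estimate described above, and all names are illustrative.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, d=0.75):
    """Interpolated Kneser-Ney bigram sketch with a fixed discount d."""
    bi = Counter()
    hist = Counter()                  # token count of each history word
    continuations = defaultdict(set)  # distinct histories each word follows
    followers = defaultdict(set)      # distinct words following each history
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bi[(w1, w2)] += 1
            hist[w1] += 1
            continuations[w2].add(w1)
            followers[w1].add(w2)
    total_bigram_types = len(bi)

    def prob(w, h):
        # Continuation probability, equation (6): in how many distinct
        # contexts does w appear, relative to all bigram types?
        p_cont = len(continuations[w]) / total_bigram_types
        c_h = hist[h]
        if c_h == 0:                  # unseen history: lower-order term only
            return p_cont
        # Discounted higher-order term plus interpolation weight, equation (7)
        p_high = max(bi[(h, w)] - d, 0) / c_h
        lam = d * len(followers[h]) / c_h
        return p_high + lam * p_cont

    return prob
```

A useful sanity check is that, for any seen history, the probabilities over the full vocabulary sum to one, since the mass removed by the discount is exactly redistributed through the continuation distribution.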
3. PROPOSED GRAMMAR CHECKING METHODOLOGY
In this section we present the grammar checking methodology that we used to evaluate and analyse
the performances of smoothing algorithms. It is an updated version of the grammar checker we presented and
described in our previous work. The overall framework or workflow of the system is depicted in Figure 1.
The working procedure of the grammar checker consists of three main phases: Training phase, validation
phase and testing phase.
Figure 1. Workflow diagram for the proposed grammar checker
The training process in the proposed system starts by accepting the training corpus and the 𝑛 value
as input. After accepting the input text and the 𝑛 value, possible 𝑛-gram patterns of words are extracted and
frequencies of 𝑛-grams are then calculated. Using these 𝑛-gram frequencies LMs are trained based on
the algorithms discussed in the previous sections. In the validation phase, the best possible threshold is
calculated for separating the correct and incorrect sentences. The validation process starts by accepting
a validation or heldout set consisting of a set of correct and incorrect test sentences. Then the probabilities of
these test sentences are calculated and a threshold value is determined that best separates the grammatical and
ungrammatical sentences. To do so, we first need to define a method to calculate the sentence probability
properly, which is discussed next.
3.1. Calculation of sentence probability
The sentence probability in 𝑛-gram LMs is usually calculated using (1) by first finding
the constituent 𝑛-grams in the sentence as shown in section 2.1. Since probabilities are by definition less than
or equal to 1, the more probabilities we multiply together, the smaller the product becomes. As a result,
sentence length (i.e. the number of word tokens in the sentence) has a negative effect on the probability of
a sentence: a longer sentence tends to have a smaller probability even when its constituent n-grams have high
probabilities. So a long correct sentence might have a smaller probability than a short incorrect one. To deal
with this impact of sentence length on sentence probability calculation, a new sentence probability scoring
function, defined in (8), is introduced in this work by normalizing the sentence probability in (1).
P(S) = [∏_{i=1}^{k} P(w_i | w_{i-n+1} … w_{i-1})]^{1/k}    (8)
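Equation (8) is simply the geometric mean of the k constituent n-gram probabilities. A small sketch, computed in log space to avoid the floating-point underflow a direct product would suffer on long sentences:

```python
import math

def normalized_sentence_score(ngram_probs):
    """Length-normalized sentence score of equation (8): the geometric mean
    of the constituent n-gram probabilities."""
    k = len(ngram_probs)
    if any(p == 0 for p in ngram_probs):
        return 0.0                # a zero n-gram still zeroes the sentence
    log_sum = sum(math.log(p) for p in ngram_probs)
    return math.exp(log_sum / k)
```

With this scoring, a sentence whose n-grams all have probability 0.1 scores 0.1 whether it has 10 words or 20, so long and short sentences become directly comparable against a single threshold.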
3.2. Optimal threshold calculation
In the validation phase, the optimal threshold for the n-gram based classifier is calculated in two stages.
In the first stage, we used 10-fold cross validation on the training set, which consists of only grammatically
correct sentences. Since a correct sentence typically has a higher probability than an incorrect one, in each
fold we selected the lowest probability score among the sentences of the training part as the threshold, used
that threshold to classify the test sentences, and computed the misclassification error with that threshold.
The threshold with the minimum misclassification error is finally chosen as the final threshold. The process
is an improved version of the process we used in our previous work. It is explained in Algorithm 1.
Algorithm 1. Preliminary threshold selection from the training set in stage 1
Input: S = training data set of grammatically correct sentences;
LM = language model to be used
1. Divide the data set into 10 equal sized subsets, S = {S1, S2, …, S10}
2. Set MCRmin = 1 // the minimum misclassification rate
Set T = null // the final threshold
3. For i = 1 to 10 Do
4. Set Stest = Si and Strain = S − Si
5. Train the LM on Strain
6. t = the minimum probability score in Strain, set as the current threshold
7. Test the LM on Stest using t as the threshold
8. mcr = the misclassification rate obtained with the current threshold
9. If mcr < MCRmin then Set MCRmin = mcr and T = t
10. End For
11. Return T // T is the final threshold selected
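Algorithm 1 can be sketched as follows, under the simplifying assumption that the per-sentence probability scores have already been computed by a trained LM (so the per-fold training step reduces to taking the scores of the training part); all names are illustrative.

```python
import random

def preliminary_threshold(sentence_scores, folds=10, seed=0):
    """Stage-1 sketch: 10-fold cross-validation over the (all grammatical)
    training sentences' scores. Each fold's candidate threshold is the
    minimum score of the training part; the candidate that misclassifies
    the fewest held-out sentences wins."""
    scores = list(sentence_scores)
    random.Random(seed).shuffle(scores)
    size = len(scores) // folds
    best_t, best_mcr = None, 1.1
    for i in range(folds):
        test = scores[i * size:(i + 1) * size]
        train = scores[:i * size] + scores[(i + 1) * size:]
        t = min(train)            # lowest score among training sentences
        # every held-out sentence is grammatical, so any score below t
        # is a misclassification
        mcr = sum(s < t for s in test) / len(test)
        if mcr < best_mcr:
            best_mcr, best_t = mcr, t
    return best_t
```

On scores 0.01, 0.02, …, 1.00 the procedure settles on 0.01: every fold whose held-out part does not contain the global minimum classifies it perfectly with that minimum as the threshold.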
Though the method in the first stage works well, it introduces a lot of false positives in the final
classification. Since we use the minimum probability score of the correct (positive) sentences as the threshold,
it ensures a high number of true positives, but the threshold overlaps with a substantial number of incorrect
sentences in the probability distribution, hence the high false positive rate. To reduce the unwanted false
positives and to improve the overall classification performance, in the second stage we used a method that
gradually increases the threshold to reduce the number of false positives while also ensuring a balance between
false positives and false negatives. This method is applied on a separate validation set consisting of an equal
number of positive and negative sentences to finalize the optimal threshold. This process is explained in
Algorithm 2.
In the testing phase, the classification LMs are tested on a separate test set consisting of grammatical
and ungrammatical sentences using the optimum threshold calculated in the validation phase. If a sentence
has a probability less than the optimum threshold then it is classified as ungrammatical, otherwise
grammatical.
Algorithm 2. Optimal threshold selection from the validation set in stage 2
Input: t0 = preliminary threshold calculated from the training set using Algorithm 1;
VS = validation set;
L = corresponding true labels of the positive and negative sentences in VS;
LM = language model to be used
1. Calculate [TP, FP, TN, FN] by testing on VS with t0 as the threshold, where TP = no. of true
positives, FP = no. of false positives, TN = no. of true negatives, FN = no. of false negatives
Calculate FPR = FP/(TN+FP), FNR = FN/(TP+FN) and MCR = FPR + FNR, where FPR = false
positive rate, FNR = false negative rate and MCR = overall misclassification rate
2. Set th = t0 // th is the final threshold
Divide the range [t0, 1] into k equally spaced thresholds, THS = {t1, t2, …, tk}
3. For each threshold t in THS Do
4. Calculate [TP, FP, TN, FN] on VS with t as the threshold, and from these calculate fprt and fnrt for t
5. If fprt ≤ fnrt and MCR ≥ fprt + fnrt then
6. Set th = t, FPR = fprt, FNR = fnrt and MCR = FPR + FNR
7. End If
8. End For
9. Return th // th is the final threshold selected
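Algorithm 2 can be sketched in the same spirit, sweeping k candidate thresholds over [t0, 1] on precomputed validation scores (positive = grammatical, negative = ungrammatical); the names and the choice k = 100 are illustrative.

```python
def optimal_threshold(t0, pos_scores, neg_scores, k=100):
    """Stage-2 sketch: raise the threshold from t0 toward 1, keeping the
    candidate that balances false positives against false negatives
    without worsening the overall misclassification rate."""
    def rates(t):
        fp = sum(s >= t for s in neg_scores) / len(neg_scores)  # FPR
        fn = sum(s < t for s in pos_scores) / len(pos_scores)   # FNR
        return fp, fn

    fpr, fnr = rates(t0)
    th, mcr = t0, fpr + fnr
    step = (1.0 - t0) / k
    for i in range(1, k + 1):
        t = t0 + i * step
        fp, fn = rates(t)
        # accept t only if false positives do not exceed false negatives
        # and the combined error does not increase
        if fp <= fn and mcr >= fp + fn:
            th, mcr = t, fp + fn
    return th
```

On well-separated score distributions the sweep pushes the threshold up into the gap between the negative and positive scores, eliminating the false positives that the stage-1 minimum-score threshold lets through.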
4. RESULTS AND ANALYSIS
The main focus of this section is to investigate the performance of the grammar checking system
based on certain factors such as the smoothing algorithm used, 𝑛-gram orders, length of the target
sentences etc. To train and test the LMs we used a large corpus of 20 million words containing 181820
grammatically correct sentences. Around 80% of the corpus is used for training purposes. The validation set
consists of 20000 correct sentences and 20000 incorrect sentences. The grammatically incorrect sentences are
artificially created by inserting, deleting or replacing words in the correct sentences in the set. The test set
contains 15000 correct and 15000 incorrect sentences. In our previous work, we only tested the methods on
a test set containing only simple sentences of 5-10 words in length. This time we tested the techniques on
a more difficult and practical test set consisting of all kinds of simple, complex and compound sentences with
lengths ranging from 5 to 20 words. The experiments have been run on a machine with a 2.40GHz Intel
Core i3 processor and 12 GB of RAM, running on Microsoft Windows 8. The experimental system has been
developed using the Python programming language. The comparative performances of the LMs were evaluated
using precision, recall and f-scores. The overall performances of the different LMs based on the smoothing
techniques and n-gram order used are presented in Table 1.
Table 1 represents the results of different LMs for each metric (precision, recall & f-score) in two
columns. The gray shaded column represents the results obtained using the threshold selection method used
in our previous work [22] and the other column represents the results attained using our two stage threshold
selection procedure explained in Algorithm 1 and Algorithm 2, which is proposed in this work. Our newly
proposed two stage optimum threshold selection approach clearly provides significantly improved results for
all the LMs compared to the previous approach. It significantly increases the precision, and hence the overall
f-score, for all the LMs at the cost of a small or insignificant reduction in recall for grammatical
sentences. Similarly, for ungrammatical sentences the recall scores are significantly improved, resulting in
much improved f-scores with a negligible loss of precision. This improved performance is due to
the reduction in false positives while keeping a balance between false positives and false negatives. These
results prove the superiority of our proposed method compared to the previous one. From the newly found
results in Table 1 it is evident that, KN-interp with its 5-gram model clearly outperforms all the other LMs in
terms of precision, recall and f-score for both grammatical and ungrammatical sentences achieving highest
f-scores of 72.92% and 68.51% respectively. In terms of f-score, as we can see from Table 1, WB-backoff
produces the second best results for both grammatical and ungrammatical sentences, with the KN-backoff
model providing the third best performance. The models rank similarly in terms of precision and recall with
one or two exceptions, such as KN-backoff performing slightly better than WB-backoff on recall.
Table 1. Performances of different LMs
For each metric, the first value is obtained with threshold T1 and the second with threshold T2. T1 is the
threshold calculated using the threshold selection algorithm defined in our previous work [22]; T2 is the
threshold calculated using the two-stage threshold selection technique introduced in this work.

Model | Order | Precision (gram.) | Recall (gram.) | F-score (gram.) | Precision (ungram.) | Recall (ungram.) | F-score (ungram.)
WB-backoff | 2 | 34.56% / 42.10% | 52.67% / 51.54% | 41.74% / 46.35% | 38.69% / 37.53% | 24.89% / 29.11% | 30.29% / 32.79%
WB-backoff | 3 | 52.31% / 61.81% | 73.36% / 71.57% | 61.07% / 66.33% | 68.29% / 66.24% | 50.76% / 55.78% | 58.24% / 60.56%
WB-backoff | 4 | 55.43% / 66.48% | 75.76% / 73.55% | 64.02% / 69.84% | 72.58% / 70.40% | 56.32% / 62.91% | 63.43% / 66.45%
WB-backoff | 5 | 56.31% / 66.42% | 75.29% / 74.25% | 64.43% / 70.12% | 73.01% / 70.81% | 55.89% / 62.46% | 63.31% / 66.37%
WB-interp | 2 | 31.87% / 40.45% | 51.51% / 50.55% | 39.38% / 44.94% | 35.15% / 34.09% | 20.32% / 25.58% | 25.75% / 29.23%
WB-interp | 3 | 52.12% / 60.70% | 69.47% / 67.91% | 59.56% / 64.10% | 65.55% / 63.58% | 47.98% / 56.02% | 55.41% / 59.56%
WB-interp | 4 | 53.11% / 64.38% | 73.85% / 72.26% | 61.79% / 68.10% | 70.52% / 68.40% | 50.21% / 60.03% | 58.66% / 63.94%
WB-interp | 5 | 55.02% / 64.70% | 75.34% / 73.36% | 63.60% / 68.76% | 71.39% / 69.24% | 53.41% / 59.97% | 61.10% / 64.27%
KN-backoff | 2 | 32.12% / 38.81% | 50.12% / 48.61% | 39.15% / 43.16% | 32.22% / 31.25% | 19.97% / 23.36% | 24.66% / 26.73%
KN-backoff | 3 | 49.18% / 59.64% | 69.92% / 68.02% | 57.75% / 63.56% | 64.75% / 62.80% | 47.65% / 53.97% | 54.90% / 58.05%
KN-backoff | 4 | 51.55% / 62.61% | 76.56% / 74.84% | 61.61% / 68.18% | 70.86% / 68.73% | 50.33% / 55.31% | 58.86% / 61.29%
KN-backoff | 5 | 52.44% / 64.01% | 78.93% / 77.00% | 63.01% / 69.91% | 73.35% / 71.14% | 50.91% / 56.71% | 60.10% / 63.11%
KN-interp | 2 | 35.76% / 44.79% | 56.30% / 55.36% | 43.74% / 49.52% | 42.86% / 41.57% | 25.87% / 31.76% | 32.26% / 36.01%
KN-interp | 3 | 52.89% / 62.61% | 74.60% / 72.99% | 61.90% / 67.40% | 69.72% / 67.62% | 48.79% / 56.42% | 57.41% / 61.52%
KN-interp | 4 | 57.09% / 67.18% | 77.38% / 76.46% | 65.70% / 71.52% | 73.63% / 72.69% | 55.88% / 62.64% | 63.54% / 67.29%
KN-interp | 5 | 58.71% / 68.15% | 79.51% / 78.41% | 67.54% / 72.92% | 75.70% / 74.58% | 56.10% / 63.35% | 64.44% / 68.51%
The performances of the LMs improve with growing n-gram order, and the improvement diminishes
with each higher order. Though the performances of most of the LMs tend to increase from 4-gram to
5-gram order, the differences are insignificant. Figure 2 and Figure 3 depict this effect, presenting
the f-scores of the LMs by n-gram order for both grammatical and ungrammatical sentences. Though not
presented here, similar effects can be observed in terms of precision and recall.
Since we are using a data set consisting of sentences of varied lengths, next we examine whether
sentence length has any effect on the performances of the LMs. Figures 4 and 5 present the f-scores of
two of our best performing LMs, KN-interp and WB-backoff, by sentence length, for grammatical and
ungrammatical data respectively. From Figures 4 and 5, we find that the performances of the LMs gradually
decrease with increasing sentence length. This is understandable, since longer sentences tend to be more
complex in structure and more difficult to judge. But this degradation in performance is linear rather than
exponential, and the changes are very small. This shows the robustness of our sentence probability scoring
function defined in (8). Though not presented here, the performances of the other LMs (KN-backoff and
WB-interp) and on other metrics show similar characteristics with respect to sentence length. So, we can
conclude that the KN LM in its interpolated version, i.e. KN-interp, outperforms all the other LMs in terms
of all performance metrics. The performances of the LMs improve with higher n-gram order, with the 4-gram
and 5-gram models showing similar performances with negligible differences, and the length of the sentence
does not significantly affect the performance of the LMs.
Figure 2. Effect of 𝑛-gram order on the performances
of LMs for grammatical data
Figure 3. Effect of 𝑛-gram order on the performances
of LMs for ungrammatical data
Figure 4. Effect of sentence length on the
performances of LMs for grammatical data
Figure 5. Effect of sentence length on the
performances of LMs for ungrammatical data
5. CONCLUSION
The goal of this research was to design and develop a robust grammar checking system for the Bangla
language which can accurately judge realistic, simple and complex sentences for grammaticality. To that end,
a statistical grammar checking system based on n-gram language modelling has been designed and developed.
To achieve robust performance with n-gram models, the two most widely used smoothing techniques, namely
Kneser-Ney and Witten-Bell, were used and compared to find the best performing system. Furthermore,
the LMs' performances were tested on a newly developed challenging test set containing 30000 sentences of
all types, simple, complex and compound, to attain realistic performance results. Our experimental results
show that the Kneser-Ney interpolated smoothing based 5-gram LM outperforms the others in terms of all
the metrics, achieving f-scores of 72.92% and 68.51% for grammatical and ungrammatical data respectively.
As future work, more features such as parts-of-speech tags and other linguistic features can be added to
improve the performance of the system.
REFERENCES
[1] E. D. Liddy, "Natural language processing in Encyclopaedia of Library and Information Science," 2nd Edition,
Florida: CRC Press, pp. 1-20, 2001.
[2] Shafaf Ibrahim, Nurul Amirah Zulkifli, Nurbaity Sabri, Anis Amilah Shari, and Mohd Rahmat Mohd Noordin,
"Rice grain classification using multi-class support vector machine (SVM)," IAES International Journal of
Artificial Intelligence (IJ-AI), vol. 8, no. 3, pp. 215-220, Sep. 2019.
[3] Amirul Sadikin Md Affendi, Marina Yusoff, "Review of anomalous sound event detection approaches," IAES
International Journal of Artificial Intelligence (IJ-AI), vol. 8, no. 3, pp. 264-269, Sep. 2019.
[4] Cesar G. Pachon-Suescun, Carlos J. Enciso-Aragon, Robinson Jimenez-Moreno, "Robotic Navigation Algorithm
with Machine Vision," International Journal of Electrical and Computer Engineering (IJECE), vol. 10, no. 2,
pp. 1308-1316, Apr. 2020.
[5] K.A.F.A. Samah, I.M. Badarudin, E.E. Odzaly, K.N. Ismail, N.I.S. Nasarudin, N.F. Tahar, M.H. Khairuddin,
"Optimization of house purchase recommendation system (HPRS) using genetic algorithm," Indonesian Journal of
Electrical Engineering and Computer Science (IJEECS), vol. 16, no. 3, pp. 1530-1538, Dec. 2019.
[6] Vernon A., "Computerized grammar checkers 2000: Capabilities, limitations, and pedagogical possibilities,"
Computers and Composition, vol. 17, no. 3, pp. 329-49, Dec. 2000.
[7] Richardson, S., "Microsoft natural language understanding system and grammar checker," In Fifth Conference on
Applied Natural Language Processing: Descriptions of System Demonstrations and Videos, pp. 20-20, 1997
[8] Arppe A., "Developing a grammar checker for Swedish," In Proceedings of the 12th Nordic Conference of
Computational Linguistics (NODALIDA 1999), pp. 13-27, 2000.
[9] Shaalan KF., "Arabic GramCheck: A grammar checker for Arabic," Software: Practice and Experience, vol. 35,
no. 7, pp. 643-65, Jun. 2005
[10] Bopche L, Dhopavkar G, Kshirsagar M., "Grammar Checking System Using Rule Based Morphological Process for
an Indian Language," In Global Trends in Information Systems and Software Applications, Springer, Berlin,
Heidelberg, pp. 524-531, 2012.
[11] Jensen K, Heidorn GE, Richardson SD, "Natural language processing: the PLNLP approach," Springer Science &
Business Media, Dec. 2012.
[12] Manning CD, Raghavan P, Schutze H., "Introduction to Information Retrieval," Cambridge University Press,
Ch. 20, pp. 405-416, 2008.
[13] Martin JH, Jurafsky D., "Speech and language processing: An introduction to natural language processing,
computational linguistics, and speech recognition," Pearson/Prentice Hall, 2009.
[14] Witten IH, Bell TC., "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text
compression," IEEE transactions on information theory, vol. 37, no. 4, pp. 1085-94, Jul. 1991.
[15] Kneser R, Ney H., "Improved backing-off for m-gram language modeling," in 1995 International Conference on
Acoustics, Speech, and Signal Processing (ICASSP-95), IEEE, vol. 1, pp. 181-184, May 1995.
[16] Md. Iftakher Alam Eyamin, Md. Tarek Habib, M. Ifte Khairul Islam, Md. Sadekur Rahman, Md. Abbas Ali Khan,
"An Investigative Design of Optimum Stochastic Language Model for Bangla Autocomplete," Indonesian Journal
of Electrical Engineering and Computer Science (IJEECS), vol. 13, no. 2, pp. 671-676, 2019.
[17] Muhammad Ifte Khairul Islam, Md. Tarek Habib, Md. Sadekur Rahman and Md. Riazur Rahman, "A Context-
Sensitive Approach to Find Optimum Language Model for Automatic Bangla Spelling Correction," International
Journal of Advanced Computer Science and Applications, vol. 9, no. 11, pp. 184-191, 2018.
[18] Md. Tarek Habib, Abdullah Al-Mamun, Md. Sadekur Rahman, Shah Md. Tanvir Siddiquee and Farruk Ahmed,
"An Exploratory Approach to Find a Novel Metric Based Optimum Language Model for Automatic Bangla Word
Prediction," International Journal of Intelligent Systems and Applications (IJISA), vol. 10, no. 2, pp. 47-54, 2018.
[19] J. Lane, “The 10 Most Spoken Languages in the World,” Babbel Magazine, 2019. [Online], Available:
https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world, [Accessed: 18 March 2018].
[20] Alam M. Jahangir, Naushad UzZaman, and Mumit Khan, "N-gram based Statistical Grammar Checker for
Bangla and English," In Proceeding of ninth International Conference on Computer and Information Technology
(ICCIT 2006), 2006.
[21] Nur Hossain Khan M, Khan F, Islam MM, Rahman MH, Sarke B., "Verification of Bangla Sentence Structure
using N-Gram," Global Journal of Computer Science and Technology, May 2014.
[22] Rahman MR, Habib MT, Rahman MS, Shuvo SB, Uddin MS., "An Investigative Design Based Statistical
Approach for Determining Bangla Sentence Validity," International Journal of Computer Science and Network
Security (IJCSNS), vol. 16, no. 11, pp. 30, Nov. 2016.
[23] Charniak E., "Statistical language learning," MIT press, 1996.
[24] Katz S., "Estimation of probabilities from sparse data for the language model component of a speech recognizer,"
IEEE transactions on acoustics, speech, and signal processing, vol. 35, no. 3, pp. 400-1, Mar. 1987.
[25] Jelinek F., "Interpolated estimation of Markov source parameters from sparse data," In Proc. Workshop on Pattern
Recognition in Practice, 1980.