Much work on automatic translation has addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair because of its parallel data scarcity. In particular, no benchmark parallel Amharic-Arabic text corpus is available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation available on Tanzile. Experiments are carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM-based and GRU-based NMT models and the Google Translation system are compared, and the LSTM-based OpenNMT model is found to outperform the GRU-based OpenNMT model and the Google Translation system, with BLEU scores of 12%, 11%, and 6%, respectively.
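The abstract describes an attention-based encoder-decoder in which the recurrent cell is swapped between LSTM and GRU. Below is a minimal PyTorch sketch of that idea, not the paper's actual OpenNMT configuration; all layer sizes, vocabulary sizes, and names are illustrative assumptions.

```python
# Minimal sketch of an attention-based encoder-decoder where the recurrent
# cell can be switched between LSTM and GRU. Sizes are illustrative only.
import torch
import torch.nn as nn


class Seq2SeqAttention(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512, cell="LSTM"):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "LSTM" else nn.GRU
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = rnn_cls(emb, hid, batch_first=True)
        self.decoder = rnn_cls(emb, hid, batch_first=True)
        self.out = nn.Linear(hid * 2, tgt_vocab)  # [decoder state; context]

    def forward(self, src, tgt):
        enc_out, enc_state = self.encoder(self.src_emb(src))     # (B, S, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), enc_state)  # (B, T, H)
        # Dot-product attention over encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))     # (B, T, S)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_out)                    # (B, T, H)
        return self.out(torch.cat([dec_out, context], dim=-1))   # (B, T, V)


# Smoke test with random token ids; a real system would train on the
# Amharic-Arabic parallel corpus and report BLEU on held-out verses.
model = Seq2SeqAttention(src_vocab=1000, tgt_vocab=1000, cell="GRU")
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```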
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALcsandit
The demand for multilingual information is becoming increasingly pressing as the number of internet users throughout the world escalates, and it creates the problem of retrieving documents in one language by specifying a query in another language. This growing demand can be addressed by designing automatic tools that accept a query in one language and retrieve the relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross-Language Information Retrieval system based on a dictionary-based approach that enables users to enter a query in Amharic and retrieve the relevant documents in both Amharic and Arabic from an Amharic-Arabic corpus.
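A rough sketch of the dictionary-based CLIR idea just described: the Amharic query is translated word-by-word with a bilingual dictionary and then matched against the indexed mixed-language collection. The dictionary entries, documents, and use of TF-IDF retrieval are illustrative assumptions, not the system's actual resources.

```python
# Dictionary-based cross-language retrieval sketch with toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

am_ar_dict = {"መጽሐፍ": "كتاب", "ሰላም": "سلام"}   # toy Amharic -> Arabic lexicon

def translate_query(amharic_query):
    terms = amharic_query.split()
    arabic_terms = [am_ar_dict.get(t, t) for t in terms]  # keep OOV terms as-is
    return " ".join(terms + arabic_terms)                 # search both languages

docs = ["سلام عليكم", "ይህ መጽሐፍ ነው", "كتاب جديد"]          # mixed Amharic/Arabic corpus
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)

query = translate_query("መጽሐፍ")
scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
ranking = sorted(zip(scores, docs), reverse=True)
print(ranking[:2])   # documents containing the Amharic or Arabic term rank first
```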
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With a one-pass recognition system, the spoken sounds are converted into phonetically arranged sound sequences. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic information of mixed-language speech based on recognized phone sequences. As the back-end decision is taken by an SVM, the likelihood scores of segments with monolingual phone occurrences are used to classify language identity. The speech corpus was tested on Sepedi and English, languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with a data-driven phone merging approach modelled using 16 Gaussian mixtures per state. The proposed systems achieved acceptable ASR and LID accuracy on code-switched and monolingual speech segments, respectively.
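A hedged sketch of the SVM back-end step the abstract describes: recognized phone strings are turned into phonotactic n-gram features and a linear SVM assigns a language identity to each segment. The phone sequences and labels below are invented placeholders, not data from the Sepedi-English corpus.

```python
# Phonotactic LID back-end: phone n-gram counts + linear SVM (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Each training item is a decoded phone sequence for a monolingual segment.
phone_seqs = ["l e g o a d i m o", "dh ax k ae t s ae t",
              "m o s h o m o", "w eh n t hh ow m"]
labels = ["sepedi", "english", "sepedi", "english"]

# Unigram/bigram counts over phones approximate each language's phonotactics.
lid = make_pipeline(
    CountVectorizer(token_pattern=r"(?u)\S+", ngram_range=(1, 2)),
    SVC(kernel="linear"),
)
lid.fit(phone_seqs, labels)
print(lid.predict(["d i m o l e g o"]))   # language decision for a new segment
```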
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, text containing a mix of English and Bangla words and phrases has become increasingly common. Existing transliteration tools perform poorly on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can satisfactorily transliterate both English words and phonetically typed Bangla words. This is achieved by adopting a hybrid approach combining dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
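To make the dictionary-then-rules hybrid concrete, here is a small sketch: known words come from a lookup table and everything else is converted with character rewrite rules, with a plain copy as fallback. The mappings, rules, and the exact staging are illustrative assumptions, not THT's actual three stages.

```python
# Hybrid transliteration sketch: dictionary lookup, then greedy rewrite rules.
translit_dict = {"school": "স্কুল", "mobile": "মোবাইল"}        # dictionary stage
rules = [("kh", "খ"), ("sh", "শ"), ("aa", "া"), ("k", "ক"),
         ("b", "ব"), ("a", "আ"), ("i", "ি"), ("l", "ল")]       # rule stage

def transliterate(word):
    if word.lower() in translit_dict:                 # dictionary hit
        return translit_dict[word.lower()]
    out, i = "", 0
    while i < len(word):                              # greedy rule application
        for src, tgt in rules:
            if word.lower().startswith(src, i):
                out += tgt
                i += len(src)
                break
        else:                                         # fallback: copy unknown chars
            out += word[i]
            i += 1
    return out

print([transliterate(w) for w in "school khabar".split()])
```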
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method for developing neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information available for segmentation. For the morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning a model trained on ordinary Japanese text and examine the influence of training data drawn from texts of various genres.
Hybrid part of-speech tagger for non-vocalized arabic textijnlc
Part-of-speech (POS) tagging has a crucial role in different fields of natural language processing (NLP), including speech recognition, natural language parsing, information retrieval, and multi-word term extraction. This paper proposes an efficient and accurate POS tagging technique for the Arabic language using a hybrid approach. Due to the ambiguity issue, the Arabic rule-based method suffers from misclassified and unanalyzed words. To overcome these two problems, we propose a Hidden Markov Model (HMM) integrated with the Arabic rule-based method. Our POS tagger generates a set of three POS tags: Noun, Verb, and Particle. The proposed technique uses the contextual information of the words together with a variety of features that help predict the POS classes. To evaluate its accuracy, the proposed method has been trained and tested on two corpora of undiacritized Classical Arabic: the Holy Quran Corpus and the Kalimat Corpus. The experimental results demonstrate the efficiency of our method for Arabic POS tagging. On the Holy Quran Corpus, the obtained accuracy rates are 97.6%, 96.8%, and 94.4% for our Hybrid Tagger, the HMM Tagger, and the Rule-Based Tagger, respectively. On the Kalimat Corpus, we obtained 94.60%, 97.40%, and 98% for the Rule-Based Tagger, the HMM Tagger, and our Hybrid Tagger, respectively.
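A minimal sketch of the statistical half of such a hybrid tagger: a tiny HMM over the three tags named above (Noun, Verb, Particle) decoded with the Viterbi algorithm. The probabilities and example words are made up; a real tagger estimates them from a tagged corpus and falls back on rule-based analysis for unknown or ambiguous words.

```python
# Viterbi decoding for a toy 3-tag Arabic HMM (all probabilities invented).
import math

tags = ["NOUN", "VERB", "PART"]
start = {"NOUN": 0.5, "VERB": 0.3, "PART": 0.2}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.5, "PART": 0.2},
         "VERB": {"NOUN": 0.6, "VERB": 0.1, "PART": 0.3},
         "PART": {"NOUN": 0.7, "VERB": 0.2, "PART": 0.1}}
emit = {"NOUN": {"كتاب": 0.4, "ولد": 0.4, "في": 0.2},
        "VERB": {"قرأ": 0.7, "كتب": 0.3},
        "PART": {"في": 0.8, "قد": 0.2}}

def viterbi(words):
    # Each cell holds (log probability of best path, best tag sequence so far).
    V = [{t: (math.log(start[t]) + math.log(emit[t].get(words[0], 1e-6)), [t])
          for t in tags}]
    for w in words[1:]:
        V.append({t: max(
            (V[-1][p][0] + math.log(trans[p][t]) + math.log(emit[t].get(w, 1e-6)),
             V[-1][p][1] + [t]) for p in tags) for t in tags})
    return max(V[-1].values())[1]

print(viterbi(["ولد", "قرأ", "كتاب"]))   # -> ['NOUN', 'VERB', 'NOUN']
```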
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language which is the second most spoken language in Africa and do not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector designed operates essentially on Mijinguini’s dictionary and characteristics of the Hausa alphabet. After a brief review on spell checking and spell correcting techniques and the state of art in the Hausa language processing, we opted for the data structures trie and hash table to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The implementation of the spell corrector has been made on a special editor developed for that purpose (LyTexEditor) but also as an extension (add-on) for OpenOffice.org. A comparison was made on the performance of the two data structures used.
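A compact sketch of the dictionary-plus-edit-distance scheme described above, using a hash set for lookup and a standard dynamic-programming edit distance. The tiny Hausa word list stands in for Mijinguini's dictionary and is purely illustrative.

```python
# Spell correction sketch: dictionary lookup, then nearest word by edit distance.
dictionary = {"gida", "ruwa", "yara", "hanya", "littafi"}

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, max_dist=2):
    if word in dictionary:                              # detection: known word
        return word
    dist, best = min((edit_distance(word, w), w) for w in dictionary)
    return best if dist <= max_dist else word           # suggest closest entry

print(correct("giad"))   # -> 'gida'
```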
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis because each character of the alphabet has different presentational forms. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by software developers.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...kevig
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character, and show that they capture linguistic speech patterns specific to each attribute. Additionally, a classification experiment shows that the model with subword units outperformed the conventional method.
Hybrid approaches for automatic vowelization of arabic textsijnlc
Hybrid approaches for the automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys. The outputs for each word, analyzed out of context, are its different possible vowelizations. The integration of this analyzer into our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but the hidden states are the lists of possible diacritics of each word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
A Survey of Arabic Text Classification Models IJECEIAES
There is a huge amount of Arabic text available online, and it requires organization. As a result, there are many applications of natural language processing (NLP) concerned with text organization; one of them is text classification (TC). TC makes it easier to deal with unorganized text by classifying it into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for the classification of Arabic texts, where Arabic text is complex to handle because of its vocabulary. Arabic is one of the richest languages in the world, with many linguistic bases, yet research in Arabic language processing is scarce compared to English. As a result, these problems represent challenges in the classification and organization of Arabic text. Text classification helps users access documents or information that has already been assigned to one or more specific classes or categories. In addition, the classification of documents helps a search engine reduce the number of documents to consider, making it easier to search and match against queries.
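For illustration, a small pipeline of the kind such surveys typically compare: TF-IDF features with a linear classifier over Arabic text. The four labelled snippets are invented placeholders, not a real evaluation corpus, and the choice of LinearSVC is an assumption rather than a method recommended by the survey.

```python
# Toy Arabic text classification pipeline: TF-IDF + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["فاز الفريق في المباراة", "انخفضت أسعار النفط اليوم",
         "سجل اللاعب هدفين", "ارتفعت أسهم الشركات"]
labels = ["sport", "economy", "sport", "economy"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["هدف في المباراة"]))   # -> ['sport'] for this toy model
```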
These slides cover an introduction to machine translation, some techniques used in MT such as example-based MT and statistical MT, the main challenges facing machine translation, and some examples of applications that use MT.
Machine translation from English to HindiRajat Jain
Machine translation is a part of natural language processing. The suggested algorithm is a word-based algorithm. We have performed translation from English to Hindi.
Submitted by Garvita Sharma (10103467, B3) and Rajat Jain (10103571, B6).
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid ApproachIJERA Editor
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5000 languages in the world used for communication. Knowing all these languages is not a solution to the problems caused by the language barrier in communication. In this multilingual world, with a huge amount of information exchanged between various regions and in different languages in digitized form, it has become necessary to find an automated process to convert text from one language to another. Natural Language Processing (NLP) is one of the hot areas of research that explores how computers can be utilized to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterating proper nouns from Punjabi to Hindi is developed. The hybrid approach in the proposed system is a combination of direct mapping, a rule-based approach, and the Statistical Machine Translation (SMT) approach. The proposed system is tested on various proper nouns from different domains, and its accuracy is very good.
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good-quality Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the Marathi TTS system. Since Festival does not provide complete language processing support specific to various languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages. Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we have handled schwa deletion and compound word extraction. In experiments carried out to test the Marathi G2P on a text segment of 2485 words, 91.47% word phonetization accuracy is obtained. This Marathi G2P has been used for phonetizing large text corpora, which in turn are used in designing an inventory of phonetically rich sentences. The sentences ensured good coverage of the phonetically valid di-phones using only 1.3% of the complete text corpora.
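To show the two steps mentioned above in miniature, here is a hedged toy sketch: a grapheme-to-phoneme map followed by a simple word-final schwa-deletion rule. The letter inventory and the single rule are deliberately minimal assumptions; a real Marathi G2P needs a far richer rule set.

```python
# Toy G2P with word-final schwa deletion (Devanagari example).
g2p_map = {"क": "k a", "म": "m a", "ल": "l a", "र": "r a", "त": "t a"}

def g2p(word):
    phones = []
    for ch in word:
        phones.extend(g2p_map.get(ch, ch).split())
    # Schwa deletion: drop the inherent 'a' of the final consonant.
    if len(phones) >= 2 and phones[-1] == "a":
        phones = phones[:-1]
    return phones

print(g2p("कमल"))   # -> ['k', 'a', 'm', 'a', 'l']
```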
TURN SEGMENTATION INTO UTTERANCES FOR ARABIC SPONTANEOUS DIALOGUES ...ijnlc
Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications, such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic human-computer systems. In this paper, we introduce a novel approach to turn segmentation into utterances for Egyptian spontaneous dialogues and Instant Messages (IM) using a Machine Learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian dialect dialogue corpus, the system is evaluated on our own corpus, which includes 3001 turns that were collected, segmented, and annotated manually from Egyptian call centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
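One way to picture the machine-learning formulation described above: each word position in a turn is labelled boundary or no-boundary and a classifier is trained on simple lexical features. The features, English stand-in examples, and logistic regression back-end are all illustrative assumptions, not the paper's actual feature set or classifier.

```python
# Turn segmentation as per-token boundary classification (toy data).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(words, i):
    return {"word": words[i].lower(),
            "next": words[i + 1].lower() if i + 1 < len(words) else "<end>",
            "is_question_word": words[i].lower() in {"what", "how", "why"}}

# Training data: (turn text, indices after which an utterance boundary occurs).
turns = [("hello how can i help you", {0}),
         ("ok thanks bye", {0, 1})]
X, y = [], []
for text, bounds in turns:
    words = text.split()
    for i in range(len(words)):
        X.append(features(words, i))
        y.append(1 if i in bounds else 0)

seg = make_pipeline(DictVectorizer(), LogisticRegression())
seg.fit(X, y)
print(seg.predict([features("hello what is this".split(), 0)]))  # 1 = boundary
```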
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
This article describes the morphological analysis of Standard Arabic for natural language processing, as part of an electronic dictionary construction phase. A fully inflected model of three-letter verbs is formalized based on a linguistic classification, using the NOOJ platform. The classification gives certain representative verbs that are considered as lemmas; these verbs form our dictionary entries and are conjugated according to our inflection paradigm, relying on certain specific morphological properties. This dictionary will serve as an Arabic resource that will help NLP applications and the NOOJ platform analyse sophisticated Arabic corpora.
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...ijnlc
A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language that exists in machine-readable form. The scope of corpora is endless in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most NLP applications, especially for Statistical Machine Translation (SMT). SMT is the most popular approach to Machine Translation (MT) nowadays, and it can produce high-quality translations when based on a huge amount of aligned parallel text in both the source and target languages. Although Bodo is a recognized natural language of India and a co-official language of Assam, the amount of machine-readable information for the Bodo language is still very low. Therefore, to expand the computerized information for the language, an English to Bodo SMT system has been developed. This paper mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using the phrase-based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system.
Building of Database for English-Azerbaijani Machine Translation Expert SystemWaqas Tariq
In this article, the results of the development of a machine translation expert system are presented. The approach of defining translation correspondences is suggested as the basis for creating the database and knowledge base of the system. The methods of compiling transformation rules applied to the linguistic knowledge base of the expert system are based on defining translation correspondences between the Azerbaijani and English languages.
ATAR: Attention-based LSTM for Arabizi transliterationIJECEIAES
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
Arabic is the most spoken language in the Semitic language group and one of the most common languages in the world, spoken by more than 422 million people. It is also of paramount importance to Muslims: it is the sacred language of the Islamic Holy Book (the Quran), and prayer (and other acts of worship) in Islam is performed only by mastering some Arabic words. Arabic is also a major ritual language of a number of Christian churches in the Arab world, and it was also used in writing several intellectual and religious Jewish books in the Middle Ages. Despite this, there is no semantic Arabic lexicon that researchers can depend on. In this paper we introduce Azhary as a lexical ontology for the Arabic language. It groups Arabic words into sets of synonyms called synsets and records a number of relationships between words, such as synonym, antonym, hypernym, hyponym, meronym, holonym, and association relations. The ontology contains 26,195 words organized in 13,328 synsets. It has been developed and contrasted against AWN, which is the most common available Arabic lexical ontology.
XMODEL: An XML-based Morphological Analyzer for Arabic LanguageWaqas Tariq
Morphological analysis is an essential stage in language engineering applications. For the Arabic language, this stage is not easy to develop because Arabic has particularities such as the phenomenon of agglutination and a high degree of morphological ambiguity. These factors make the design of a morphological analyzer for Arabic somewhat difficult and require many other tools and treatments. The volume of the lexicon is another big problem for the morphological analysis of Arabic, and it directly affects the analysis process. In this paper we present a morphological analyzer for Modern Standard Arabic based on the Arabic Morphological Automaton technique and using a new and innovative language (XMODEL) to represent Arabic morphological knowledge in an optimal way. Both the Arabic Morphological Analyzer and the Arabic Morphological Automaton are implemented in the Java language and use XML technology. The Buckwalter Arabic Morphological Analyzer and Xerox Arabic Finite State Morphology are two of the best-known morphological analyzers for Modern Standard Arabic, and they are also available and documented. Our morphological analyzer can be exploited by Natural Language Processing (NLP) applications such as machine translation, orthographical correction, information retrieval, and both syntactic and semantic analyzers. At the end, an evaluation of Xerox and our system is performed.
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...IJECEIAES
Arabic Natural Language Processing (ANLP) is a subfield of artificial intelligence (AI) that aims to build various applications for the Arabic language, such as Arabic sentiment analysis (ASA), which is the task of classifying the feelings and emotions expressed in text in order to determine the attitude of the writer (neutral, negative, or positive). When working on ASA, researchers may use various tools in their projects without explaining the reason behind this choice, or they may choose a set of libraries according to their knowledge of a specific programming language. Because of the abundance of libraries in the ANLP field, especially in ASA, we rely on the Java and Python programming languages in our research work. This paper makes an in-depth comparative evaluation of different valuable Python and Java libraries to deduce the most useful ones for Arabic sentiment analysis (ASA). According to a large variety of influential works in the domain of ASA, we deduce that the NLTK, Gensim, and TextBlob libraries are the most useful for the Python ASA task. In connection with Java ASA libraries, we conclude that the Weka and CoreNLP tools are the most used, and they have achieved great results in this research domain.
Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model. It captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
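For readers unfamiliar with PMI weighting, the sketch below computes one common smoothed variant (context-distribution smoothing with an exponent alpha of 0.75) from toy co-occurrence counts. The paper's exact smoothing scheme, counts, and context definitions may differ; everything here is an illustrative assumption.

```python
# Smoothed PMI over toy word-context co-occurrence counts.
import math
from collections import Counter

alpha = 0.75
pair_counts = Counter({("كتب", "ctx:ktb"): 8, ("كاتب", "ctx:ktb"): 5,
                       ("كتب", "ctx:other"): 1, ("مكتبة", "ctx:ktb"): 4})

total = sum(pair_counts.values())
word_counts, ctx_counts = Counter(), Counter()
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    ctx_counts[c] += n
ctx_norm = sum(v ** alpha for v in ctx_counts.values())

def spmi(w, c):
    p_wc = pair_counts[(w, c)] / total
    p_w = word_counts[w] / total
    p_c = ctx_counts[c] ** alpha / ctx_norm   # smoothed context distribution
    return math.log(p_wc / (p_w * p_c))

print(round(spmi("كتب", "ctx:ktb"), 3))
```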
People across the globe have access to materials such as journals, articles, and adverts via the internet. However, many of these resources come in a diverse range of languages. Although the English language seems most suitable to most people, some readers believe that working on materials in one's native language is more enjoyable than in other languages. Research has shown that the Arabic language has not been prominent in terms of online materials, and the few existing resources are often ignored due to the peculiar nature of its various characters and constructs. Hence, a proper study of its relationship with the English language, with a view to bringing people closer to understanding it, becomes necessary. The system scenarios were modeled and implemented using the Unified Modeling Language and Microsoft C#, respectively, in such a way that the expected set of characters of the language of interest was automatically formed with respect to a given input. The procedural steps were properly followed in the development and running of the code using a context-free rule-based technique, with the required hardware as clearly described in the design. The system's workability was tested with different source texts as inputs, and in each case the resulting outputs were very effective with respect to the translation process. The design presented here is expected to serve as a tool for assisting beginners in these two languages and showcases a one-to-one form of correspondence; more rules and functions for ensuring a more robust system are expected in future work.
Arabic words stemming approach using arabic wordnetIJDKP
The big growth of Arabic internet content in recent years has raised the need for effective stemming techniques for the Arabic language. Arabic stemming algorithms can be grouped into three categories: the root-based approach (e.g. Khoja), the stem-based approach (e.g. Larkey), and the statistical approach (e.g. N-gram). However, no stemmer for this language is perfect: the existing stemmers have low efficiency. In this paper, we introduce a new stemming technique for Arabic words that also solves the problem of the plural form of irregular nouns in Arabic, which is called the broken plural. The proposed stem extractor provides very accurate results in comparison with other algorithms. Consequently, search effectiveness is improved.
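A hedged sketch of the broken-plural idea: strip common affixes, but first consult a table that maps irregular plurals straight to their singular form. Both tables are tiny illustrative samples and the ordering of steps is an assumption, not the paper's actual algorithm.

```python
# Toy Arabic stemmer with a broken-plural lookup before suffix stripping.
broken_plurals = {"كتب": "كتاب", "رجال": "رجل", "مدارس": "مدرسة"}
prefixes = ["ال", "و", "ب"]
suffixes = ["ات", "ون", "ين", "ة"]

def stem(word):
    for p in prefixes:                       # strip one leading particle/article
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    if word in broken_plurals:               # irregular (broken) plural lookup
        return broken_plurals[word]
    for s in suffixes:                       # strip one regular suffix
        if word.endswith(s) and len(word) - len(s) >= 3:
            return word[:-len(s)]
    return word

print(stem("الكتب"))     # -> 'كتاب' (broken plural resolved after removing 'ال')
print(stem("معلمون"))    # -> 'معلم'
```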
Automatic Phonetization-based Statistical Linguistic Study of Standard ArabicCSCJournals
Statistical studies based on automatic phonetic transcription of Standard Arabic texts are rare, and even though studies have been performed, they have been done only on one level - phoneme or syllable - and the results cannot be generalized on the language as a whole. In this paper we automatically derived accurate statistical information about phonemes, allophones, syllables, and allosyllables in Standard Arabic. A corpus of more than 5 million words, including words and sentences from both Classical Arabic and Modern Standard Arabic, has been prepared and preprocessed. We developed a software package to accomplish a rule-based automatic transcription from written Standard Arabic text to the corresponding linguistic units at four levels: phoneme, allophone, syllable, and allosyllable. After testing the software on four corpora including more than 57000 vocabulary words, and achieving a very high accuracy (> 99 %) on the four levels, we used this software as a reliable tool for the automatic transcription of the corpus used in this paper and evaluated the following: 1) the vocabulary phonemes, allophones, syllables, and allosyllables with their specific percentages in Standard Arabic. 2) the best curve equation from the distribution of phonemes, allophones, syllables, and allosyllables normalized frequencies. 3) important statistical information, such as percentage of consonants and vowels, percentage of the consonants classified by the place and way of articulation, the transition probability matrix between phonemes, and percentages of syllables according to the type of syllable, etc.
ARABIC LANGUAGE CHALLENGES IN TEXT BASED CONVERSATIONAL AGENTS COMPARED TO TH...ijcsit
This paper does not compare the Arabic and English languages as natural languages. Instead, it focuses on comparing them in terms of the challenges they pose when building text-based Conversational Agents (CAs). A CA is an intelligent computer program that is used to handle conversations between the user and the machine. Nowadays, CAs can play an important role in many areas, as this work shows. In this paper, different approaches that can be used to build a CA are differentiated. For each approach, the points of comparison between the Arabic and English languages are discussed with respect to the Arabic language.
Using automated lexical resources in arabic sentence subjectivityijaia
A common point in almost any work on sentiment analysis is the need to identify which elements of language (words) contribute to expressing subjectivity in text. The collection of these elements (sentiment words), regardless of context, together with their polarities (positive/negative), is called a sentiment lexical resource or subjective lexicon. In this paper, we investigate a method for generating a Sentiment Arabic Lexical Semantic Database using a lexicon-based approach. We also study the effect of the prior polarity of each word, using our Sentiment Arabic Lexical Semantic Database, on sentence-level subjectivity with multiple machine learning algorithms. The experiments were conducted on the MPQA corpus containing subjective and objective sentences in the Arabic language, and we were able to achieve 76.1% classification accuracy.
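A toy sketch of the lexicon-based step described above: a sentence is marked subjective if it contains entries from the sentiment lexicon, and its polarity is the sign of the summed prior polarities. The lexicon entries and threshold are invented for illustration; the paper's database and learning setup are richer.

```python
# Lexicon-based sentence subjectivity/polarity sketch (toy lexicon).
sentiment_lexicon = {"رائع": 1.0, "جميل": 0.8, "سيئ": -1.0, "فظيع": -0.9}

def classify(sentence, min_hits=1):
    hits = [sentiment_lexicon[w] for w in sentence.split() if w in sentiment_lexicon]
    if len(hits) < min_hits:
        return ("objective", "neutral")
    score = sum(hits)
    return ("subjective", "positive" if score > 0 else "negative")

print(classify("الفيلم رائع جدا"))       # -> ('subjective', 'positive')
print(classify("وصل القطار في الموعد"))  # -> ('objective', 'neutral')
```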
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and Long Short-Term Memory (LSTM) algorithms. We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
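To make the hybrid architecture concrete, here is a hedged Keras sketch of a CNN-LSTM binary intrusion classifier: 1D convolutions extract local patterns from each window of traffic features and an LSTM models their temporal order before a sigmoid attack/normal output. The input shape, layer sizes, and training settings are illustrative assumptions, not the paper's configuration.

```python
# Hybrid CNN-LSTM intrusion detector sketch (assumed input: 50 steps x 20 features).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(50, 20)),             # window of DNP3 flow features
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # intrusion probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```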
10th International Conference on Artificial Intelligence and Applications (AI...gerogepatton
10th International Conference on Artificial Intelligence and Applications (AI 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Artificial Intelligence and its applications. The Conference looks for significant contributions to all major fields of the Artificial Intelligence, Soft Computing in theoretical and practical aspects. The aim of the Conference is to provide a platform to the researchers and practitioners from both academia as well as industry to meet and share cutting-edge development in the field.
International Journal of Artificial Intelligence & Applications (IJAIA)gerogepatton
The International Journal of Artificial Intelligence & Applications (IJAIA) is a bi monthly open access peer-reviewed journal that publishes articles which contribute new results in all areas of the Artificial Intelligence & Applications (IJAIA). It is an international journal intended for professionals and researchers in all fields of AI for researchers, programmers, and software and hardware manufacturers. The journal also aims to publish new attempts in the form of special issues on emerging areas in Artificial Intelligence and applications.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks (CNNs), to adversarial attacks and presents a proactive training technique designed to counter them. We introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations. When combined with 3D convolution and deep curriculum learning optimization (CLO), it significantly improves the immunity of models against localized universal attacks, by up to 40%. We evaluate our proposed approach using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10 and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing accuracy improvements over previous techniques. The results indicate that the combination of volumetric input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating adversarial training.
3rd International Conference on Artificial Intelligence Advances (AIAD 2024)gerogepatton
3rd International Conference on Artificial Intelligence Advances (AIAD 2024) will act as a major forum for the presentation of innovative ideas, approaches, developments, and research projects in the area advanced Artificial Intelligence. It will also serve to facilitate the exchange of information between researchers and industry professionals to discuss the latest issues and advancement in the research area. Core areas of AI and advanced multi-disciplinary and its applications will be covered during the conferences.
International Journal of Artificial Intelligence & Applications (IJAIA)gerogepatton
The International Journal of Artificial Intelligence & Applications (IJAIA) is a bi monthly open access peer-reviewed journal that publishes articles which contribute new results in all areas of the Artificial Intelligence & Applications (IJAIA). It is an international journal intended for professionals and researchers in all fields of AI for researchers, programmers, and software and hardware manufacturers. The journal also aims to publish new attempts in the form of special issues on emerging areas in Artificial Intelligence and applications.
Information Extraction from Product Labels: A Machine Vision Approachgerogepatton
This research tackles the challenge of manual data extraction from product labels by employing a blend of
computer vision and Natural Language Processing (NLP). We introduce an enhanced model that combines
Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in a Convolutional
Recurrent Neural Network (CRNN) for reliable text recognition. Our model is further refined by
incorporating the Tesseract OCR engine, enhancing its applicability in Optical Character Recognition
(OCR) tasks. The methodology is augmented by NLP techniques and extended through the Open Food
Facts API (Application Programming Interface) for database population and text-only label prediction.
The CRNN model is trained on encoded labels and evaluated for accuracy on a dedicated test set.
Importantly, our approach enables visually impaired individuals to access essential information on
product labels, such as directions and ingredients. Overall, the study highlights the efficacy of deep
learning and OCR in automating label extraction and recognition.
Research on Fuzzy C- Clustering Recursive Genetic Algorithm based on Cloud Co...gerogepatton
Aiming at the problems of poor local search ability and premature convergence of the fuzzy C-cluster
recursive genetic algorithm (FOLD++), a new fuzzy C-cluster recursive genetic algorithm based on
Bayesian function adaptation search (TS) was proposed by incorporating the idea of Bayesian function
adaptation search into fuzzy C-cluster recursive genetic algorithm. The new algorithm combines the
advantages of FOLD++ and TS. In the early stage of optimization, fuzzy C-cluster recursive genetic
algorithm is used to get a good initial value, and the individual extreme value pbest is put into Bayesian
function adaptation table. In the late stage of optimization, when the searching ability of the fuzzy C-cluster recursive genetic algorithm weakens, the short-term memory function of the Bayesian function adaptation table in the Bayesian function adaptation search algorithm is utilized to make the search jump out of the local optimal solution, and bad solutions are allowed to be accepted during the search. The improved algorithm is applied to function optimization, and the simulation results show that the calculation accuracy and stability of the algorithm are improved, verifying the effectiveness of the improved algorithm.
International Journal of Artificial Intelligence & Applications (IJAIA)gerogepatton
Employee attrition refers to the decrease in staff numbers within an organization due to various reasons.
As it has a negative impact on long-term growth objectives and workplace productivity, firms have
recognized it as a significant concern. To address this issue, organizations are increasingly turning to
machine-learning approaches to forecast employee attrition rates. This topic has gained significant
attention from researchers, especially in recent times. Several studies have applied various machine-learning methods to predict employee attrition, producing different results depending on the employed
methods, factors, and datasets. However, there has been no comprehensive comparative review of multiple
studies applying machine-learning models to predict employee attrition to date. Therefore, this study aims
to fill this gap by providing an overview of research conducted on applying machine learning to predict
employee attrition from 2019 to February 2024. A literature review of relevant studies was conducted,
summarized, and classified. Most studies agree on conducting comparative experiments with multiple
predictive models to determine the most effective one. From this literature survey, the RF algorithm and
XGB ensemble method are repeatedly the best-performing, outperforming many other algorithms.
Additionally, the application of deep learning to employee attrition prediction issues also shows promise.
While there are discrepancies in the datasets used in previous studies, it is notable that the dataset
provided by IBM is the most widely utilized. This study serves as a concise review for new researchers,
facilitating their understanding of the primary techniques employed in predicting employee attrition and
highlighting recent research trends in this field. Furthermore, it provides organizations with insight into
the prominent factors affecting employee attrition, as identified by studies, enabling them to implement
solutions aimed at reducing attrition rates.
THE TRANSFORMATION RISK-BENEFIT MODEL OF ARTIFICIAL INTELLIGENCE:BALANCING RI...gerogepatton
This paper summarizes the most cogent advantages and risks associated with Artificial Intelligence from an
in-depth review of the literature. Then the authors synthesize the salient risk-related models currently being
used in AI, technology and business-related scenarios. Next, in view of an updated context of AI along with
theories and models reviewed and expanded constructs, the writers propose a new framework called “The
Transformation Risk-Benefit Model of Artificial Intelligence” to address the increasing fears and levels of
AI risk. Using the model characteristics, the article emphasizes practical and innovative solutions where benefits outweigh risks, and three use cases in healthcare, climate change/environment, and cyber security to
illustrate unique interplay of principles, dimensions and processes of this powerful AI transformational
model.
Construction of Amharic-Arabic Parallel Text Corpus for Neural Machine Translation
International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 11, No. 1, January 2020
DOI: 10.5121/ijaia.2020.11107
CONSTRUCTION OF AMHARIC-ARABIC
PARALLEL TEXT CORPUS FOR NEURAL
MACHINE TRANSLATION
Ibrahim Gashaw and H L Shashirekha
Mangalore University, Department of Computer Science, Mangalagangotri,
Mangalore-574199
ABSTRACT
Many automatic translation works have been addressed between major European language pairs, by
taking advantage of large scale parallel corpora, but very few research works are conducted on the
Amharic-Arabic language pair due to its parallel data scarcity. However, there is no benchmark parallel
Amharic-Arabic text corpora available for Machine Translation task. Therefore, a small parallel Quranic
text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation
of Amharic language text corpora available on Tanzile. Experiments are carried out on two Long Short-
Term Memory (LSTM) and Gated Recurrent Units (GRU) based Neural Machine Translation (NMT) using
Attention-based Encoder-Decoder architecture which is adapted from the open-source OpenNMT system.
LSTM and GRU based NMT models and the Google Translation system are compared, and it is found that LSTM-based
OpenNMT outperforms GRU-based OpenNMT and the Google Translation system, with BLEU scores
of 12%, 11%, and 6%, respectively.
KEYWORDS
Amharic, Arabic, Parallel Text Corpus, Neural Machine Translation, OpenNMT
1. INTRODUCTION
Construction of a parallel corpus is very challenging and requires costly human expertise.
Machine Translation (MT), the task of automatically translating texts from one natural language to
another, is an important application of Computational Linguistics (CL) and Natural Language
Processing (NLP). It can produce high-quality translation results based on a massive amount of
aligned parallel text corpora in both the source and target languages [1]. MT systems need resources
that can provide an interpretation/suggestion of the source text and a translation hypothesis. A parallel
corpus consists of parallel texts in which all occurrences of an expression in one language can be
promptly located together with the corresponding expression in the other language, and it is one of
the significant resources that can be utilized for MT tasks [2].
The overall process of invention, innovation, and diffusion of technology related to language
translation is rapidly driving the growth of the MT industry [3]. The number of Language
Service Provider (LSP) companies offering varying degrees of translation, interpretation,
localization, language, and social coaching solutions is rising along with the MT industry [3].
Today, many applications such as Google Translate and Microsoft Translator are available online
for language translation.
There are different types of MT approaches, and many researchers have classified them in
different ways [4] [5] [6]. Based on the rules and/or model used to translate from one language to
another, MT approaches can be classified into Statistical Machine Translation (SMT), Rule-based
Machine Translation (RBMT), Hybrid Machine Translation (HMT), and Neural Machine
Translation (NMT). SMT uses a statistical model based on the analysis of large volumes of
parallel text that determines the correspondence between a word from the source language and a
word from the target language. RBMT uses grammatical rules that conduct a syntactic analysis of
the source language and the target language to generate the translated sentence. However, RBMT
requires extensive proofreading and is heavily dependent on lexicons and domain experts. A
hybrid approach is a combination of the statistical method and the rule-based approach, which
includes word-based, phrase-based, syntax-based, and forest-based models. NMT is a type of MT
that depends on neural network models to develop statistical models for translation.
The deep learning NMT approach is a recent approach to MT that produces high-quality translation
results based on a massive amount of aligned parallel text corpora in both the source and target
languages. Deep learning is part of a broader family of ML methods based on Artificial Neural
Networks (ANN) [7]. It allows computational models composed of multiple processing layers to
learn representations of data with various levels of abstraction. These methods have improved the
state-of-the-art research in language translation [8]. NMT is one of the deep learning end-to-end
learning approaches to MT that uses a large ANN to predict the likelihood of a sequence of words,
typically modeling entire sentences in a single integrated model. The advantage of this approach is
that a single system can be trained directly on the source and target text, no longer requiring the
pipeline of specialized systems used in statistical MT. Many companies, such as Google, Facebook,
and Microsoft, are already using NMT technology [9].
NMT has recently shown promising results on multiple language pairs. Nowadays, it is widely
used to solve translation problems for many language pairs. However, much of the research in
this area has focused on European languages, as these languages are very rich in resources.
Several MT systems have been developed, particularly from English to other natural languages
such as Arabic, German, Chinese, French, Hindi, Japanese, Spanish, and Urdu [1]. Though a
limited amount of work has been done on different Ethiopian languages in the field of NLP, the
MT system for the Amharic-Arabic language pair is still in its infancy due to the lack of parallel
corpora. Therefore, it is essential to construct Amharic-Arabic parallel text corpora, which are very
much required to develop the Amharic to Arabic NMT system.
Amharic is the national language of Ethiopia, spoken as a mother tongue by 26.9% of Ethiopia's
population and also spoken by many people in Israel, Egypt, and Sweden. Arabic is a Semitic
language spoken by 250 million people in 21 countries as a first language and serving as a
second language in some Islamic countries. Ethiopia is one of the nations in which more than
33.3% of the population follows Islam, and they use the Arabic language to teach religion and
for communication purposes. Both of these languages belong to the Semitic family of languages,
where words are formed by modifying the root itself internally and not merely by the
concatenation of affixes to word roots [10].
NMT has many challenges, such as domain mismatch, size of training data, rare words, long
sentences, word alignment, and beam search [11], depending on the language pairs. Some of these
challenges are addressed in this paper.
2. TRANSLATION CHALLENGES OF AMHARIC AND ARABIC LANGUAGES
Amharic and Arabic are characterized by complex, productive morphology, with root-and-pattern
as the basic word-formation mechanism. The root is a sequence of consonants, and the pattern is a
sequence of Vowels (V) and Consonants (C) with open slots in it. The root combines with the
pattern through a process called interdigitation (intercalation): each letter of the root (radical)
fills a slot in the pattern. For example, the Amharic root s.b.r (sabr), denoting a notion of breaking,
combines with the pattern CVCC (consonant and vowel slots are denoted by C and V, respectively)
[12].
In addition to the unique root-and-pattern morphology, both languages are characterized by a
productive system of more standard affixation processes. These include prefixes, suffixes, infixes,
and circumfixes, which are involved in both inflectional and derivational processes. Consider the
Arabic word "وسوف يكتبونه" (wasawf yaktwubunahu) and its English translation "and they
will write it". A possible analysis of this complex word defines the stem as "aktub" (write),
with an inflectional circumfix, "y-uwna", denoting third person masculine plural, an inflectional
suffix, "ha" (it), and two prefixes, "sa" (will) and "wa" (and). Morphological analysis of words in
a text is the first stage of most natural language applications. Morphological processes define the
shape of words. They are usually classified into two types of processes [13]:
1. A derivational process deals with word-formation; such processes can create new
words from existing ones, potentially changing the category of the original word. For
example, from the Arabic root "كتب" (wrote), the following words are derived: "الكاتب"
(the writer), "الكتاب" (the book), "المكتبة" (the library), "مكتبة" (library). The same is
true in Amharic. For example, from the Amharic root "ፃፈ" (he wrote), the following
words are derived: "ፀሀፊ" (writer), "መፅሀፍ" (the book), "ቤተ መፅሃፍ" (the library).
2. Inflectional processes are usually highly productive, applying to most members of a
particular word class. For example, Amharic nouns inflect for number, so most nouns
occur in two forms, singular (which is considered the citation form) and plural, regularly
obtained by adding the suffix "ዎች" to the base form. This process makes translation
ambiguous. The considerable number of potential word types and the difficulty of
handling out-of-lexicon items (in particular, proper names) combined with prefixes or
suffixes make the computation very challenging. For example, in the word "aysäbramm"
(he doesn't break), the prefix "ay" and the suffix "amm" are out-of-lexicon items.
The main lexical challenge in building NLP systems for the Amharic and Arabic languages is the
lack of Machine Readable Dictionaries (MRD), which are vital resources. The absence of
capitalization in Arabic and Amharic makes it hard to identify proper nouns, titles, acronyms, and
abbreviations. Sentences in the Arabic language are usually long, and punctuation has little or no
effect on the interpretation of the text.
Standard preprocessing techniques such as capitalization, annotation, and normalization cannot be
performed on Amharic and Arabic due to issues of orthography. A single token in these languages
can be a sequence of more than one lexical item, and hence be associated with a sequence of tags.
For example, the Amharic word "አስፈረደችብኝ" ("asferedachibegn"), when translated to English,
is "a case she initiated against me was decided in her favor". The word is built from the causative
prefix "as" (causes), a perfect stem "ferede" (judged), a subject marker clitic "achi" (she), a
benefactive marker "b" (against), and the object pronoun "egn" (I).
Contextual analysis is essential in both languages to understand the exact meaning of some
words. For example, in Amharic, the word "ገና" can mean either "Christmas holiday" or
"waiting for something until it happens." Diacritics (vowels) are most of the time omitted from
Arabic text, which makes it hard to infer the word meaning, and complex morphological rules
have to be used to tokenize and parse the text. The corpus of the Arabic language has a bias
towards religious terminology, as a relatively high frequency of religious terms and phrases is
found. Characters are sometimes stretched for justified text, which hinders the exact match for the
same word. Synonyms are very common in Arabic. For example, "year" has three synonyms,
"عام", "حول", and "سنة", all of which are widely used in everyday communication.
Diacritization refers to the marks written above or below letters, which are used to indicate the
proper pronunciation as well as for disambiguation purposes. Their absence in Arabic texts poses a
real challenge for Arabic NLP as well as for translation, leading to high ambiguity. Though the
use of diacritics is significant for readability and understanding, they do not appear in most
printed media in Arabic regions nor on Arabic websites. They are visible in the Quran, which is
fully diacritized to prevent misinterpretation [10].
3. RELATED WORKS
Various attempts have been made in the past to construct parallel text corpora for translation tasks
from Amharic and Arabic to other languages, and to build translation models. Some of these
works are as follows:
Tesfaye Bayu [14] describes the acquisition, preprocessing, segmentation, and alignment of an
Amharic-English parallel corpus that consists of 145,820 Amharic-English parallel sentences
(segments) from various sources. This corpus is larger than the previously compiled Amharic-
English Bilingual Corpus (AEBC), hosted by the European Language Resource Association
with a size of 13,379 aligned segments (sentences), and the Low Resource Languages for
Emergent Incidents (LORELEI) corpus developed by Strassel and Tracey [15], which contains
60,884 segments.
Sakre et al. [16] present a technique aimed at constructing an Arabic-English corpus
automatically through web mining. The system crawled the host using GNU Wget in order to
obtain English and Arabic web pages and then created candidate parallel pairs of documents by
filtering them according to their similarity in path, file name, creation date, and length.
Finally, the technique measured the parallelism between the candidate pairs according to the
number of translational tokens found between an English paragraph and its three Arabic
neighbor paragraphs. However, in this work, they did not test or compare different statistical
translation models using the constructed parallel corpus.
Ahmad et al. [17] report the construction of a one-million-word English-Arabic Political Parallel
Corpus (EAPPC) that consists of 351 English and Arabic original documents and their
translations. The texts were meta-annotated, segmented, tokenized, English-Arabic aligned,
stemmed, and POS-tagged. Arabic and English data were extracted from the official website of
'His Majesty King Abdullah II' and from 'His book' and then processed from metadata
annotation to alignment. They built a parallel concordancer consisting of two parts: the
application through which the end-user interacts with the corpus and the database which stores
the parallel corpus. The experiments carried out examined the translation strategies used in
rendering a culture-specific term, and the results demonstrated the ease with which knowledge
about translation strategies can be gained from this parallel corpus.
Inoue et al. [18] describe the creation of the Arabic-Japanese portion of a parallel corpus of
translated news articles, which was collected at Tokyo University of Foreign Studies (TUFS). Part
of the corpus is manually aligned at the sentence level for development and testing. The first
results of Arabic-Japanese phrase-based MT trained on this corpus reported a BLEU score of
11.48.
Alotaibi [19] describes an ongoing project at the College of Languages and Translation, King
Saud University, to compile a 10-million-word Arabic-English parallel corpus to be used as a
resource for translation training and language teaching. The corpus has been manually verified at
different stages, including translations, text segmentation, alignment, and file preparation, to
enhance its quality. The corpus is available in XML format and through a user-friendly web
interface, which includes a concordancer that supports bilingual search queries and several
filtering options.
All the above-related works use different lexical resources, such as Machine Readable Dictionaries
(MRD) or parallel corpora, for probability assessment or translation. However, when such a lexical
resource is lacking, an alternative approach should be available [20]. Nowadays, NMT models are
widely used to solve various translation problems. Learning phrase representations using a
Recurrent Neural Network (RNN) Encoder-Decoder for statistical MT [21] benefits many natural
language-related applications as it can capture the linguistic regularities at the word level as well as
at the phrase level, but it is limited to target phrases instead of using a phrase table.
Bahdanau et al. [22] extended the NMT Encoder-Decoder that encodes a source sentence into a
fixed-length vector, which is used by a decoder to generate a translation. It automatically searches
for the parts of a source sentence that are relevant to predicting a target word, without having to
form these parts explicitly as hard segments. Their method yielded good results on longer
sentences, and the alignment mechanisms are jointly trained towards a better log-probability of
producing correct translations, which requires a high computational cost.
Almahairi et al. [23] proposed NMT for the task of Arabic translation in both directions
(Arabic-English and English-Arabic) and compared a vanilla attention-based NMT system
against a vanilla phrase-based system. Preprocessing the Arabic texts, especially normalization,
can increase the performance of the system, but the model consumes much time for training.
4. CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS
As Amharic-Arabic parallel text corpora are not available for MT tasks, we have constructed a
small parallel text corpus by modifying the existing monolingual Arabic text and its equivalent
Amharic translation available on Tanzile [24]. The Quran text corpus consists of 114 chapters,
6,236 verses, and 157,935 words. The Quran text is organized into verses (sequences of sentences
and phrases). A sample verse in Arabic and its equivalent translation in Amharic and English is
shown in Table 1. A total of 13,501 Amharic-Arabic parallel sentences have been constructed to
train the Amharic to Arabic NMT system by splitting the verses manually into separate sentences,
with Amharic as the source language and Arabic as the target language, as shown in Table 2. The
total size of the corpus is 3.2 MB, and it is split into training (80%) and test (20%) sets.
In the Arabic text corpus, since the first verse of each chapter of the Quran starts with
"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ" (in the name of Allah, the most gracious and the most merciful), it is split
into a separate line. However, in the Amharic corpus, the equivalent translation of this sentence
appears only in the first chapter. Therefore, it is added before the first line of the equivalent
translated Amharic verses.
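The following is a minimal sketch of how such a verse-aligned corpus can be paired and split into training and test sets. The file names, the one-verse-per-line layout, and the random shuffle are illustrative assumptions; the manual sentence splitting described above is not reproduced here.

import random

# Hypothetical file names; Tanzile releases are verse-aligned per language (assumption).
AMHARIC_FILE = "quran.am.txt"   # one Amharic verse per line
ARABIC_FILE = "quran.ar.txt"    # one Arabic verse per line, aligned with the above

with open(AMHARIC_FILE, encoding="utf-8") as f_am, open(ARABIC_FILE, encoding="utf-8") as f_ar:
    amharic = [line.strip() for line in f_am if line.strip()]
    arabic = [line.strip() for line in f_ar if line.strip()]

assert len(amharic) == len(arabic), "the two corpora must be verse-aligned"
pairs = list(zip(amharic, arabic))   # Amharic is the source, Arabic the target

# 80% training / 20% test split, as described above.
random.seed(1)
random.shuffle(pairs)
cut = int(0.8 * len(pairs))
splits = {"train": pairs[:cut], "test": pairs[cut:]}

for name, subset in splits.items():
    with open(f"{name}.am", "w", encoding="utf-8") as src, \
         open(f"{name}.ar", "w", encoding="utf-8") as tgt:
        for am_sentence, ar_sentence in subset:
            src.write(am_sentence + "\n")
            tgt.write(ar_sentence + "\n")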
Table 1. Sample Verse of Quran in Arabic and its equivalent translation in Amharic and English
Table 2. Split Sentences of Quran Chapter 2 Verse number 282
5. EXPERIMENTS AND RESULTS
In this work, we adopted the OpenNMT attention-based Encoder-Decoder architecture to construct
the Amharic-Arabic NMT model, because attention mechanisms are being progressively used to
enhance the performance of NMT by selectively focusing on sub-parts of the sentence during
translation [25]. As described in [26], "NMT takes a conditional language modeling view of
translation by modeling the probability of a target sentence $w_{1:T}$ given a source sentence $x_{1:S}$ as
$P(w_{1:T} \mid x) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, x)$. This distribution is estimated using an
attention-based Encoder-Decoder architecture". Two special kinds of Recurrent Neural Networks
(RNN), LSTM and GRU, which are capable of learning long-term dependencies, are used in this
research work. An RNN is a type of neural network for sequential data that can remember its
inputs thanks to an internal memory, which makes it well suited to machine learning problems on
sequential data. It can produce predictive results on sequential data because the information cycles
through a loop when a decision is made: the network takes into consideration the current inputs as
well as previously received inputs that were learned earlier [27].
LSTM was first introduced by S. Hochreiter and J. Schmidhuber [28] to avoid the long-term
dependency problem. LSTMs inherit the exact architecture of standard RNNs, with the exception
of the hidden state. The memory in LSTMs (called cells) takes as input the previous state and the
current input. Internally, these cells decide what to keep in and what to eliminate from the
memory. Then, they combine the previous state, the current memory, and the input. LSTM
calculates a hidden state $h_t$ as follows.
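A standard LSTM formulation, consistent with the symbols defined in the sentence that follows (bias terms omitted for brevity), is:

i_t = \sigma(W_i x_t + U_i h_{t-1})
f_t = \sigma(W_f x_t + U_f h_{t-1})
o_t = \sigma(W_o x_t + U_o h_{t-1})
\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1})
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
h_t = o_t \odot \tanh(C_t)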
where $t$, $i$, $f$, and $o$ denote the time step, the input gate, the forget gate, and the output gate,
and $U$ and $W$ are the recurrent connection at the previous and current hidden layer and the
weight matrix connecting the inputs to the current hidden layer, respectively. $\tilde{C}_t$ is a
"candidate" hidden state that is computed from the current input and the previous hidden state, and
$C$ is the internal memory of the unit. GRU extends LSTM with a gating network that generates
signals controlling how the present input and the previous memory are used to update the current
activation, and thereby the current network state. The gates are themselves weighted and are
selectively updated [29]. For GRU, the hidden state $h_t$ is computed as follows.
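A standard GRU formulation, again with bias terms omitted, is:

z_t = \sigma(W_z x_t + U_z h_{t-1})
r_t = \sigma(W_r x_t + U_r h_{t-1})
\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t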
where $\tilde{h}_t$ is the candidate activation, $r$ is a reset gate, and $z$ is an update gate. Both LSTM
and GRU are designed to resolve, through their gating mechanisms, the vanishing gradient problem
that prevents standard RNNs from learning long-term dependencies.
A basic form of NMT consists of two components: an encoder, which computes a representation
of the source sentence S, and a decoder, which generates one target word at a time and hence
decomposes the conditional probability [25]. The attention-based Encoder-Decoder architecture
used for Amharic-Arabic NMT is shown in Figure 1.
In general, the proposed model works as follows:
1. Read the input words one by one to obtain a vector representation at every encoder time step using the LSTM/GRU-based encoder.
2. Provide the encoder representation to the decoder.
3. Extract the output words one by one using another LSTM/GRU-based decoder that is conditioned on the selected inputs from the encoder hidden state of each time step.
Figure 1: Attention based Encoder-Decoder architecture for Amharic-Arabic NMT
With this setting, the model is able to selectively focus on useful parts of the input sequence and
hence, learn the alignment (matching segments of original text with their corresponding segments
of the translation).
As shown in Figure 1, using the LSTM attention-based Encoder-Decoder, each word from the
source sentence is associated with a vector $w \in \mathbb{R}^d$, so that the sentence is transformed
into $[w_0, w_1, w_2, w_3, w_4] \in \mathbb{R}^{d \times 5}$ by the encoder, and an LSTM is then
computed over this sequence of vectors. This yields the encoder representation (attention weights)
$e = [e_0, e_1, e_2, e_3, e_4]$. The attention weights and the word vector of each time step are fed
to another LSTM cell, where $g$ is a transformative function that outputs a vocabulary-size vector
$s_t$. A soft-max is then applied to $s_t$ to turn it into a vector of probabilities $p_t \in \mathbb{R}^V$.
Each entry of $p_t$ measures how likely each word in the vocabulary is; the highest-probability
entry is taken as $i_t = \operatorname{argmax}(p_t)$, and the corresponding word vector $w_{i_t}$ is
fed back as the input of the next decoding step.
The attention (context) vector $c_t$ is computed at each decoding step: first, a score
$\alpha_{t'} \in \mathbb{R}$ is computed for each hidden state $e_{t'}$ of the encoder with a function
$f(h_{t-1}, e_{t'}) \rightarrow \alpha_{t'}$; the sequence of $\alpha_{t'}$ values is then normalized with a
soft-max, and $c_t$ is computed as the weighted average of the $e_{t'}$.
The same procedure is also applied for GRU based NMT.
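A minimal NumPy sketch of one attention-based decoding step may help make the above concrete; the scoring function (a bilinear "general" score), the matrix names, and the greedy argmax choice are illustrative assumptions rather than the exact OpenNMT implementation.

import numpy as np

def softmax(x):
    x = x - np.max(x)          # numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention_decode_step(h_prev, encoder_states, W_score, W_out):
    """One illustrative decoding step with attention.

    h_prev         : previous decoder hidden state, shape (d,)
    encoder_states : encoder representations e_0..e_S, shape (S, d)
    W_score        : parameters of the scoring function f, shape (d, d)
    W_out          : output projection g to vocabulary size, shape (V, 2d)
    """
    # 1. Score each encoder hidden state e_t' against the previous decoder state.
    alpha = softmax(encoder_states @ (W_score @ h_prev))      # attention weights, shape (S,)
    # 2. Context vector c_t: weighted average of the encoder states.
    context = alpha @ encoder_states                          # shape (d,)
    # 3. Transformative function g: project [context; h_prev] to a vocabulary-size vector s_t.
    s_t = W_out @ np.concatenate([context, h_prev])           # shape (V,)
    # 4. Soft-max turns s_t into a probability vector p_t; the argmax is the predicted word.
    p_t = softmax(s_t)
    return int(np.argmax(p_t)), p_t, alpha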
Preprocessing of both the Amharic and Arabic scripts has a great impact on the performance of
the NMT system. Sentences are split and aligned manually, and then all punctuation marks are
removed from the texts. After extensive experiments, the maximum source and target sequence
length is set to 44, the maximum batch size for training and validation is set to 80 and 40
respectively, and the learning rate is set to 0.001 with Adam optimization for both the LSTM and
GRU RNN types. The remaining parameters are kept at their defaults. The system saves the model
after every 10,000 training samples and then computes the accuracy and perplexity of each of the
10 saved models. Perplexity is a measure of how easily a probability distribution (the model)
predicts the next sequence. A low perplexity indicates that the translation model is good at
predicting/translating the test set. It is computed as $PP(W) = P(w_1 w_2 \ldots w_n)^{-\frac{1}{n}}$.
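As a simple illustration of this definition, the sketch below computes the perplexity of one sentence from the per-token probabilities a model assigns; the probability values are invented for the example.

import math

def sentence_perplexity(token_probs):
    """PP(W) = P(w_1 w_2 ... w_n) ** (-1/n), computed from per-token probabilities."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Four-token sentence with invented model probabilities.
print(sentence_perplexity([0.5, 0.4, 0.6, 0.3]))   # approximately 2.30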
Table 3 and Table 4 show the LSTM-based and GRU-based NMT evaluation, where "Val ppl",
"Val. Accuracy", "Av. pred score", and "Pred ppl" represent Validation Perplexity, Validation
Accuracy, Average Prediction Score, and Prediction Perplexity, respectively. The results indicate
that LSTM-based NMT outperforms GRU-based NMT. Since this is the first experiment done on
an Amharic-Arabic parallel text corpus, we consider this a good performance given the small
corpus size.
Table 3. LSTM-based NMT Evaluation
Table 4. GRU-based NMT Evaluation
The models are evaluated using Bilingual Evaluation Understudy (BLEU), which is a score for
comparing a candidate translation of a text with reference translations. The primary programming
task for a BLEU implementer is to compare n-grams of the candidate with n-grams
of the reference translation and count the number of matches, which are position independent.
The more matches, the better the candidate translation is. BLEU is inexpensive to calculate and
quick to use. It is expressed as the following equation [30]:
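The standard BLEU formulation from [30], consistent with the description that follows, is:

\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

Here BP is the brevity penalty, which penalizes candidate translations that are shorter than their references.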
where $p_n$ is an n-gram precision that uses n-grams up to length $N$, and the positive weights $w_n$
sum to one.
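In practice, BLEU can be computed with an off-the-shelf implementation; the sketch below uses NLTK's corpus_bleu on a toy tokenized example (the tokens are invented for illustration).

from nltk.translate.bleu_score import corpus_bleu

# One candidate translation and its (single) reference, already tokenized.
references = [[["the", "book", "is", "on", "the", "table", "today"]]]   # invented example
candidates = [["the", "book", "is", "on", "the", "table", "today"]]

# corpus_bleu returns a score in [0, 1]; multiply by 100 to report a percentage.
score = corpus_bleu(references, candidates)
print(f"BLEU = {score * 100:.1f}%")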
We also compared the two recurrent-unit variants, LSTM-based and GRU-based OpenNMT
translation, with the Google Translation system, a free multilingual translation system developed
by Google [9]; the results are shown in Figure 2. LSTM-based OpenNMT outperforms GRU-based
OpenNMT and the Google Translation system, with BLEU scores of 12%, 11%, and 6%,
respectively.
Figure 2: Best BLEU-Scores of LSTM and GRU based OpenNMT translation and Google Translation
System
6. CONCLUSION
Many researchers have investigated translation problems for many language pairs, and NMT has
recently shown promising results on several of them. However, much of the research in this area
has focused on European languages, as these languages are very rich in resources. Since the
Amharic and Arabic languages lack a parallel corpus for the purpose of developing NMT, a small
Amharic-Arabic parallel text corpus has been constructed to train the Amharic to Arabic NMT
system by splitting the verses manually into separate sentences, with Amharic as the source
language and Arabic as the target language. Using the constructed corpus, LSTM-based and
GRU-based NMT models are developed and evaluated using BLEU, and the results are compared
with the Google Translation system. Since this is the first experiment done on an Amharic-Arabic
parallel text corpus, we consider this a good performance for a small corpus. Extensive
experiments with a larger amount of training data could be carried out for better performance.
REFERENCES
[1] S. Islam, A. Paul, B. S. Purkayastha, and I. Hussain, “CONSTRUCTION OF ENGLISH-BODO
PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRANSLATION,” Int. J. Nat.
Lang. Comput. Vol, vol. 7, 2018.
[2] L. Rura, W. Vandeweghe, and M. Montero Perez, “Designing a parallel corpus as a multifunctional
translator’s aid,” in XVIII FIT World Congress= XVIIIe Congrès mondial de la FIT, 2008.
[3] B. Maylath, “Current trends in translation,” Commun. Lang. Work, vol. 2, no. 2, pp. 41–50, 2013.
[4] J. Oladosu, A. Esan, I. Adeyanju, B. Adegoke, O. Olaniyan, and B. Omodunbi, “Approaches to
Machine Translation: A Review,” FUOYE J. Eng. Technol., vol. 1, pp. 120–126, 2016.
[5] T. O. Daybelge, “Improving the precision of example-based machine translation by learning from
user feedback,” bilkent university, 2007.
[6] S. Ghosh, S. Thamke, and others, “Translation of Telugu-Marathi and vice-versa using rule based
machine Translation,” arXiv Prepr. arXiv1406.3969, 2014.
[7] MemoQ, “5 Translation Technology Trends to Watch Out for in 2019,” 2019. [Online]. Available:
https://slator.com/press-releases/5-translation-technology-trends-to-watch-out-in- 2019/.
[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[9] Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and
machine translation,” arXiv Prepr. arXiv1609.08144, 2016.
[10] H. L. Shashirekha and I. Gashaw, “DICTIONARY BASED AMHARIC-ARABIC CROSS
LANGUAGE INFORMATION RETRIEVAL,” Comput. Sci. & Inf. Technol., vol. 6, pp. 49–60,
2016.
[11] P. Koehn and R. Knowles, “Six challenges for neural machine translation,” arXiv Prepr.
arXiv1706.03872, 2017.
[12] M. Wordofa, “Semantic Indexing and Document Clustering for Amharic Information Retrieval,”
AAU, 2013.
[13] M. El-Haj, “Multi-document arabic text summarisation,” University of Essex, 2012.
[14] A. M. Gezmu, A. Nürnberger, and T. B. Bati, “A Parallel Corpus for Amharic-English Machine
Translation,” 2018.
[15] S. M. Strassel and J. Tracey, “LORELEI Language Packs: Data, Tools, and Resources for
Technology Development in Low Resource Languages.,” in LREC, 2016
[16] M. M. Sakre, M. M. Kouta, and A. M. N. Allam, “automated construction of Arabic-English parallel
corpus,” J. Adv. Comput. Sci., vol. 3, 2009.
[17] A. A.-S. Ahmad, B. Hammo, and S. Yagi, “ENGLISH-ARABIC POLITICAL PARALLEL
CORPUS: CONSTRUCTION, ANALYSIS AND A CASE STUDY IN TRANSLATION
STRATEGIES,” Jordanian J. Comput. Inf. Technol., vol. 3, no. 3, 2017.
[18] G. Inoue, N. Habash, Y. Matsumoto, and H. Aoyama, “A Parallel Corpus of Arabic-Japanese News
Articles,” in Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC-2018), 2018.
[19] H. M. Alotaibi, “Arabic-English Parallel Corpus: A New Resource for Translation Training and
Language Teaching,” Arab World English J. Vol., vol. 8, 2017.
[20] P. Resnik and N. A. Smith, “The web as a parallel corpus,” Comput. Linguist., vol. 29, no. 3, pp.
349–380, 2003.
[21] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine
translation,” arXiv Prepr. arXiv1406.1078, vol. 3, 2014.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and
translate,” arXiv Prepr. arXiv1409.0473, vol. 7, 2014.
[23] A. Almahairi, K. Cho, N. Habash, and A. Courville, “First result on Arabic neural machine
translation,” arXiv Prepr. arXiv1606.02680, vol. 1, 2016.
[24] J. Tiedemann, “Parallel Data, Tools and Interfaces in OPUS.,” in Lrec, 2012, vol. 2012, pp. 2214–
2218.
[25] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural
machine translation,” arXiv Prepr. arXiv1508.04025, 2015.
[26] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “Opennmt: Open-source toolkit for neural
machine translation,” arXiv Prepr. arXiv1701.02810, 2017.
[27] N. Donges, “Recurrent Neural Networks and LSTM,” 2018. [Online]. Available:
https://towardsdatascience.com/recurrent-neural-networks-and-lstm-4b601dd822a5.
[28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp.
1735–1780, 1997.
[29] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural
networks on sequence modeling,” arXiv Prepr. arXiv1412.3555, 2014.
[30] K. Wolk and K. Marasek, “Neural-based machine translation for medical text domain. Based on
European Medicines Agency leaflet texts,” Procedia Comput. Sci., vol. 64, pp. 2–9, 2015.
AUTHORS
Ibrahim Gashaw Kassa has been a Ph.D. candidate at Mangalore University, Karnataka State,
India, since 2016. He graduated in 2006 in Information Systems from Addis Ababa
University, Ethiopia. In 2014, he obtained his master's degree in Information Technology
from the University of Gondar, Ethiopia, where he served as a lecturer from 2009 to May
2016. His research interests include Cross-Language Information Retrieval, Machine
Translation, Artificial Intelligence, and Natural Language Processing.
Dr. H L Shashirekha is a Professor in the Department of Computer Science, Mangalore
University, Mangalore, Karnataka State, India. She completed her M.Sc. in Computer
Science in 1992 and her Ph.D. in 2010 from the University of Mysore. She is a member of
the Board of Studies and the Board of Examiners (PG) in Computer Science, Mangalore
University. She has presented several papers at international conferences and published
several papers in international journals and conference proceedings. Her areas of research
include Text Mining and Natural Language Processing.