People across the globe have access to materials such as journals, articles and adverts via the internet. However, many of these resources come in a wide range of languages. Although English seems the most convenient language for most people, some readers believe that working with materials in one's native language is more enjoyable than working in other languages. Research has shown that the Arabic language has not been prominent in online materials, and the few existing resources are often ignored because of the peculiar nature of Arabic's characters and constructs. Hence, a proper study of its relationship with English, with a view to bringing people closer to understanding it, becomes necessary. The system scenarios were modeled with the Unified Modeling Language and implemented in Microsoft C#, so that the expected set of characters in the language of interest is formed automatically for a given input. The procedural steps were followed in developing and running the code using a context-free, rule-based technique, with the required hardware clearly described in the design. The system's workability was tested with different source texts as inputs, and in each case the resulting outputs were effective with respect to the translation process. The design is expected to serve as a tool for assisting beginners in these two languages; it showcases a one-to-one form of correspondence, so more rules and functions to ensure a more robust system are expected in future work.
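To make the one-to-one, rule-based mapping concrete, here is a minimal sketch of the idea in Python (the paper's implementation was in Microsoft C#; the lexicon and the handling of unknown words below are illustrative assumptions, not the authors' actual rule set):

```python
# Minimal sketch of a context-free, rule-based English -> Arabic lookup.
# The lexicon below is an illustrative placeholder, not the paper's rule set.

LEXICON = {
    "book": "كتاب",
    "pen": "قلم",
    "house": "بيت",
}

def translate(sentence: str) -> str:
    """Map each English token to its Arabic counterpart, one-to-one,
    mirroring the one-to-one correspondence the abstract describes."""
    out = []
    for token in sentence.lower().split():
        out.append(LEXICON.get(token, f"<{token}?>"))  # flag unknown words
    return " ".join(out)

if __name__ == "__main__":
    print(translate("book pen"))  # كتاب قلم
```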
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...Syeful Islam
Hundreds of millions of people, of almost all levels of education and attitudes and from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand owing to the growing use of web-based communication. One of the major problems in Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different naming entities exist in Bangla, so it is difficult to tell whether a word is a proper noun or not. Here we introduce a new approach to identifying proper nouns in a Bengali sentence for UNL without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion to UNL, and vice versa, possible with minimal words stored in the dictionary.
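The abstract does not spell out the rules, but one simple reading of "without storing a huge number of naming entities" is a dictionary-miss heuristic: a token absent from the known-word dictionary becomes a proper-noun candidate, since Bangla script offers no capitalization cue. A hypothetical sketch (the dictionary contents and example sentence are invented):

```python
# Illustrative heuristic only, not the paper's exact rules: treat tokens
# missing from the known-word dictionary as proper-noun candidates.

KNOWN_WORDS = {"আমি", "ভাত", "খাই"}   # "I", "rice", "eat" (hypothetical entries)

def proper_noun_candidates(sentence: str) -> list[str]:
    return [w for w in sentence.split() if w not in KNOWN_WORDS]

print(proper_noun_candidates("রহিম ভাত খাই"))  # ['রহিম'] -> likely a name
```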
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
Hundreds of millions of people, of almost all levels of education and attitudes and from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand owing to the growing use of web-based communication. One of the major problems in Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different naming entities exist in Bangla, so it is difficult to tell whether a word is a naming word (proper noun) or not. Here we introduce a new approach to identifying naming words in a Bengali sentence for UNL without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion to UNL, and vice versa, possible with minimal words stored in the dictionary.
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...Syeful Islam
Abstract—Nowadays hundreds of millions of people, of almost all levels of education and attitudes and from different countries, communicate with each other for different purposes and perform their jobs on the internet or other communication media using various languages. Not all people know all languages, so it is very difficult to communicate or work across languages. In this situation computer scientists have introduced various inter-language translation programs (machine translation). UNL is one such inter-language translation program. One of the major problems in UNL is identifying a name in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, so it is difficult to tell whether a word is a proper noun or not. Here we propose analysis rules to identify proper nouns in a sentence and establish a post-converter that translates the name entity from Bangla to UNL. The goal is to make Bangla sentence conversion to UNL, and vice versa, possible. The UNL system shows that the theoretical analysis of our proposed system is able to identify proper nouns in Bangla sentences and produce the corresponding Universal Words for UNL.
Spotting The Difference–Machine Versus Human TranslationUlatus
Regardless of how much the systems have improved and made worldwide communication easier, there is still no alternative to human translation. Machines can only ensure grammatical accuracy; the semantic, linguistic, and cultural completeness of a text can only be achieved by human speakers.
Hidden markov model based part of speech tagger for sinhala languageijnlc
In this paper we present the fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Sinhala. In any natural language processing task, part of speech is a vital topic, involving analysis of the construction, behaviour and dynamics of the language, knowledge that can be utilized in computational linguistics analysis and automation applications. Since Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is essential for further analysis of the language. Our research takes a statistical approach, in which tagging is done by computing the tag-sequence probability and the word-likelihood probability from the given corpus, with the linguistic knowledge extracted automatically from the annotated corpus. The current tagger reaches more than 90% accuracy for known words.
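A toy version of such a tagger is easy to sketch: estimate transition (tag-sequence) and emission (word-likelihood) probabilities from an annotated corpus, then decode with Viterbi. The corpus below is an invented English stand-in, not the Sinhala corpus the paper uses:

```python
# Minimal HMM POS tagger: transition and emission probabilities estimated
# from a toy annotated corpus, decoded with the Viterbi algorithm.

from collections import defaultdict

corpus = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]

trans = defaultdict(lambda: defaultdict(float))   # P(tag_i | tag_{i-1})
emit = defaultdict(lambda: defaultdict(float))    # P(word | tag)
for sent in corpus:
    prev = "<S>"
    for word, tag in sent:
        trans[prev][tag] += 1
        emit[tag][word] += 1
        prev = tag
for table in (trans, emit):                       # normalize counts
    for row in table.values():
        total = sum(row.values())
        for key in row:
            row[key] /= total

def viterbi(words):
    tags = list(emit)
    V = [{t: trans["<S>"][t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][best] * trans[best][t] * emit[t].get(w, 1e-6)
            ptr[t] = best
        V.append(col); back.append(ptr)
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):                    # backtrack best path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "cat", "runs"]))  # ['DET', 'NOUN', 'VERB']
```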
Translation, transcription and interpretationlee shin
The slides show some of the basic differences and concepts of translation (http://www.waterstonetranslations.com), transcription and interpretation.
To know more, visit http://qualitytran.blogspot.in/2015/08/comparison-between-translators-and.html
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, text containing a mixture of English and Bangla words and phrases has become increasingly common. Existing transliteration tools perform poorly on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can satisfactorily transliterate both English words and phonetically typed Bangla words. This is achieved by adopting a hybrid of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
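The dictionary-then-rules control flow that such a hybrid implies can be sketched as follows (the stage breakdown, dictionary entries and rewrite rules are placeholders, not THT's actual resources):

```python
# Sketch of a hybrid transliterator: dictionary lookup first, then
# longest-match rewrite rules, then pass-through. All mappings are
# illustrative placeholders.

DICTIONARY = {"school": "স্কুল"}                     # known mappings (hypothetical)
RULES = [("a", "আ"), ("m", "ম"), ("i", "ই")]        # placeholder rules

def transliterate(word: str) -> str:
    if word in DICTIONARY:                # stage 1: dictionary lookup
        return DICTIONARY[word]
    out, i = "", 0                        # stage 2: rule-based rewrite
    while i < len(word):
        for src, tgt in RULES:
            if word.startswith(src, i):
                out += tgt
                i += len(src)
                break
        else:
            out += word[i]                # stage 3: pass through unmapped
            i += 1
    return out

print(transliterate("school"))   # dictionary hit
print(transliterate("ami"))      # rule-based fallback (placeholder mapping)
```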
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
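The back-end step described here, classifying language identity from recognized phone sequences with an SVM, might look roughly like this (the phone strings and labels are invented; the ASR front-end is assumed to exist):

```python
# Sketch of the SVM back-end: phone-bigram counts over decoder output
# classified with a linear SVM. Data below is made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

phone_seqs = ["p e d i", "th e s a", "l e f a", "s i ng k"]   # decoder output
labels = ["sepedi", "english", "sepedi", "english"]

clf = make_pipeline(
    # token_pattern keeps single-character phones; bigrams of phones
    CountVectorizer(token_pattern=r"(?u)\S+", ngram_range=(2, 2)),
    LinearSVC(),
)
clf.fit(phone_seqs, labels)
print(clf.predict(["l e d i"]))   # classify a new phone sequence
```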
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...kevig
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime and game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they cannot segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime and game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime and game characters into subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and individual character, and show that they form linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed the conventional method.
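The core weighting step can be illustrated with scikit-learn, using character n-grams as a stand-in for the learned subword units the study uses; the two "character" lines below are invented examples, not from any actual script:

```python
# Sketch: weight subword-like units by TF-IDF to surface strings that are
# specific to a character's speech. Char n-grams stand in for learned
# subword units here.

from sklearn.feature_extraction.text import TfidfVectorizer

lines_per_character = [
    "そうですわね ごきげんよう",    # stereotypical "young lady" speech
    "おれは まけないぜ たたかうぜ",  # stereotypical "hero" speech
]

vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X = vec.fit_transform(lines_per_character)
terms = vec.get_feature_names_out()

for row, who in zip(X.toarray(), ["lady", "hero"]):
    top = sorted(zip(row, terms), reverse=True)[:3]   # highest-weighted units
    print(who, [t for _, t in top])
```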
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
Building dialogue systems for interaction has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need to design systems for other languages, such as Arabic, is increasing. For this reason, there is growing interest in the Arabic dialogue act classification task, because it is a key player in Arabic language understanding for building these systems. This paper surveys different techniques for dialogue act classification for Arabic. We describe the main existing techniques for utterance segmentation and classification, the annotation schemas, and the test corpora for Arabic dialogue understanding that have been introduced in the literature.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language has no delimiters between words. Hiragana is a type of Japanese phonographic character, used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because less information is available for word division. For the morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning a model based on ordinary Japanese text and examine the influence of training data from texts of various genres.
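A skeleton of such a sequence labeller, with the CRF output layer omitted for brevity and placeholder vocabulary sizes, could look like this in PyTorch:

```python
# Skeleton of a BiLSTM character-level sequence labeller in the spirit of
# the described Bi-LSTM CRF model. The CRF layer is omitted for brevity;
# vocabulary and tag-set sizes are placeholders.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, n_chars=100, n_tags=10, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # per-character tag scores

    def forward(self, char_ids):
        h, _ = self.lstm(self.emb(char_ids))
        return self.out(h)

model = BiLSTMTagger()
scores = model(torch.randint(0, 100, (1, 12)))   # one 12-character sentence
print(scores.shape)   # torch.Size([1, 12, 10])
```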
Presented at code4lib 2015, February 10th, in Portland, OR.
“If you have something to say, then say it in code…” - Sebastian Hammer,
code4lib 2009
In its 10 year run, code4lib has covered the spectrum of libtech
development, from search to repositories to interfaces. However, during
this time there has been little discussion about this one little fact
about development - code does not exist in a vacuum.
Like the comment above, code has something to say. A person’s or
organization’s culture and beliefs influence code in all steps of the
development cycle. What development method you use, tools, programming
languages, licenses - everything is interconnected with and influenced
by the philosophies, economics, social structures, and cultural beliefs
of the developer and their organization/community.
This talk will discuss these interconnections and influences when one
develops code for libraries, focusing on several development practices
(such as “Fail Fast, Fail Often” and Agile) and licensing choices (such
as open source) that libtech has either tried to model or incorporate
into mainstream libtech practices. It’ll only scratch the surface of the
many influences present in libtech development, but it will give folks a
starting point to further investigate these connections at their own
organizations and as a community as a whole.
tl;dr - this will be a messy theoretical talk about technology and
libraries. No shiny code slides, no live demos. You might come out of
this talk feeling uncomfortable. Your code does not exist in a vacuum.
Then again, you don’t exist in a vacuum either.
A selected bibliography of the talk can be viewed at http://is.gd/c4l15yoose.
I am a lecturer in English at Khawaja Fared Govt. College, Rahim Yar Khan. Here is my humble effort to discuss how to choose a variety or code in a multilingual society.
Welcome! Accessible reference for a diverse communityJennifer Arnott
Panel presentation (15 minutes) for the New England Archivists conference in October 2016. Discusses accessible reference approaches for librarians and archivists (particular reference to accessibility for visual impairment).
ARABIC LANGUAGE CHALLENGES IN TEXT BASED CONVERSATIONAL AGENTS COMPARED TO TH...ijcsit
This paper does not compare the Arabic and English languages as natural languages. Instead, it focuses on comparing their challenges in building text-based Conversational Agents (CAs). A CA is an intelligent computer program used to handle conversations between the user and the machine. Nowadays, CAs can play an important role in many areas, as this work shows. In this paper, the different approaches that can be used to build a CA are differentiated, and for each approach the points of comparison between the Arabic and English languages are discussed with respect to the Arabic language.
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
Hundreds of millions of people, of almost all levels of education and attitudes and from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand owing to the growing use of web-based communication. One of the major problems in Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different naming entities exist in Bangla, so it is difficult to tell whether a word is a naming word or not. Here we introduce a new approach to identifying naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible with minimal words stored in the dictionary.
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
Natural language processing is an interrelated discipline that adds the human capability of communication to the computer world. The Amharic language has seen much improvement over time, thanks to researchers at PhD and MSc level at AAU. Here I have tried to study and come up with a limited-scope solution that performs syntax parsing for the Amharic language and draws syntax parse trees using Python.
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...kevig
A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language in machine-readable form. The scope of corpora is endless in computational linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most NLP applications, especially Statistical Machine Translation (SMT). SMT is the most popular approach to Machine Translation (MT) nowadays, and it can produce high-quality translations given a huge amount of aligned parallel text in both the source and target languages. Although Bodo is a recognized natural language of India and a co-official language of Assam, very little machine-readable information exists for the Bodo language. Therefore, to expand the computerized information for the language, an English to Bodo SMT system has been developed. This paper mainly focuses on building English-Bodo parallel text corpora to implement the English to Bodo SMT system using the phrase-based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed parallel text corpora has been tested in the SMT system using two evaluation techniques.
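The basic artifact such a corpus-creator tool produces is a pair of sentence-aligned files, one sentence per line, as phrase-based SMT toolkits such as Moses expect. A minimal sketch (the sentence pairs and file names are placeholders):

```python
# Sketch of the corpus-construction step: write sentence-aligned
# English-Bodo pairs to parallel files, one sentence per line.
# The pairs and file names below are illustrative placeholders.

pairs = [
    ("Good morning.", "फुंजा मोजां।"),
    ("How are you?", "नों माबोरै दं?"),
]

with open("corpus.en", "w", encoding="utf-8") as en, \
     open("corpus.bd", "w", encoding="utf-8") as bd:
    for e, b in pairs:
        en.write(e.strip() + "\n")   # line i of corpus.en aligns with
        bd.write(b.strip() + "\n")   # line i of corpus.bd
```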
TURN SEGMENTATION INTO UTTERANCES FOR ARABIC SPONTANEOUS DIALOGUES ...ijnlc
Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications such as text summarization, text translation and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic human-computer systems. In this paper, we introduce a novel approach to segmenting turns into utterances for Egyptian spontaneous dialogues and instant messages (IM) using a machine learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian-dialect dialogue corpus, the system was evaluated on our own corpus of 3001 turns, which were collected, segmented and annotated manually from Egyptian call-centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
Interpretation of Sadhu into Cholit Bhasha by Cataloguing and Translation Systemijtsrd
Sadhu and Cholit bhasha are two significant forms of the Bengali language of Bangladesh. Sadhu was used in an earlier era and contains Sanskrit components, but in the present era Cholit has taken its place. Many formal and legal documents exist in the Sadhu form and direly need to be translated into Cholit, because it is more favorable and speaker-friendly. This paper addressed the issue by familiarizing the current era with Sadhu through software. Different sentences were chosen and the final data set was obtained by Principal Component Analysis (PCA). MATLAB and Python were used for different machine learning algorithms, with most of the work done using Scikit-learn and the MATLAB machine learning toolbox. Linear Discriminant Analysis (LDA) was found to perform best. Speed prediction was also done and values were determined from graphs. It was inferred that this categorizer translated all Sadhu words to Cholit efficiently, precisely and in a well-structured way; therefore, Sadhu will not remain a complex language in this decade. Nakib Aman Turzo | Pritom Sarker | Biplob Kumar, "Interpretation of Sadhu into Cholit Bhasha by Cataloguing and Translation System", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume-4, Issue-3, April 2020. URL: https://www.ijtsrd.com/papers/ijtsrd30792.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/30792/interpretation-of-sadhu-into-cholit-bhasha-by-cataloguing-and-translation-system/nakib-aman-turzo
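The PCA-then-LDA pipeline the abstract describes maps directly onto scikit-learn; here is a minimal sketch with random stand-in features, not the paper's actual data:

```python
# Sketch of the described pipeline: PCA for feature reduction, then
# Linear Discriminant Analysis for classification. Random data stands
# in for the Sadhu/Cholit sentence features.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))          # 60 sentences, 20 raw features
y = rng.integers(0, 2, size=60)        # 0 = Sadhu, 1 = Cholit

clf = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
clf.fit(X, y)
print(clf.score(X, y))                 # training accuracy on toy data
```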
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...IJECEIAES
Arabic natural language processing (ANLP) is a subfield of artificial intelligence (AI) that tries to build various applications in the Arabic language, such as Arabic sentiment analysis (ASA), which is the task of classifying the feelings and emotions expressed in order to determine the attitude of the writer (neutral, negative or positive). When working on ASA, researchers may use various tools in their projects without explaining the reason for that choice, or they choose a set of libraries according to their knowledge of a specific programming language. Because of the abundance of their libraries in the ANLP field, especially in ASA, we rely on the Java and Python programming languages in our research work. This paper makes an in-depth comparative evaluation of different valuable Python and Java libraries to deduce the most useful ones in Arabic sentiment analysis (ASA). Based on a large variety of influential works in the domain of ASA, we deduce that the NLTK, Gensim and TextBlob libraries are the most useful for the Python ASA task. For Java ASA libraries, we conclude that the Weka and CoreNLP tools are the most used and have achieved strong results in this research domain.
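As a taste of the surveyed Python libraries, here is TextBlob's sentiment API; note that its default model is trained on English, so Arabic text would typically need translation or a custom-trained classifier first:

```python
# Quick look at TextBlob's sentiment API shape. The default sentiment
# model is English-trained, so this is only an API illustration, not a
# working Arabic sentiment analyzer.

from textblob import TextBlob

blob = TextBlob("The service was excellent and fast")
print(blob.sentiment.polarity)   # > 0 means positive
```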
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
Much automatic translation work has addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair due to its scarcity of parallel data, and there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by aligning the existing monolingual Arabic text with its equivalent Amharic translation, both available from Tanzil. Experiments are carried out on Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) based Neural Machine Translation (NMT) using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM and GRU based NMT models and the Google Translate system are compared, and the LSTM based OpenNMT is found to outperform the GRU based OpenNMT and Google Translate, with BLEU scores of 12%, 11%, and 6% respectively.
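The LSTM-versus-GRU comparison amounts to swapping the recurrent cell inside the same attention-based encoder-decoder. A minimal PyTorch sketch of that swap (sizes are arbitrary; the actual experiments used the OpenNMT system):

```python
# Sketch: the same encoder with an LSTM vs. a GRU cell. In either case the
# decoder attends over the per-token encoder states. Dimensions arbitrary.

import torch
import torch.nn as nn

src = torch.randint(0, 500, (1, 9))           # a 9-token source sentence
emb = nn.Embedding(500, 64)

lstm_enc = nn.LSTM(64, 128, batch_first=True)
gru_enc = nn.GRU(64, 128, batch_first=True)

lstm_states, _ = lstm_enc(emb(src))           # encoder states the decoder
gru_states, _ = gru_enc(emb(src))             # attends over in either case
print(lstm_states.shape, gru_states.shape)    # both: [1, 9, 128]
```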
Design and Development of Morphological Analyzer for Tigrigna Verbs using Hyb...kevig
A morphological analyzer is the basis for various high-level NLP applications such as information retrieval, spell checking, grammar checking, machine translation, speech recognition, POS tagging and automatic sentence construction. This paper presents the design and analysis of a morphological analyzer for Tigrigna verbs using a hybrid of memory-based learning and rule-based approaches. The experiments were conducted using Python 3, with the TiMBL algorithms IB2 and TRIBL2 and Finite State Transducer rules. The performance of the system was evaluated using the 10-fold cross-validation technique. Testing was conducted using optimized parameter settings for regular verbs, and linguistic rules of Tigrigna allomorphy and phonology for the irregular verbs. The accuracy of the memory-based approach with optimized parameters was 93.24% for TiMBL algorithm IB2 and 92.31% for TRIBL2. Finally, the hybrid approach reached an actual performance of 95.6% using linguistic rules to handle irregular and copula verbs.
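The evaluation setup, 10-fold cross-validation of a memory-based learner, can be sketched with scikit-learn, using 1-nearest-neighbour as a stand-in for TiMBL's IB2/TRIBL2 (the actual experiments used TiMBL) and random placeholder verb features:

```python
# Sketch of 10-fold cross-validation of a memory-based classifier.
# k-NN stands in for TiMBL here; features and labels are placeholders.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))        # encoded verb features (placeholder)
y = rng.integers(0, 4, size=200)      # morphological analysis classes

scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=10)
print(scores.mean())                  # mean accuracy across the 10 folds
```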
Syracuse University SURFACE: The School of Information Studie.docx deanmtaylor1545
Syracuse University, SURFACE
The School of Information Studies Faculty Scholarship, School of Information Studies (iSchool), 2001
Natural Language Processing
Elizabeth D. Liddy, Syracuse University, [email protected]
Recommended Citation
Liddy, E.D. 2001. Natural Language Processing. In Encyclopedia of Library and Information Science, 2nd Ed. NY. Marcel Decker, Inc.
Natural Language Processing
INTRODUCTION
Natural Language Processing (NLP) is the computerized approach to analyzing text that
is based on both a set of theories and a set of technologies. And, being a very active area
of research and development, there is not a single agreed-upon definition that would
satisfy everyone, but there are some aspects, which would be part of any knowledgeable
person’s definition. The definition I offer is:
Definition: Natural Language Processing is a theoretically motivated range of
computational techniques for analyzing and representing naturally occurring texts
at one or more levels of linguistic analysis for the purpose of achieving human-like
language processing for a range of tasks or applications.
Several elements of this definition can be further detailed. Firstly the imprecise notion of
‘range of computational techniques’ is necessary because there are multiple methods or
techniques from which to choose to accomplish a particular type of language analysis.
‘Naturally occurring texts’ can be of any language, mode, genre, etc. The texts can be
oral or written. The only requirement is that they be in a language used by humans to
communicate to one another. Also, the text being analyzed should not be specifically
constructed.
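Two of the levels of linguistic analysis the chapter refers to, tokenization and part-of-speech assignment, can be shown with NLTK (resource names vary slightly across NLTK versions):

```python
# Illustration of two levels of linguistic analysis with NLTK:
# word tokenization, then part-of-speech tagging.

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Natural language processing analyzes text.")
print(nltk.pos_tag(tokens))   # each token paired with its POS tag
```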
EXTENDING A MODEL FOR ONTOLOGY-BASED ARABIC-ENGLISH MACHINE TRANSLATION (NAN) ijaia
The acceleration of telecommunication needs has led to many lines of research, especially in communication facilitation and machine translation. When people interact with others who have different languages and cultures, they need instant translations. However, the available instant translators still provide somewhat poor Arabic-English translations; for instance, when translating books or articles, the meaning is not totally accurate. Therefore, using semantic web techniques to handle homographs and homonyms semantically, this research aims to extend a model for ontology-based Arabic-English machine translation, named NAN, which simulates the human way of translating. The experimental results show that NAN's translations are more similar to human translation than those of the other instant translators. The resulting translation will help non-Arabic and non-English natives obtain texts in the target language that are more correct and semantically closer to human translations.
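The homograph problem the model targets can be made concrete with WordNet, an ontology-like lexical resource standing in here for NAN's own ontology (which is not described in the abstract):

```python
# Illustration of lexical ambiguity an ontology-based translator must
# resolve: one surface form, several senses. WordNet is a stand-in here.

import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for sense in wn.synsets("bank")[:3]:
    print(sense.name(), "-", sense.definition())
```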
Working without Words: The Methods of Translating Open Access Technological E...Ekrema Shehab
This corpus study demonstrates the difficulties translators encounter when translating into a target language that lacks an established terminology in the field in question. The purpose of this study is twofold: first, it examines existing methods of translating specialized terminology in technology advertisements/commercials based on three main parameters, namely circulation, recurrence, and audience type. Second, the study proposes methods that can be effectively used to render open access specialized technological texts into Arabic for non-specialized audiences. The surveyed texts consist of translations of seventy-five of the most visited online website service advertisements. The paper reveals that the text's appeal must be maintained in translation by securing an uninterrupted flow of communication between the service provider and the customer reading the translation. Conformity to the conventions of open access commercial texts and the functionality of those texts remain the main controllers in translating such texts.
Similar to Design and Implementation of a Language Assistant for English – Arabic Texts (20)
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Design and Implementation of a Language Assistant for English – Arabic Texts
A. A. Alabi
Computer Science Department,
College of Technology,
The Polytechnic,
Ile-Ife, Osun State, Nigeria.
newalabikompurer@gmail.com
O. A. Ayilara
Computer Science Department,
College of Natural Sciences,
Oduduwa University,
Ipetumodu, Ile-Ife, Nigeria
Talk2tobi2004@gmail.com
K. O. Jimoh
Computer Engineering Department,
Federal Polytechnic, Ede,
Osun State, Nigeria
oyewumikudirat@gmail.com
ABSTRACT
People across the globe have access to materials such as journals, articles and adverts via the internet. However,
many of these resources come in a diverse range of languages. Although English seems most convenient to most
people, some readers believe that working on materials in one’s native language is more enjoyable than working in
other languages. Research has shown that the Arabic language has not been prominent in terms of online materials,
and the few existing ones are often ignored due to the peculiar nature of its characters and constructs. Hence, a
proper study of its relationship with the English language, with a view to bringing people closer to its
understanding, becomes necessary. The system scenarios were modeled and implemented using the Unified Modeling
Language and Microsoft C# respectively, in such a way that the expected set of characters of the language of interest
was automatically formed with respect to a given input. The procedural steps were properly followed in the
development and running of the code using a context-free rule-based technique, with the required hardware available
as clearly described in the design. The system’s workability was tested with different source texts as inputs, and in
each case the resulting outputs were very effective with respect to the translation process. The design is expected to
serve as a tool for assisting beginners in these two languages and showcases a one-to-one form of correspondence;
hence, more rules and functions for ensuring a more robust translation are expected in future works.
Keywords: English language, Arabic language, Language Assistant, Machine Translation
1. INTRODUCTION
People across the globe have access to multiple forms of relevant materials such as texts, journals, articles, adverts
and websites via the internet. However, not all of these resources (especially the text-based ones) are in languages
most users understand. Many of them come in a diverse range of languages, from English (US and/or British), which
is a popular language, through French to Arabic and a host of others. Many people today believe that reading and
working on materials in one’s native language is more enjoyable than doing so in other languages, thereby calling
for the need to translate texts into languages of interest.
Translation, as a process, is an activity comprising the interpretation of the meaning of a text in one language and
the production of a new, equivalent text in another; it thus ensures that both the source and the target texts
communicate the same message while taking into account a number of constraints such as context, the grammar of the
source language, its writing conventions, its idioms and the like. The real challenge here is not only translating
the text(s) of one language into equivalent text(s) of another language but also generating meanings instantaneously
and accurately. In a real sense, no automated system is meant to replace human translation, but such a system could
help in some areas where necessary. For instance, a properly implemented system could assist those seeking to
understand text materials which are not written in the language(s) they understand, for important decision making
and action taking. Designers in Machine Translation (MT) mostly care about providing the meaning of a particular
input text in a general form to the user. These systems (i.e. translators) come in different forms: one type
functions to provide assistance and help to human translators during the translation process, such as help with
linguistic rules and language grammar; another type is the automatic system that attempts to translate sentences and
texts directly without any human intervention; while a third is designed to operate according to different laid-down
algorithms and translation policies with respect to the contents and contexts of the translation. The key item of
interest in this translation mechanism is “languages”, which across the globe are
made up of constructs derived or created out of syntaxes of different classes. This is simply because of the
differences in the grammars of the world’s existing languages. Some languages seem relatively simple and easy to
learn to some people, while others look difficult to some extent. Meanwhile, information in languages considered
difficult is rarely accessed on the net; a condition that tends to portray these languages as irrelevant.
Although some readers and information users have access to many texts, papers, articles and websites over the
internet, not all of these texts are in their languages of interest, leading some people to ignore or neglect such
materials. Many even find some of these texts difficult to pronounce, making it necessary to find a translator,
either a person very familiar with the language or a piece of software, to translate them into the target language
before anything can be understood. A good example is the Arabic language, which most people ignore whenever they
come across it while sourcing information, due to the peculiar nature of its characters and constructs. This, we
know, can amount to a serious waste of time and resources and, in some cases, the loss of credible information.
People find it extremely difficult to source or make reference to Arabic texts for these reasons, making the
language less relevant on most of our online platforms. The salient issue here is that some of the ignored content
could also be relevant, and could even be the only means of writing and reading when dealing with people who know
only Arabic. The fact remains that people (i.e. non-Arabs) need to understand Arabic, and Arabs need to understand
other languages for handling meaningful materials over the internet, all by an automated system and at no cost.
Many non-Arabs today are faced with the need to understand Arabic for interactive communication in matters of
interest. The Arabic language is very distinctive in terms of its lexis and syntax, thus posing serious problems in
the area of correspondence for people who do not understand it. On the other hand, some Arabs (mostly beginners)
sometimes find it difficult to pronounce English text the exact way English speakers do; in most cases the Arabic
tone comes into play whenever they pronounce letters in English. Affected people often improvise by employing a
third party for assistance, which in most cases turns private information into public information.
The truth is that no translator can implicitly take care of the syntactic relationships between the English and
Arabic languages because of the peculiar nature of Arabic, but the burden could be drastically reduced by creating a
facilitator that serves as a bridge between the two languages. The strategy here is to build a computer-based,
rule-based language tool that can render any English text in its respective Arabic-based form in a very fast and
easy manner. This should help people with little or no knowledge of the Arabic language develop interest in it, thus
creating a way in for those who want to start working on materials written in the language. It could also serve as a
language assistant for Arabs and other Arabic speakers (people whose sole communication tool is the Arabic
language), thus improving their pronunciation of English texts.
2. RELATED WORK
Machine Translation is about the translation of natural languages (Albat & Fritz, 2012), an area which has recently
witnessed a series of developments, and a lot of work has been done in this area of Natural Language Processing
(NLP), a phenomenon also referred to as Machine Translation (MT). For instance, Nadkarni et al (2011) provided a
brief description of common machine learning approaches that are being used for diverse NLP sub-problems. They also
discussed how modern NLP architectures are designed, with a summary of the Apache Foundation Unstructured
Information Management Architecture.
Carbonell et al (1981) developed a tool for addressing several translation problems, with examples of English-to-
Spanish and English-to-Russian translations. In it, the source was first analyzed and mapped into a language-free
conceptual representation, with an inference mechanism applying contextual world knowledge about items that were
only implicit in the input text. The final step then involved the mapping of appropriate sections of the
language-free representation into the target language by the natural language generator. Another design along these
lines was carried out, but only for the extraction of molecular pathways from journal articles (Friedman et al,
2001); the corresponding results demonstrated the value of the underlying techniques for acquiring valuable
knowledge from biological journals. In another development, the use and efficiency of Statistical Machine
Translation (SMT) was demonstrated in a model by Schwenk et al (2008) comprising the open-source Moses decoder, the
integration of a bilingual dictionary, and a continuous-space target language model for a general-purpose
French/English statistical machine translation system. Other efforts involved the introduction of a unified neural
network architecture and learning algorithm by Collobert et al (2011). This technique proved useful in
various Natural Language Processing tasks including part-of-speech tagging, chunking, named entity recognition and
the like. The system learnt internal representations on the basis of vast amounts of mostly unlabeled training data
instead of exploiting man-made input features, and was said to be a basis for building a freely available tagging
system with good performance and minimal computational requirements. William et al (2011) also introduced the
multiscale Geometric Multi-Resolution Analysis (GMRA) for investigating detection, measurement and modeling
techniques that exploit low-dimensional intrinsic structures, with a view to improving procedures such as machine
learning. The results obtained showed that the approximation error of GMRA is completely independent of the ambient
dimension, thus establishing GMRA as a provably fast algorithm for dictionary learning with approximation
guarantees. Folajimi and Isaac (2012) came up with a tool for understanding the Yoruba language using Statistical
Machine Translation (SMT). The software employs a machine translation paradigm where translations are generated on
the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. It was
discovered that SMT is a veritable tool for translating between the English and Yoruba languages despite the
scarcity of parallel corpora between the two. In their work, Kaufmann et al (2016) introduced generic notions of
complexity for two dominant frameworks as an improvement on the stochastic multi-armed bandit model, which proved
significantly positive. Also, Maggioni (2016) extended the work on Geometric Multi-Resolution Analysis (GMRA)
application-wise using non-asymptotic bounds and a robustness procedure.
However, several other research works have been carried out, and the trend still evolves on a daily basis as a
result of the different scopes embedded in the available publications. While some argue the need for totally new
designs, others prefer improving on the techniques demonstrated in existing publications. An example of the latter
was a study to determine the relationship between grammar efficacy and grammar performance among Arabic learners on
aspects such as the correction of grammar errors, the vocalization of words, and the construction of sentences,
carried out through a questionnaire; it was observed that a moderate correlation exists between grammar efficacy and
grammar performance, with the efficacy of sentence construction as the most noticeable result (Mustapha, 2017).
3. SYSTEM DESIGN
3.1 Methodology
The work here is designed to replace the involvement of humans in the translation of English (the source language)
to Arabic (the target language). In most cases, the system needs to emulate the thinking strategy humans use during
translation. The character sets making up the alphabets of both the English and Arabic languages (fig. 1) are to be
identified and analyzed. A database of the English alphabet is to be created to represent the set of characters of
the source language, with those of the Arabic language taken into consideration. Also, the syntaxes of both
languages are to be referenced, with the aid of in-built tools and other program segments where needed, in terms of
their character-wise relationship. The various scenarios in the design of the system were modeled using the Unified
Modeling Language (UML). For instance, the modeling section involving the Use-Case diagram (figure 2) has to do with
demonstrating the relationships among the various entities making up the system in terms of functions and, possibly,
dependencies. A logical way of representing this user-data relationship is also shown in the class diagram (figure
3). Another aspect of this modeling is the procedural flow among the various class objects involved in the
translation mechanism; this scenario is diagrammatically illustrated using the Activity diagram (figure 4).
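To make the character-database idea concrete, the following is a minimal C# sketch of an English-to-Arabic character map. The Arabic equivalents shown are illustrative transliteration choices of ours, not the exact pairings of the paper’s figure 1, and the class name is hypothetical.

using System.Collections.Generic;

public static class AlphabetMap
{
    // Illustrative English-to-Arabic character table; real entries would
    // follow the alphabet correspondence laid out in figure 1.
    public static readonly Dictionary<char, string> EnglishToArabic =
        new Dictionary<char, string>
        {
            { 'a', "ا" },  // alif
            { 'b', "ب" },  // ba
            { 'c', "ك" },  // Arabic has no direct 'c'; kaf is one possible stand-in
            { 'd', "د" },  // dal
            // ... the remaining letters follow the same one-to-one pattern
        };
}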
The program is designed in such a way that the expected set of characters (a word, sentence, etc.) of the language
of interest (the target language) is automatically formed in response to a call (an instruction code) by the user,
with English as the source language and Arabic as the target language. This was made possible with the aid of
Microsoft C#. Several programming languages were considered in the course of designing the system, and many factors
were put into consideration, including database access, data transmission via networks, database security, database
retrieval, multi-user network access and data capture. Microsoft C# (C sharp) was chosen to achieve these
objectives: it is a user-friendly platform that allows the design of an interface that can be modified
programmatically, and the language has the advantages of easy development, flexibility, the ability to provide the
developer/programmer with useful hints, and the production of an attractive graphical interface.
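As a rough illustration only, the one-to-one formation of target characters described above might be sketched in C# as follows; the method and class names are ours, and the sketch assumes the character table introduced earlier.

public static class Translator
{
    // Forms the Arabic equivalent of an English input character by character,
    // reflecting the one-to-one correspondence the design describes.
    public static string ToArabic(string englishText)
    {
        var result = new System.Text.StringBuilder();
        foreach (char c in englishText)
        {
            char key = char.ToLowerInvariant(c);
            if (AlphabetMap.EnglishToArabic.TryGetValue(key, out string arabic))
                result.Append(arabic);
            else
                result.Append(c); // spaces, digits and punctuation pass through
        }
        return result.ToString();
    }
}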
Figure 2: Use-Case Diagram (use cases: send in a text in a language (i.e. English); generate the Arabic equivalent
of the sent text; view the Arabic form of the English text)
Figure 1: English – Arabic Alphabets
3.2 Input Design and Output Specification
It is necessary to note that the data entered into the computer for processing determines what the output will be.
Screen designs are basically made for data entry or capture. With an English text as input, the output section of
the system in this research work is designed to automatically display its result in the Arabic language (based on
the input supplied); i.e. it generates a response immediately after the input is received by the system. However,
the nature of the output strictly depends on that of the input, as well as on the correctness of the code
implementation with respect to the rules governing the writing and transformation of characters and words in the
languages concerned.
3.3 Program Design and Specification
The system’s structure, as shown in the flowchart (fig. 5), has the general form of a task divided into several
sub-tasks which come together to give the solution to the problem, with the translation process as the core stage
(fig. 6). The program is designed with the specification of having two language modules, namely English Language and
Arabic Language.
Figure 5: System Flowchart (start; take in the source-language text; translate to the target language; if there is
more to translate, repeat; otherwise stop)
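The translate-and-repeat loop of figure 5 could be realized along the following lines; this is a hypothetical console sketch for illustration, whereas the actual system presents a graphical interface.

public static class Program
{
    public static void Main()
    {
        string more;
        do
        {
            System.Console.Write("Source text (English): ");
            string source = System.Console.ReadLine() ?? "";
            System.Console.WriteLine("Target text (Arabic): " + Translator.ToArabic(source));
            System.Console.Write("More? (yes/no): ");
            more = System.Console.ReadLine() ?? "no";
        } while (more.Trim().ToLowerInvariant() == "yes");
    }
}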
Figure 6: Translation Process (the input English text passes through the English I-Rules and the English dictionary,
is matched against the Arabic dictionary and the Arabic I-Rules, and emerges as the output Arabic text)
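Read at word level, figure 6 suggests that each English token is first sought in the bilingual dictionaries before any character-level formation is applied. The following is a speculative C# sketch of such a pipeline; the dictionary entries are placeholders and the rule handling is simplified far beyond the paper’s I-Rules.

public static class PhraseTranslator
{
    // Placeholder bilingual dictionary standing in for the English/Arabic
    // dictionaries of figure 6; real entries would be loaded from the database.
    private static readonly System.Collections.Generic.Dictionary<string, string> WordDictionary =
        new System.Collections.Generic.Dictionary<string, string>
        {
            { "book", "كتاب" },
            { "pen", "قلم" },
        };

    public static string Translate(string englishPhrase)
    {
        var output = new System.Collections.Generic.List<string>();
        foreach (string word in englishPhrase.Split(' '))
        {
            // Dictionary lookup first; character-level formation as the fallback.
            output.Add(WordDictionary.TryGetValue(word.ToLowerInvariant(), out string arabic)
                ? arabic
                : Translator.ToArabic(word));
        }
        return string.Join(" ", output);
    }
}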
4. SYSTEM IMPLEMENTATION AND EVALUATION
The procedural steps were properly followed in the development and running of the code, with the required hardware
available as clearly described in the design section. The resulting outputs were very effective with respect to the
translation process. The system’s workability was tested with three different source texts as inputs, and in each
case the result came out accurately. For example, the first three English characters (the alphabets a, b and c) were
the first point of call for translation. This is captured in figure 7, whose upper part shows the inputs (in
English) while the lower part displays the results (in Arabic). This was followed by the translation of some
phrases, as shown in figures 8 and 9. The various displays in the translation exercise also reported the processing
speed in each instance.
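In terms of the sketches above, the character test of figure 7 would correspond to calls of the following kind; the outputs shown merely illustrate the form under our assumed mapping, not the system’s exact results.

// Translating the first three alphabets, as in figure 7.
System.Console.WriteLine(Translator.ToArabic("a"));  // ا
System.Console.WriteLine(Translator.ToArabic("b"));  // ب
System.Console.WriteLine(Translator.ToArabic("c"));  // ك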
Figure 7: Character Translation
Figure 8: Translation: Phrase 1
Figure 9: Translation: Phrase 2
5. FURTHER ISSUES AND CONCLUSION
The parameters of interest in this study, as demonstrated in the various results obtained, show that the system was
simply meant to handle text-to-text conversion between the English and Arabic languages. This showcases the form of
representation one should expect as the Arabic form of a given English text. However, some improvements are clearly
necessary to increase the quality of the study scenario, with a view to addressing the possible meaning of the
Arabic form of any given English text and possibly making the system speech-based; this is simply because of the
demanding nature of the spellings and pronunciations involved in the Arabic language. To achieve this, a robust
system that takes care of text as well as voice is required. This shall take care of the targeted goal in future
designs in this area of Natural Language Processing.
The developed system proved very efficient with respect to its outputs, and the translation was very accurate within
a reasonable time frame. It represents a useful tool for assisting those seeking help with the lexical understanding
of texts between the English and Arabic languages. Understanding how characters in English are written in Arabic,
and vice versa, shall promote familiarity between the two languages, thus assisting stakeholders not just in the
writing of letters but in knowing the exact form of any character and hence how certain letters are to be pronounced
in their respective languages. The different tasks accomplished in the process of designing and implementing the
system were properly displayed in the various figures shown, with the needed explanations included accordingly. The
design here showcases a translation form considered a “one-to-one correspondence”, and so some future work is still
required, as stated earlier, to incorporate more rules as well as functions for ensuring a more robust and more
useful form of language translation.
In theory, familiarity with a language increases one’s comprehension of it; therefore, getting used to the Arabic
form of English letters is expected to create a familiarity that could assist Arabs who sometimes find it difficult
to write or pronounce such texts. This could also be a tool for beginners in the Arabic language.
6. REFERENCES
[1] Albat A. and Fritz T. (2012). Systems and Methods for Automatically Estimating a Translation Time. US Patent
0185235.
[2] Carbonell J. G., Cullingford R. E. and Gershman A. V. (1981). Steps Towards Knowledge-Based Machine Translation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-3, No. 4.
[3] Collobert R., Weston J., Bottou L., Karlen M., Kavukcuoglu K. and Kuksa P. (2011). Natural Language Processing
(Almost) From Scratch. Journal of Machine Learning Research, 12, 2493–2537.
[4] Folajimi Y. O. and Isaac O. (2012). Using Statistical Machine Translation (SMT) as a Language Translation Tool
for Understanding Yoruba Language. EIE 2nd International Conference on Computing, Energy, Networks, Robotics and
Telecommunications (eieCon2012).
[5] Friedman C., Kra P., Yu H., Krauthammer M. and Rzhetsky A. (2001). GENIES: A Natural-Language Processing System
for the Extraction of Molecular Pathways from Journal Articles. Bioinformatics, Oxford University Press, Vol. 17,
Suppl. 1.
[6] Kaufmann E., Cappé O. and Garivier A. (2016). On the Complexity of Best-Arm Identification in Multi-Armed Bandit
Models. Journal of Machine Learning Research, 17, 1–42.
[7] Maggioni M., Minsker S. and Strawn N. (2016). Multiscale Dictionary Learning: Non-Asymptotic Bounds and
Robustness. Journal of Machine Learning Research, 17.
[8] Mustapha N. F. (2017). Grammar Efficacy and Grammar Performance: An Exploratory Study on Arabic Learners.
Mediterranean Journal of Social Sciences, Vol. 8, No. 4.
[9] Nadkarni P., Ohno-Machado L. and Chapman W. (2011). Natural Language Processing: An Introduction. Journal of the
American Medical Informatics Association, 18, 544–551. doi:10.1136/amiajnl-2011-000464.
[10] Schwenk H., Fouet J. and Senellart J. (2008). First Steps towards a General-Purpose French/English Statistical
Machine Translation System. Proceedings of the Third Workshop on Statistical Machine Translation, pages 119–122,
Columbus.
[11] William K., Chen G. and Maggioni M. (2011). Multiscale Geometric Methods for Data Sets II: Geometric
Multi-Resolution Analysis. Cornell University Library (arXiv), New York, USA.