The demand for multilingual information is becoming pressing as the number of internet users throughout the world escalates, creating the problem of retrieving documents in one language by specifying a query in another. This demand can be addressed by designing automatic tools that accept a query in one language and retrieve relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross Language Information Retrieval system using a dictionary-based approach, which enables users to enter a query in Amharic and retrieve relevant documents from an Amharic-Arabic corpus in both Amharic and Arabic.
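A dictionary-based pipeline of this kind can be sketched in a few lines: translate each query term through a bilingual word list, then rank documents by term overlap. The word list and documents below are invented placeholders, not the actual Amharic-Arabic resources used by the prototype:

```python
def translate_query(query_terms, dictionary):
    """Replace each source-language term with its dictionary translations;
    terms missing from the dictionary are passed through unchanged."""
    translated = []
    for term in query_terms:
        translated.extend(dictionary.get(term, [term]))
    return translated

def retrieve(query_terms, documents):
    """Rank documents by how many query terms they contain (simple overlap)."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.split())
        score = sum(1 for t in query_terms if t in tokens)
        if score:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]

# Hypothetical romanized Amharic -> Arabic entries, for illustration only.
BILINGUAL_DICT = {"selam": ["salam"], "metsaf": ["kitab"]}
```

A real system would add stemming and disambiguation on both sides; this sketch only shows the dictionary-lookup core.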
A Review on the Cross and Multilingual Information Retrieval (dannyijwest)
In this paper we explore some of the most important areas of information retrieval, in particular Cross-lingual Information Retrieval (CLIR) and Multilingual Information Retrieval (MLIR). CLIR deals with asking questions in one language and retrieving documents in a different language. MLIR deals with asking questions in one or more languages and retrieving documents in one or more different languages. With an increasingly globalized economy, the ability to find information in other languages is becoming a necessity. We also present the evaluation initiatives of the information retrieval domain. Finally, we present an overall review of research works in Indian and foreign languages.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL... (ijaia)
Many automatic translation works have addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair due to its parallel data scarcity; there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic-language translation available on Tanzil. Experiments are carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM-based and GRU-based NMT models and the Google Translate system are compared, and the LSTM-based OpenNMT is found to outperform the GRU-based OpenNMT and Google Translate, with BLEU scores of 12%, 11%, and 6% respectively.
This document summarizes research on models for classifying Arabic text. It discusses how text classification can organize large amounts of Arabic text by categorizing documents. The document reviews several studies that have applied algorithms such as Naive Bayes, K-Nearest Neighbor, and Support Vector Machines to classify Arabic texts, with accuracies ranging from 62.7% to 91%. It also outlines some of the linguistic challenges of classifying Arabic, which has a complex orthography compared to languages like English. Finally, it provides a brief overview of common text classification steps such as preprocessing, feature extraction and evaluation, and of machine learning versus rule-based approaches.
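As one illustration of the machine-learning side, a multinomial Naive Bayes classifier of the kind these studies apply can be written directly from word counts with Laplace smoothing. The tiny training set below is an invented placeholder, not data from the reviewed studies:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, label).
    Returns log-priors and Laplace-smoothed per-class log-likelihoods."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    model = {"vocab": vocab, "priors": {}, "likelihoods": {}}
    for label, n in class_docs.items():
        model["priors"][label] = math.log(n / total_docs)
        total_words = sum(word_counts[label].values())
        model["likelihoods"][label] = {
            w: math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
            for w in vocab
        }
    return model

def classify_nb(model, tokens):
    """Pick the label maximizing log-prior + summed word log-likelihoods."""
    best_label, best_score = None, float("-inf")
    for label, prior in model["priors"].items():
        score = prior + sum(model["likelihoods"][label][w]
                            for w in tokens if w in model["vocab"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```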
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL (IJCI JOURNAL)
Nowadays, the number of web users accessing information over the Internet is increasing day by day. A huge amount of information on the Internet is available in different languages and can be accessed by anybody at any time. Information Retrieval (IR) deals with finding useful information in a large collection of unstructured, structured and semi-structured data. Information retrieval can be classified into different classes, such as monolingual information retrieval, cross language information retrieval (CLIR) and multilingual information retrieval (MLIR). In the current scenario, the diversity of information and language barriers are serious issues for communication and cultural exchange across the world. To overcome such barriers, cross language information retrieval systems are nowadays in strong demand. CLIR refers to information retrieval activities in which the query and the documents may appear in different languages. This paper takes an overview of the new application areas of CLIR and reviews the approaches used in CLIR research for query and document translation. Further, based on the available literature, a number of challenges and issues in CLIR are identified and discussed.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis, because each character of the alphabet has different presentational forms. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs; as usual, these rules vary from language to language and are subject to interpretation by software developers.
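The joining rules can be illustrated by choosing a positional form (isolated, initial, medial, final) for each character from the joining behavior of its neighbours. The two-class letter table below is a deliberately simplified invention; real Arabic-script shaping distinguishes more joining classes and many more letters:

```python
# Simplified contextual-analysis sketch: each letter is either "dual"
# (joins on both sides) or "right" (joins only to the preceding letter).
# This two-class table is illustrative; real shaping tables are richer.
JOIN_CLASS = {"b": "dual", "t": "dual", "a": "right", "d": "right"}

def shape_word(word):
    """Return the positional form (isolated/initial/medial/final) per letter."""
    forms = []
    for i, ch in enumerate(word):
        joins_prev = i > 0 and JOIN_CLASS.get(word[i - 1]) == "dual"
        joins_next = i < len(word) - 1 and JOIN_CLASS.get(ch) == "dual"
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_next:
            forms.append("initial")
        elif joins_prev:
            forms.append("final")
        else:
            forms.append("isolated")
    return forms
```

A renderer would then map each (letter, form) pair to the glyph for that presentational form.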
ATAR: Attention-based LSTM for Arabizi transliteration (IJECEIAES)
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale “Arabizi to Arabic script” parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
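ATAR itself is a neural model; for contrast, the kind of rule-based baseline it improves on can be sketched as a greedy longest-match character mapping. The table below is a small illustrative subset of common Arabizi conventions, not the paper's mapping:

```python
# Naive rule-based Arabizi -> Arabic script baseline (greedy longest match).
# A small illustrative subset of common Arabizi conventions; a neural model
# like ATAR learns such mappings, plus context, from data instead.
ARABIZI_MAP = {
    "sh": "\u0634",  # sheen
    "gh": "\u063a",  # ghain
    "3": "\u0639",   # ain
    "7": "\u062d",   # hah
    "b": "\u0628",
    "a": "\u0627",
    "l": "\u0644",
    "m": "\u0645",
    "r": "\u0631",
    "h": "\u0647",
}

def transliterate(text):
    out, i = [], 0
    while i < len(text):
        # Try the longest mapping first so digraphs beat single letters.
        for length in (2, 1):
            chunk = text[i:i + length]
            if chunk in ARABIZI_MAP:
                out.append(ARABIZI_MAP[chunk])
                i += length
                break
        else:
            out.append(text[i])  # pass unknown characters through
            i += 1
    return "".join(out)
```

The weakness of such rules, and the motivation for a learned model, is that Arabizi spelling is inconsistent and context-dependent.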
Hybrid approaches for automatic vowelization of Arabic texts (ijnlc)
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is made up of two modules. In the first, a morphological analysis of the text words is performed using the open-source morphological analyzer AlKhalil Morpho Sys; the outputs for each word, analyzed out of context, are its different possible vowelizations. The integration of this analyzer into our vowelization system required the addition of a lexical database containing the most frequent words in the Arabic language. Using a statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the vowelized words are the hidden states. The observed states of the second HMM are identical to those of the first, but the hidden states are the lists of possible diacritics of each word without its Arabic letters. Our system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho Sys. Our approach opens an important way to improve the performance of automatic vowelization of Arabic texts for other uses in automatic natural language processing.
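The disambiguation step rests on standard Viterbi decoding: given start, transition and emission probabilities, find the most probable hidden-state path for an observation sequence. A generic implementation follows; the toy parameters in the usage below are invented, not the probabilities trained in the paper:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for an observation sequence.
    start_p[s], trans_p[prev][s] and emit_p[s][obs] are probabilities."""
    # Each cell holds (best probability so far, best path ending in that state).
    V = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
          for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(obs, 0.0),
                 V[-1][prev][1])
                for prev in states)
            layer[s] = (prob, path + [s])
        V.append(layer)
    best_prob, best_path = max(V[-1].values())
    return best_path
```

In the paper's setting the observed states are unvowelized words and the hidden states are vowelized words (or diacritic lists), with the analyzer's outputs constraining the candidate hidden states.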
Summer Research Project (Anusaaraka) Report (Anwar Jameel)
This document discusses Anusaaraka, a machine translation tool being developed to translate between English and Hindi. It uses principles from Panini's grammar to map word groups and constructions between the languages. Where differences exist, extra notation is added to preserve source language information. The output is presented in layers to show the translation process. It aims to bridge the language barrier by allowing users to access text in their preferred Indian language.
Key features include faithfully representing the source text, reversibility of the translation process through layered output, and transparency by allowing users to trace the translation steps. It was developed by combining traditional Indian linguistic principles with modern technologies.
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES (cscpconf)
This document presents a technique for identifying scripts (Tamil, English, Hindi) and numerals from multilingual document images using a rule-based classifier. Words are segmented and the first character of each word is represented as a 9-bit vector based on features like density, shape, and transitions. A rule-based classifier containing rules derived from training data is used to classify the script of each character. The technique aims to automatically categorize multilingual documents before applying optical character recognition and requires minimal preprocessing with high accuracy.
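The shape of such a classifier (binary features packed into a vector, matched against rules learned from training data) can be sketched as follows. The three features and the rule table below are invented placeholders; the paper derives its own nine bits (density, shape, transitions) and its own rules:

```python
# Sketch of a bit-vector rule classifier over character images.
# Features and rules here are illustrative stand-ins for the paper's
# 9-bit vectors and training-derived rules.
RULES = {
    (1, 0, 1): "Tamil",
    (0, 1, 1): "English",
    (1, 1, 0): "Hindi",
}

def feature_vector(char_image):
    """char_image: 2D list of 0/1 pixels. Compute three toy binary features."""
    pixels = sum(sum(row) for row in char_image)
    area = len(char_image) * len(char_image[0])
    dense = int(pixels * 2 > area)                      # density feature
    top_heavy = int(sum(char_image[0]) > sum(char_image[-1]))  # shape feature
    has_transition = int(any(row[0] != row[-1] for row in char_image))
    return (dense, top_heavy, has_transition)

def classify_script(char_image):
    """Look the feature vector up in the rule table."""
    return RULES.get(feature_vector(char_image), "unknown")
```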
Language Identifier for Languages of Pakistan Including Arabic and Persian (Waqas Tariq)
A language recognizer/identifier/guesser is a basic application for identifying the language of a text document. It takes a file as input and, after processing its text, decides the language of the document using three successive methods, LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; it is strengthened by LIJ-II, and accuracy is boosted further by LIJ-III. The system also calculates digram probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between character sets. A Java-based language recognizer is developed and presented in this paper in detail.
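The internals of LIJ-I/II/III are not given in the abstract; the digram-probability idea itself can be sketched generically. A profile of adjacent-character-pair frequencies is built per language, and an input text is scored against each profile (the sample profiles below are made up for illustration):

```python
from collections import Counter

def digram_profile(text):
    """Relative frequency of each adjacent character pair in the text."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {d: n / total for d, n in pairs.items()}

def identify_language(text, profiles):
    """Score the text's digram profile against each language profile and
    return the best-matching language."""
    observed = digram_profile(text)
    def score(profile):
        return sum(freq * profile.get(d, 0.0) for d, freq in observed.items())
    return max(profiles, key=lambda lang: score(profiles[lang]))
```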
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... (Syeful Islam)
Hundreds of millions of people of almost all levels of education and attitudes from different countries communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and there is a huge number of different naming entities in Bangla; thus it is difficult to decide whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible while storing minimal words in the dictionary.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM (ijnlc)
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
Error Analysis of Rule-based Machine Translation Outputs (Parisa Niksefat)
Rule-based machine translation systems were evaluated based on errors in translations from English to Persian. Several error categories were identified including syntactic errors (word order, missing words, parts of speech), unknown words, and semantic errors (incorrect words, idiomatic expressions). Three texts (a short story, user guide, and magazine article) were translated using two machine translation systems and analyzed sentence-by-sentence to identify errors according to the defined categories.
Design of A Spell Corrector For Hausa Language (Waqas Tariq)
In this article, a spell corrector is designed for the Hausa language, the second most spoken language in Africa, which does not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of Hausa. The corrector operates essentially on Mijinguini's dictionary and the characteristics of the Hausa alphabet. After a brief review of spell checking and spell correcting techniques and the state of the art in Hausa language processing, we opted for the trie and hash table data structures to represent the dictionary. The edit distance and the specificities of the Hausa alphabet are used to detect and correct spelling errors. The spell corrector is implemented both in a special editor developed for that purpose (LyTexEditor) and as an extension (add-on) for OpenOffice.org. A comparison was made of the performance of the two data structures used.
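The edit-distance step at the heart of such a corrector is the classic Levenshtein dynamic program: rank dictionary words by their edit distance from the misspelled word. The tiny word list below is an invented stand-in for Mijinguini's dictionary:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via a rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Return dictionary words within max_dist edits, closest first."""
    candidates = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(candidates) if d <= max_dist]
```

A trie representation, as chosen in the paper, lets the same computation prune whole dictionary branches instead of scoring every word.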
Hybrid part-of-speech tagger for non-vocalized Arabic text (ijnlc)
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine-translation-based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done in CLIR and MT systems for the Marathi language in India.
Machine translation with statistical approach (vini89)
Machine translation systems aim to automatically translate text from one human language to another. Statistical machine translation approaches view every sentence in one language as a potential translation of any sentence in the other language, and assign a probability to every pair of sentences indicating how likely one is to be a translation of the other. Statistical machine translation considers individual sentences without context; each has many acceptable translations, and the choice among them is a matter of taste.
Development of Bi-Directional English To Yoruba Translator for Real-Time Mobi... (CSCJournals)
- The document describes the development of a bi-directional English to Yoruba translator for real-time mobile chatting.
- It uses a hybrid machine translation approach combining rule-based and corpus-based methods. Parallel corpora and computational rules were developed and stored in databases. Dictionaries containing words in both languages and their parts of speech were also created.
- The system was implemented using PHP for server-side scripting, JSON for data interchange, and an Android app for the user interface. Testing showed 95% accuracy compared to Google Translate.
Machine translation from English to Hindi (Rajat Jain)
Machine translation is a part of natural language processing. The suggested algorithm is word-based, and translation is performed from English to Hindi.
Submitted by:
Garvita Sharma, 10103467, B3
Rajat Jain, 10103571, B6
A SURVEY OF LANGUAGE-DETECTION, FONT-DETECTION AND FONT-CONVERSION SYSTEMS FOR... (IJCI JOURNAL)
A large amount of the data in Indian languages stored digitally is in ASCII-based font formats. ASCII has a 128-character set and is therefore unable to represent all the characters necessary to deal with the variety of scripts available worldwide. Moreover, these ASCII-based fonts do not share a single standard mapping between character codes and individual characters for a particular Indian script, unlike English-language fonts, which are based on the standard ASCII mapping. Therefore, the fonts for a particular script must be available on the system to accurately represent data in that script, and converting data from one font into another is a difficult task. The non-standard ASCII-based fonts also pose problems when searching texts in Indian languages available over the web. There are 25 official languages in India, and the amount of digital text available in ASCII-based fonts is much larger than the text available in the standard ISCII (Indian Script Code for Information Interchange) or Unicode formats. This paper discusses the work done in the fields of font detection (identifying the font of a given text) and font conversion (converting ASCII-format text into the corresponding Unicode text).
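At its core, a font converter of the kind surveyed is a per-font mapping table from legacy byte codes to Unicode code points, which is exactly why font detection must come first: each legacy font needs its own table. The toy table below is invented; real tables are font-specific:

```python
# Sketch of ASCII-font -> Unicode conversion via a per-font mapping table.
# The mapping below is an invented stand-in for a legacy Devanagari font.
LEGACY_TO_UNICODE = {
    "k": "\u0915",   # DEVANAGARI LETTER KA
    "K": "\u0916",   # DEVANAGARI LETTER KHA
    "g": "\u0917",   # DEVANAGARI LETTER GA
    " ": " ",
}

def convert(text, table):
    """Map each legacy character; mark unknown characters for manual review."""
    return "".join(table.get(ch, "\ufffd") for ch in text)
```

Real converters also reorder characters where the legacy font's visual order differs from Unicode's logical order; that step is omitted here.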
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach (IJERA Editor)
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5000 languages in the world. Knowing all of these languages is not a solution to the problems caused by language barriers in communication. In this multilingual world, with huge amounts of information exchanged between various regions and in different languages in digitized form, it has become necessary to find an automated process to convert from one language to another. Natural Language Processing (NLP) is one of the hot areas of research that explores how computers can be utilized to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterate proper nouns from Punjabi to Hindi is developed. The hybrid approach is a combination of direct mapping, a rule-based approach and Statistical Machine Translation (SMT). The proposed system is tested on proper nouns from different domains, and its accuracy is very good.
The document discusses different approaches to machine translation, including rule-based, statistical, example-based, and dictionary-based approaches. It provides details on each approach: rule-based methods use linguistic rules and extensive lexicons, statistical methods rely on probabilistic models trained on parallel texts, example-based methods translate by analogy to examples in aligned corpora, and dictionary-based methods translate words directly, with or without morphological analysis. The document also compares transfer-based and interlingual rule-based machine translation, noting that interlingual methods aim to represent the source text independently of any particular language.
1) The document discusses the history and evolution of machine translation, from human translation to current automatic systems. It covers early rule-based, statistical, and example-based systems as well as current hybrid approaches.
2) Key challenges in machine translation include accurately handling syntax, semantics, idioms and context in different languages. Early systems lacked fluency or mishandled exceptions.
3) The future aims for fully automatic translation of any text or speech with full meaning conveyed, allowing improved global communication and cultural understanding.
Marathi Text-To-Speech Synthesis using Natural Language Processing (iosrjce)
Improving performance of English Hindi cross language information retrieval u... (ijnlc)
The main issue in Cross Language Information Retrieval (CLIR) is the poor performance of retrieval, in terms of average precision, compared to monolingual retrieval. The main reasons behind the poor performance of CLIR are mismatching of query terms, lexical ambiguity and untranslated query terms. These existing problems need to be addressed in order to increase the performance of CLIR systems. In this paper, we address them by proposing an algorithm for improving the performance of an English-Hindi CLIR system. We generate all possible combinations of the Hindi translated query, using transliteration of the English query terms, and choose the best query among them for the retrieval of documents. The experiment is performed on FIRE 2010 (Forum for Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach improves the performance of English-Hindi CLIR, helps overcome the existing problems, and outperforms the existing English-Hindi CLIR system in terms of average precision.
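Generating every combination of per-term translation candidates before picking a best query is a Cartesian product. A minimal sketch, with invented candidate lists and a caller-supplied scoring function standing in for whatever selection criterion a system uses:

```python
from itertools import product

def query_combinations(term_candidates):
    """term_candidates: one list of candidate translations per query term.
    Returns every full query that picks one candidate per term."""
    return [list(combo) for combo in product(*term_candidates)]

def best_query(term_candidates, score):
    """Pick the combination maximizing an externally supplied scoring
    function (e.g. a retrieval-quality estimate)."""
    return max(query_combinations(term_candidates), key=score)
```

Note the product grows exponentially in the number of ambiguous terms, so real systems typically cap or prune the candidate lists.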
Mediterranean Arabic Language and Speech Technology Resources (Hend Al-Khalifa)
The MEDAR survey collected responses from 57 players involved in human language technologies and language resources for Arabic. The responses came from a variety of countries, with the highest numbers coming from Egypt (11), Morocco (10), and West Bank & Gaza Strip (9). The majority of respondents (36) answered on behalf of institutions, while 17 answered as independent experts. The survey gathered information on the respondents' profiles, language resources, needs for resources, and market information to understand the current state of language technologies for Arabic.
A SURVEY OF MARKOV CHAIN MODELS IN LINGUISTICS APPLICATIONS (csandit)
Markov chain theory is an important tool in applied probability that is quite useful in modeling real-world computing applications. For a long time, researchers have used Markov chains for data modeling in a wide range of applications belonging to different fields such as computational linguistics, image processing, communications, bioinformatics, finance systems, etc. This paper explores Markov chain theory and its extension, hidden Markov models (HMMs), in natural language processing (NLP) applications. It also presents some aspects related to Markov chains and HMMs, such as creating transition matrices, calculating data sequence probabilities, and extracting the hidden states.
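Two of the aspects mentioned, building a transition matrix and scoring a sequence, can be sketched directly; the toy weather sequence below is a standard textbook illustration, not data from the survey:

```python
from collections import Counter, defaultdict

def transition_matrix(sequence):
    """Estimate first-order transition probabilities from an observed
    state sequence by counting adjacent pairs and normalizing rows."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    matrix = {}
    for state, row in counts.items():
        total = sum(row.values())
        matrix[state] = {t: n / total for t, n in row.items()}
    return matrix

def sequence_probability(seq, matrix, start_prob):
    """P(seq) = P(first state) * product of transition probabilities."""
    p = start_prob.get(seq[0], 0.0)
    for cur, nxt in zip(seq, seq[1:]):
        p *= matrix.get(cur, {}).get(nxt, 0.0)
    return p
```

The HMM extension adds emission probabilities on top of this, with hidden-state extraction handled by decoding algorithms such as Viterbi.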
ALTERNATIVES TO BETWEENNESS CENTRALITY: A MEASURE OF CORRELATION COEFFICIENT (csandit)
In this paper, we measure and analyze the correlation of betweenness centrality (BWC) with five centrality measures: eigenvector centrality (EVC), degree centrality (DEG), clustering coefficient centrality (CCC), farness centrality (FRC), and closeness centrality (CLC). We simulate the evolution of random networks and small-world networks to test the correlation between BWC and the five measures. Additionally, nine real-world networks are included in the study to further examine the correlation. We find that DEG is highly correlated with BWC in most cases and can serve as an alternative to the computationally expensive BWC. Moreover, EVC, CLC and FRC are also good candidates to replace BWC on random networks. Although the correlation is not perfect for all the real-world networks, there still exists a relatively good correlation between BWC and three other measures (CLC, FRC and EVC) on some networks. Our findings can help us understand how BWC correlates with other centrality measures and when to choose a good alternative to BWC.
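The analysis reduces, per network, to computing a correlation coefficient between the nodes' BWC values and each alternative measure; a plain Pearson implementation suffices (the values in the usage below are invented, not centrality data from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A high coefficient between per-node DEG and BWC values is what justifies substituting the cheap measure for the expensive one.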
We present a framework that combines machine learnt classifiers and taxonomies of topics to enable a more conceptual analysis of a corpus than can be accomplished using Vector Space Models and Latent Dirichlet Allocation based topic models which represent documents purely
in terms of words. Given a corpus and a taxonomy of topics, we learn a classifier per topic and annotate each document with the topics covered by it. The distribution of topics in the corpus can then be visualized as a function of the attributes of the documents. We apply this framework to the US State of the Union and presidential election speeches to observe how topics such as jobs and employment have evolved from being relatively unimportant to being the most discussed topic. We show that our framework is better than Vector Space Models and a Latent Dirichlet Allocation based topic model for performing certain kinds of analysis.
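The annotate-then-aggregate idea can be illustrated with trivially simple stand-in classifiers. The real per-topic classifiers are machine-learnt; the keyword sets, topics, and two-speech corpus below are invented for illustration:

```python
# Hypothetical keyword "classifiers" standing in for the learnt per-topic models.
TOPIC_KEYWORDS = {
    "jobs":    {"jobs", "employment", "unemployment", "workers"},
    "defense": {"war", "military", "troops", "defense"},
}

def annotate(document):
    """Return the set of topics whose classifier fires on the document."""
    words = set(document.lower().split())
    return {topic for topic, kw in TOPIC_KEYWORDS.items() if words & kw}

def topic_distribution(corpus):
    """corpus: list of (year, text) pairs. Returns per-year topic counts."""
    dist = {}
    for year, text in corpus:
        for topic in annotate(text):
            dist.setdefault(year, {}).setdefault(topic, 0)
            dist[year][topic] += 1
    return dist

corpus = [
    (1941, "our troops are at war"),
    (2012, "creating jobs and reducing unemployment"),
]
```

Aggregating the annotations by a document attribute (here, year) is what lets topic prevalence be plotted over time.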
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGEScscpconf
This document presents a technique for identifying scripts (Tamil, English, Hindi) and numerals from multilingual document images using a rule-based classifier. Words are segmented and the first character of each word is represented as a 9-bit vector based on features like density, shape, and transitions. A rule-based classifier containing rules derived from training data is used to classify the script of each character. The technique aims to automatically categorize multilingual documents before applying optical character recognition and requires minimal preprocessing with high accuracy.
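A rule-based classifier over 9-bit feature vectors can be sketched as pattern matching. The bit positions and rules below are made up; the paper derives its rules (over density, shape, and transition features) from training data:

```python
# Hypothetical rules: each pattern is 9 characters, one per feature bit,
# where '?' matches either bit value. Rules are tried in order.
RULES = [
    ("Tamil",   "1??0?????"),
    ("Hindi",   "0?1??????"),
    ("English", "000??????"),
]

def matches(pattern, bits):
    """True if every non-wildcard position of the pattern equals the bit."""
    return all(p in ("?", b) for p, b in zip(pattern, bits))

def classify(bits):
    """Classify a 9-bit feature vector of a word's first character."""
    for script, pattern in RULES:
        if matches(pattern, bits):
            return script
    return "unknown"
```

Classifying only the first character of each word, as the paper does, keeps the per-word cost to one table scan.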
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
A language recognizer/identifier/guesser is a basic application used to identify the language of a text document. It simply takes a file as input and, after processing its text, decides the language of the text document with precision using LIJ-I, LIJ-II and LIJ-III. LIJ-I results in poor accuracy; it is strengthened by the use of LIJ-II, which is further boosted towards a higher level of accuracy by the use of LIJ-III. It also helps in calculating the probability of digrams and the average percentages of accuracy. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences. A Java-based language recognizer is developed and presented in this paper in detail.
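A digram-probability guesser of the kind described can be sketched as follows; the training samples are tiny and illustrative, where a real system trains profiles on large corpora:

```python
from collections import Counter

def digrams(text):
    """Multiset of overlapping two-character sequences."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def digram_profile(sample):
    """Relative digram frequencies of a training sample."""
    counts = digrams(sample)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def guess(text, profiles):
    """Score the text against each language profile; the highest overlap wins."""
    seen = digrams(text)
    scores = {lang: sum(prof.get(d, 0.0) * c for d, c in seen.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

# Tiny, hypothetical training samples:
profiles = {
    "english": digram_profile("the quick brown fox jumps over the lazy dog the end"),
    "german":  digram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
```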
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
Hundreds of millions of people of almost all levels of education and attitudes from different countries communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in the English language because such entities start with a capital letter. In Bangla we do not have the concept of small or capital letters, and there is a huge number of different naming entities available in Bangla. Thus we find it difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible with minimal words stored in the dictionary.
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
The document describes a machine transliteration system that transliterates Hindi and Marathi names and words to English using support vector machines (SVM). It segments source language names into phonetic units, and trains an SVM classifier using phonetic units and n-grams as features to label each unit with its English transliteration. The system achieves good accuracy for Hindi-English and Marathi-English transliteration.
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
Rule-based machine translation systems were evaluated based on errors in translations from English to Persian. Several error categories were identified including syntactic errors (word order, missing words, parts of speech), unknown words, and semantic errors (incorrect words, idiomatic expressions). Three texts (a short story, user guide, and magazine article) were translated using two machine translation systems and analyzed sentence-by-sentence to identify errors according to the defined categories.
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language, which is the second most spoken language in Africa and does not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector designed operates essentially on Mijinguini's dictionary and the characteristics of the Hausa alphabet. After a brief review of spell checking and spell correcting techniques and the state of the art in Hausa language processing, we opted for the trie and hash table data structures to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The implementation of the spell corrector has been made on a special editor developed for that purpose (LyTexEditor) and also as an extension (add-on) for OpenOffice.org. A comparison was made of the performance of the two data structures used.
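The edit-distance core of such a corrector is standard. A minimal sketch with a hypothetical mini-dictionary (not Mijinguini's, and ignoring the Hausa-specific alphabet rules the paper exploits):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_dist=2):
    """Rank dictionary words by edit distance to the misspelled word."""
    scored = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_dist]

# Hypothetical mini-dictionary of common Hausa words:
hausa = ["gida", "ruwa", "yaro", "mota", "littafi"]
```

In the paper's design, the trie or hash table answers the "is this word correct?" lookup, and edit distance like the above ranks the correction candidates.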
Hybrid part of-speech tagger for non-vocalized arabic textijnlc
The document presents a hybrid part-of-speech tagging method for Arabic text that combines rule-based and statistical approaches. Rule-based tagging alone can misclassify words and leave some untagged, so the method integrates it with a Hidden Markov Model tagger. The hybrid approach is evaluated on two Arabic corpora and achieves accuracy rates of 97.6% and 98%, outperforming the individual rule-based and HMM taggers.
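The hybrid idea, applying deterministic rules first and falling back to a statistical tagger for uncovered words, can be sketched as below. The suffix rules and the unigram stand-in for the HMM tagger are illustrative, not the paper's:

```python
# Simplified sketch: rules first, statistical fallback second.
SUFFIX_RULES = [("tion", "NOUN"), ("ly", "ADV"), ("ing", "VERB")]
STAT_MODEL = {"book": "NOUN", "run": "VERB"}   # stand-in for the HMM tagger

def rule_tag(word):
    """Deterministic tagging; returns None when no rule applies."""
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return None

def hybrid_tag(sentence):
    """Tag with rules; fall back to the statistical model for uncovered words."""
    tags = []
    for word in sentence.split():
        tag = rule_tag(word) or STAT_MODEL.get(word, "UNK")
        tags.append((word, tag))
    return tags
```

The fallback is what lets the combined tagger cover words the rules misclassify or leave untagged, which is exactly the gap the paper's HMM component fills.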
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of user’s query. This helps users to express the information need in their native languages. Machine translation based (MT-based)
approach of CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done in CLIR and MT systems for Marathi language in India.
Machine translation with statistical approachvini89
Machine translation systems aim to automatically translate text from one human language to another. Statistical machine translation approaches view every sentence in one language as a potential translation of any sentence in the other language. They assign a probability to every pair of sentences to indicate how likely one is a translation of the other. Statistical machine translation considers individual sentences without context and has many acceptable translations for each, with the choice being a matter of taste.
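The pair-scoring idea can be illustrated with an IBM-Model-1-style sketch: each source word contributes the probability of its best-matching target word, and the candidate translation with the highest total score wins. The French-English translation table and probabilities are invented:

```python
import math

# Toy word-translation table t(f|e); the probabilities are illustrative only.
T = {
    ("maison", "house"): 0.8, ("maison", "home"): 0.2,
    ("bleue", "blue"): 0.9,
}

def score(source_words, target_words):
    """Log-probability score: align each source word to its best target word."""
    logp = 0.0
    for f in source_words:
        best = max((T.get((f, e), 1e-6) for e in target_words), default=1e-6)
        logp += math.log(best)   # small floor avoids log(0) for unseen pairs
    return logp

candidates = [["blue", "house"], ["blue", "home"]]
best = max(candidates, key=lambda e: score(["maison", "bleue"], e))
```

A full SMT system learns the table from parallel text and adds a language model, but the "probability of a sentence pair" framing is the same.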
Development of Bi-Directional English To Yoruba Translator for Real-Time Mobi...CSCJournals
- The document describes the development of a bi-directional English to Yoruba translator for real-time mobile chatting.
- It uses a hybrid machine translation approach combining rule-based and corpus-based methods. Parallel corpora and computational rules were developed and stored in databases. Dictionaries containing words in both languages and their parts of speech were also created.
- The system was implemented using PHP for server-side scripting, JSON for data interchange, and an Android app for the user interface. Testing showed 95% accuracy compared to Google Translate.
Machine translation from English to HindiRajat Jain
Machine translation is a part of natural language processing. The algorithm suggested is a word-based algorithm. We have done translation from English to Hindi.
Submitted by Garvita Sharma (10103467, B3) and Rajat Jain (10103571, B6).
A SURVEY OF LANGUAGE-DETECTION, FONT-DETECTION AND FONT-CONVERSION SYSTEMS FOR...IJCI JOURNAL
A large amount of data in Indian languages stored digitally is in ASCII-based font formats. ASCII has a 128-character set and is therefore unable to represent all the characters necessary to deal with the variety of
scripts available worldwide. Moreover, these ASCII-based fonts are not based on a single standard
mapping between the character-codes and the individual characters, for a particular Indian script, unlike
the English language fonts based on the standard ASCII mapping. Therefore, it is required that the fonts for
a particular script must be available on the system to accurately represent the data in that script. Also, the
conversion of data in one font into another is a difficult task. The non-standard ASCII-based fonts also pose
problems in performing searches on texts in Indian languages available over the web. There are 25 official
languages in India, and the amount of digital text available in ASCII-based fonts is much larger than the
text available in the standard ISCII (Indian Script Code for Information Interchange) or Unicode formats.
This paper discusses the work done in the field of font-detection (to identify the font of the given text) and
font-converters (to convert the ASCII-format text into the corresponding Unicode text).
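A font converter of the kind surveyed is essentially a per-font mapping table applied character by character. The legacy-glyph-to-Devanagari mapping below is hypothetical; building the correct table for each proprietary ASCII font is precisely what makes the problem hard:

```python
# Hypothetical table: in some legacy font, the byte for 'A' renders the
# Devanagari letter KA, 'B' renders KHA, and so on. Every proprietary
# font needs its own table.
ASCII_TO_UNICODE = {
    "A": "\u0915",   # DEVANAGARI LETTER KA
    "B": "\u0916",   # DEVANAGARI LETTER KHA
}

def convert(text, table):
    """Replace each legacy-font code with its Unicode character; keep unknowns."""
    return "".join(table.get(ch, ch) for ch in text)
```

Font *detection*, the other half of the survey, amounts to deciding which such table applies to a given text before conversion.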
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid ApproachIJERA Editor
The language is an effective medium for the communication that conveys the ideas and expression of the human
mind. There are more than 5000 languages in the world for communication. Knowing all these languages is not a solution to the problems caused by the language barrier in communication. In this multilingual world, with the
huge amount of information exchanged between various regions and in different languages in digitized format,
it has become necessary to find an automated process to convert from one language to another. Natural Language Processing (NLP) is one of the hot areas of research that explores how computers can be utilized to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to
transliterate the proper nouns from Punjabi to Hindi is developed. Hybrid approach in the proposed system is a
combination of Direct Mapping, Rule based approach and Statistical Machine Translation approach (SMT).
Proposed system is tested on various proper nouns from different domains and accuracy of the proposed system
is very good.
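The direct-mapping component of such a hybrid can exploit the fact that the Gurmukhi (U+0A00-U+0A7F) and Devanagari (U+0900-U+097F) Unicode blocks share the ISCII layout, so many characters map by a fixed offset. This sketches only that first stage; the rule-based and SMT stages handle the cases where the blocks do not align:

```python
# Gurmukhi and Devanagari blocks are laid out in parallel (ISCII heritage),
# so e.g. Gurmukhi KA (U+0A15) maps to Devanagari KA (U+0915) by offset 0x100.
OFFSET = 0x0A00 - 0x0900

def direct_map(punjabi_text):
    """Map Gurmukhi code points to their Devanagari counterparts; pass others through."""
    out = []
    for ch in punjabi_text:
        if 0x0A00 <= ord(ch) <= 0x0A7F:
            out.append(chr(ord(ch) - OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```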
The document discusses different approaches to machine translation, including rule-based, statistical, example-based, and dictionary-based approaches. It provides details on each approach, such as rule-based methods using linguistic rules and extensive lexicons, statistical methods relying on probabilistic models trained on parallel texts, example-based methods translating by analogy to examples in aligned corpora, and dictionary-based methods translating words directly with or without morphological analysis. The document also compares transfer-based and interlingual rule-based machine translation, noting interlingual methods aim to represent the source text independently of languages.
1) The document discusses the history and evolution of machine translation from human translation to current automatic systems. It covers early rule-based, statistical, and example-based systems as well as current hybrid approaches.
2) Key challenges in machine translation include accurately understanding syntax, semantics, idioms and context in different languages. Early systems lacked fluency or made exceptions.
3) The future aims for fully automatic translation between any texts or speech with full meaning conveyed, allowing improved global communication and cultural understanding.
Marathi Text-To-Speech Synthesis using Natural Language Processingiosrjce
Improving performance of english hindi cross language information retrieval u...ijnlc
The main issue in Cross Language Information Retrieval (CLIR) is the poor performance of retrieval in
terms of average precision when compared to monolingual retrieval performance. The main reasons
behind poor performance of CLIR are mismatching of query terms, lexical ambiguity and un-translated
query terms. The existing problems of CLIR need to be addressed in order to increase the
performance of the CLIR system. In this paper, we attempt to solve this problem by proposing an algorithm for improving the performance of the English-Hindi CLIR system. We use all possible combinations of the Hindi translated query, obtained by transliterating the English query terms, and choose the best query among them for retrieval of documents. The experiment is performed on the FIRE 2010 (Forum for Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach improves the performance of the English-Hindi CLIR system, helps overcome the existing problems, and outperforms the existing English-Hindi CLIR system in terms of average precision.
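The select-the-best-candidate idea can be sketched as follows: enumerate every combination of per-term translations and keep the one whose terms are best attested in the target-language corpus. The translations and frequencies below are invented, and the real system scores candidates against the FIRE collection rather than a toy frequency table:

```python
from itertools import product

# Hypothetical per-term Hindi candidates (romanized here) and corpus frequencies.
TRANSLATIONS = {
    "river": ["nadi", "darya"],
    "water": ["jal", "pani"],
}
CORPUS_FREQ = {"nadi": 120, "darya": 3, "jal": 40, "pani": 95}

def best_query(english_terms):
    """Generate all translated-query combinations; pick the best-attested one."""
    options = [TRANSLATIONS.get(t, [t]) for t in english_terms]   # untranslated terms pass through
    candidates = list(product(*options))
    return max(candidates, key=lambda cand: sum(CORPUS_FREQ.get(w, 0) for w in cand))
```

Scoring candidates against the document collection is one way to sidestep lexical ambiguity: the sense that actually occurs in the corpus wins.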
Mediterranean Arabic Language and Speech Technology ResourcesHend Al-Khalifa
The MEDAR survey collected responses from 57 players involved in human language technologies and language resources for Arabic. The responses came from a variety of countries, with the highest numbers coming from Egypt (11), Morocco (10), and West Bank & Gaza Strip (9). The majority of respondents (36) answered on behalf of institutions, while 17 answered as independent experts. The survey gathered information on the respondents' profiles, language resources, needs for resources, and market information to understand the current state of language technologies for Arabic.
THE IMPACT OF EXISTING SOUTH AFRICAN ICT POLICIES AND REGULATORY LAWS ON CLOU...csandit
Cloud computing promises good opportunities for economies around the world, as it can help reduce capital expenditure and administration costs, and improve resource utilization. However there are challenges regarding the adoption of cloud computing, key amongst those are security and privacy, reliability and liability, access and usage restriction. Some of these challenges lead to a need for cloud computing policy so that they can be addressed. The purpose of this paper is
twofold. First, it discusses challenges that prompt a need for cloud computing policy. Second, it looks at South African ICT policies and regulatory laws in relation to the emergence of cloud computing. Since this is a literature review paper, the data was collected mainly through literature reviews. The findings reveal that cloud computing indeed raises policy challenges that need to be addressed by policy makers. A lack of policy that addresses cloud computing challenges can negatively impact areas such as security and privacy, competition, intellectual property and liability, consumer protection, and cross-border and jurisdictional challenges.
MODEL CHECKERS –TOOLS AND LANGUAGES FOR SYSTEM DESIGN- A SURVEYcsandit
For over four decades now, variants of Model Checkers have been used as an approach for formal verification of systems consisting of software, hardware, or a combination of both. Though various model checking tools are available, like NuSMV, UPPAAL, PRISM, PAT, and FDR, it is difficult to comprehend their usage for systems in different domains like telecommunication, automobile, health and entertainment. However, industry experts and researchers have showcased the use of formal verification techniques in various domains including Networking, Security and Semiconductor design. With current-generation systems becoming more complex, there is an urgent need to better understand and use the appropriate methodology, language and tool for a given domain. In this paper, we have made an effort to present Model checking in detail with relevance to available tools and languages for specific domains. For novices in the field, this paper provides knowledge of model-checking languages and tools that would be suitable for various purposes in diverse systems.
FORMAL MODELING AND VERIFICATION OF MULTI-AGENTS SYSTEM USING WELLFORMED NETScsandit
Multi-agent systems are asynchronous and distributed computer systems. These characteristics make them also a discrete-event dynamic system. It is, therefore, important to analyze the behavior of such systems to ensure that they terminate correctly and satisfy other important properties. This paper presents a formal modeling and analysis of MAS, based on Well-formed Nets, in order to ensure the absence of any undesired or unexpected behavior. To validate our contribution, we consider the timetable problem, which is a multi-agent resource allocation problem.
RECOGNITION OF RECAPTURED IMAGES USING PHYSICAL BASED FEATUREScsandit
With the development of multimedia technology and digital devices, it is simple and easy to recapture high-quality images from LCD screens. In authentication, the use of such recaptured images can be very dangerous, so it is important to recognize recaptured images in order to increase authenticity. Image recapture detection (IRD) aims to distinguish real-scene images from recaptured ones. An image recapture detection method based on a set of physical-based features is proposed in this paper, which uses a combination of low-level features including texture, HSV colour, and blurriness. Twenty-six dimensions of features are extracted to train a support vector machine classifier with a linear kernel. The experimental results show that the proposed method is efficient, with a good recognition rate for distinguishing real-scene images from recaptured ones. The proposed method also uses low-dimensional features compared to state-of-the-art recapture detection methods.
COMPUTATIONAL METHODS FOR FUNCTIONAL ANALYSIS OF GENE EXPRESSIONcsandit
Sequencing projects arising from high-throughput technologies, including DNA microarrays, have made it possible to simultaneously measure the expression levels of millions of genes of a biological sample, as well as to annotate and identify the role (function) of those genes. Consequently, to better manage and organize this significant amount of information, bioinformatics approaches have been developed. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate researchers' hypotheses throughout the experimental cycle. In this context, this article describes and discusses some of the techniques used for the functional analysis of gene expression data.
APPROACH MULTI-AGENTS EMBEDDED ALARM IN POTROOMScsandit
Industrial shop floor environments require fast intervention by the controllers' computers and operators to ensure high industrial production efficiency. This work focuses on the efficiency of electrolytic potroom process control. The main goal of this work is to design an embedded solution for detection and alarm, using multi-agent system technologies, so that controllers can alert plant operators about problems in an electrolytic pot that is still malfunctioning after controller intervention. If the controller action was unsuccessful due to a feeder or pot bus problem, an audio alarm is immediately issued to the potrooms so that the operator can be notified about the specific problem, independently of the potroom's location.
STOCHASTIC MODELING TECHNOLOGY FOR GRAIN CROPS STORAGE APPLICATION: REVIEWijaia
Stochastic modeling is a key technique in event prediction and forecasting applications. Recently, stochastic models such as the Artificial Neural Network, Hidden Markov, and Markov Chain have received significant attention in agricultural applications. These techniques are capable of predicting actions for better planning and management in various fields. This work comprehensively summarizes and compares their applications, such as their processing techniques and performance, as well as their strengths and limitations with regard to event prediction and forecasting. The work ends with recommendations on
the appropriate techniques for cereal grain storage application.
AN INVESTIGATION OF THE MONITORING ACTIVITY IN SELF ADAPTIVE SYSTEMSijseajournal
Runtime monitoring is essential for violation detection during the execution of the underlying software system. In this paper, an investigation of the monitoring activity of the MAPE-K control loop is performed, which aims at exploring: (1) the architecture of the monitoring activity in terms of the involved components
and control and data flow between them; (2) the standard interface of the monitoring component with other MAPE-K components; (3) adaptive monitoring and its importance to the monitoring overhead issue; and (4) the monitoring mode and its relevance to some specific situations and systems. This paper also presents a Java framework for the monitoring process in self-adaptive systems.
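A monitoring component of the kind investigated can be sketched as probes feeding a shared knowledge base, with a threshold check flagging violations for the Analyze step. The probe names and limits are illustrative, and the paper's framework is in Java; Python is used here only for brevity:

```python
class Monitor:
    """Minimal sketch of the Monitor step of a MAPE-K loop."""

    def __init__(self, probes, knowledge):
        self.probes = probes          # name -> zero-arg callable returning a reading
        self.knowledge = knowledge    # shared MAPE-K knowledge base (a dict here)

    def sense(self):
        """Collect one reading from every probe into the knowledge base."""
        for name, probe in self.probes.items():
            self.knowledge.setdefault(name, []).append(probe())

    def violations(self, limits):
        """Report probes whose latest reading exceeds its configured limit."""
        return [n for n, limit in limits.items()
                if self.knowledge.get(n) and self.knowledge[n][-1] > limit]

kb = {}
mon = Monitor({"latency_ms": lambda: 250}, kb)   # hypothetical latency probe
mon.sense()
```

Adaptive monitoring, point (3) above, would amount to enabling or disabling probes at runtime to trade coverage against overhead.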
PERFORMANCE EVALUATION OF OSPF AND RIP ON IPV4 & IPV6 TECHNOLOGY USING G.711 ...IJCNCJournal
Migration from IPv4 to IPv6 is still visibly slow, mainly because of the inherent cost involved in the implementation, hardware and software acquisition. However, there are many values IPv6 can bring to the
IP enabled environment as compared to IPv4, particularly for Voice Over Internet Protocol (VoIP) solutions. Many companies are drifting away from circuit based switching such as PSTN to packet based switching (VoIP) for collaboration. There are several factors determining the effective utilization and
quality of VoIP solutions. These include the choice of codec, echo control, packet loss, delay, delay variation (jitter), and the network topology. The network is basically the environment in which VoIP is deployed. State of art network design for VoIP technologies requires impeccable Interior Gateway routing
protocols that will reduce the convergence time of the network, in the event of a link failure. Choice of CODEC is also a main factor. Since most research work in this area did not consider a particular CODEC as a factor in determining performance, this paper will compare the behaviour of RIP and OSPF in IPv4
and IPv6 using the G.711 CODEC with Riverbed Modeler 17.5.
COMBINING REUSABLE TEST CASES AND CONTINUOUS SECURITY TESTING FOR REDUCING WE...ijseajournal
In the network communication age, information technology is in a process of continuous and rapid evolution. Network access equipment, information systems, and Web Apps must be updated rapidly and continuously to meet users' requirements. A major challenge of frequent Web App changes is the security of users' personal data and transaction information. Vulnerability scanning and penetration testing are the routine methods to improve the security of a Web App. However, these two methods are not only time-consuming but also require too many resources. To cope with continuous changes under limited resources, security testing not only needs to be completed in a timely manner but must also maintain testing quality; otherwise, the maintenance of every change risks introducing security flaws into the new version of the App. Based on reusable test cases, this paper proposes the Continuous Security Testing Procedure (CSTP), which uses test-case reusability to increase security testing efficiency. In the maintenance of Web Apps under limited resources, CSTP can handle security testing in a timely manner and quickly identify Web App vulnerabilities and defects, assisting App maintainers in effectively repairing security defects and concretely improving the security of users' personal data and transaction information.
UBIQUITOUS COMPUTING AND SCRUM SOFTWARE ANALYSIS FOR COMMUNITY SOFTWAREijseajournal
This document discusses various technologies related to ubiquitous computing and their attributes. It compares technologies such as microelectronics, power supply, sensor technology, communication technology, localization technology, security technology, machine to machine communication, and human machine interface. It discusses their attributes in terms of mobility, embeddedness, ability to form ad-hoc networks, and context awareness. The comparison highlights how each technology enables or limits ubiquitous computing applications.
TRACEABILITY OF UNIFIED MODELING LANGUAGE DIAGRAMS FROM USE CASE MAPSijseajournal
The Unified Modeling Language (UML) is a general purpose modeling language for specifying, constructing and documenting the artifacts of software systems. It is used in developing systems by combining the use of different types of diagrams to express different views of the systems. These diagrams allow transition between requirements and implementation. The lack of traceability between the diagrams
makes any changes difficult and expensive. In this paper, it is proposed to use the Use Case Maps (UCMs) notation, which allows a full description of the system in terms of high-level causal scenarios and helps in visualizing and understanding the system at an early stage. UCMs were used at the early stage to describe the system and to generate the proper UML diagrams from them. By defining a traceability relationship between UCMs and UML, we facilitate the maintenance and consistency of the UML diagrams.
EFFICIENCY OF SOFTWARE DEVELOPMENT AFTER IMPROVEMENTS IN REQUIREMENTS ENGINEE...ijseajournal
In the past decade multiple challenges arose from the method of software development [4, 4]. As described by Davenport, the development process needs an overhaul [4, 4]. Different disciplines, like project management, requirements engineering, the development of code or quality assurance, have been investigated intensively in order to improve the productivity of development. To obtain valid results, the overhaul needs to start by refactoring the right process first. Often, it is sensible to start with such processes, which operate at the interface to the customer, because they are perhaps the most critical to an organization's success [3, 270 – 271]. Mainly, software development consists of four sub processes:
requirements engineering, development, quality assurance and delivery. Requirements engineering and
delivery operate on the interface to the customers. Because of the fact that the analysis of requirements is
groundbreaking, we select this process as the starting point of a process innovation initiative. We analyse
the impact of requirements engineering in KANBAN development processes. Special emphasis is put on the
productivity of the overall development process, after a refactoring of requirements engineering.
COMPARATIVE STUDY FOR PERFORMANCE ANALYSIS OF VOIP CODECS OVER WLAN IN NONMOB...Zac Darcy
Voice over IP (VoIP) applications such as Skype, Google Talk, and FaceTime are promising technologies for providing cheaper voice calls to end users over existing networks. Wireless networks such as WiMAX and Wi-Fi focus on providing quality of service for VoIP. However, there are numerous aspects that affect the quality of voice connections over wireless networks [13]. The adoption of Voice over Wireless Local Area Network is increasing tremendously due to its ease of use, non-invasiveness, economical expansion, low maintenance cost, universal coverage, and basic roaming capabilities. However, deploying Voice over Internet Protocol (VoIP) over Wireless Local Area Network (WLAN) is a challenging task for many network specialists and engineers. The voice codec is one of the most critical components of a VoIP system. In this project, we evaluate the performance of various codecs such as G.711, G.723, and G.729 over Wi-Fi networks. NS2 Wi-Fi simulation models are designed. Performance metrics such as Mean Opinion Score (MOS), average end-to-end latency, and jitter are evaluated and discussed [13]. 1. In this paper, our area of interest is to compare and study the performance of VoIP codecs in non-mobility scenarios by changing some parameters and plotting graphs of throughput, end-to-end delay, MOS, packet delivery ratio, and jitter using the NS2 network simulator.
2. In this paper we analyze the different performance parameters. Recent research has focused on simulation studies with non-mobility scenarios to analyze different VoIP codecs with up to 5 nodes. We have simulated the different VoIP codecs in a non-mobility scenario with up to 300 nodes.
ON ESTIMATION OF TIME SCALES OF MASS TRANSPORT IN INHOMOGENOUS MATERIALZac Darcy
In this paper we generalized recently introduced approach of estimation of time scales of mass transport in inhomogenous materials under influence of inhomogenous potential field. Some examples of using of the approach were considered.
EVALUATION OF SOFTWARE DEGRADATION AND FORECASTING FUTURE DEVELOPMENT NEEDS I...ijseajournal
This article is an extended version of a previously published conference paper. In this research, JHotDraw (JHD), a well-tested and widely used open source Java-based graphics framework developed with the best software engineering practice was selected as a test suite. Six versions of this software were profiled, and data collected dynamically, from which four metrics namely (1) entropy (2) software maturity index, COCOMO effort and duration metrics were used to analyze software degradation, maturity level and use
the obtained results as input to time series analysis in order to predict effort and duration period that may
be needed for the development of future versions. The novel idea is that, historical evolution data is used to
project, predict and forecast resource requirements for future developments. The technique presented in
this paper will empower software development decision makers with a viable tool for planning and decision
making.
CENTROG FEATURE TECHNIQUE FOR VEHICLE TYPE RECOGNITION AT DAY AND NIGHT TIMESijaia
This work proposes a feature-based technique to recognize vehicle types within day and night times. Support vector machine (SVM) classifier is applied on image histogram and CENsus Transformed
histogRam Oriented Gradient (CENTROG) features in order to classify vehicle types during the day and night. Thermal images were used for the night time experiments. Although thermal images suffer from low image resolution, lack of colour and poor texture information, they offer the advantage of being unaffected by high intensity light sources such as vehicle headlights which tend to render normal images unsuitable
for night time image capturing and subsequent analysis. Since contour is useful in shape based categorisation and the most distinctive feature within thermal images, CENTROG is used to capture this feature information and is used within the experiments. The experimental results so obtained were
compared with those obtained by employing the CENsus TRansformed hISTogram (CENTRIST). Experimental results revealed that CENTROG offers better recognition accuracies for both day and night times vehicle types recognition.
50 Computer Science & Information Technology (CS & IT)
a. Translation disambiguation: homonymy and polysemy [6] make it difficult to find the most
appropriate translation for a given word
b. Appropriate resources for evaluating CLIR with low-density languages are lacking
c. Inflected words in the query cannot easily be matched against the root forms listed in the
dictionary without stemming
d. New words are continually added to the language and may not be recognized by the existing
system, resulting in out-of-vocabulary (OOV) words, and
e. OOV words in the query, such as technical terms and named entities, reduce the
performance of the system
According to Cardenosa et al. [5], CLIR approaches can be categorized into three: document
translation, query translation, and Interlingua translation.
• In document translation, every document is translated into the query language and
retrieval is then performed using classical IR techniques. Translation can be applied
offline to produce translations of all documents well in advance, and it offers users the
possibility of accessing content in their own language. However, machine or (large-scale)
human translation may not always be a realistic option for every language pair, as it is
time consuming: every document needs to be translated into the other languages
irrespective of whether it is ever used.
• The query translation approach translates query terms from the source language into
the target language. Translation can be applied online to the query entered by a user, and
the user can reformulate, elaborate, or narrow down the translated query. Translating a
query by dictionary look-up is far more efficient than translating an entire document
collection. However, it is unreliable, since short queries do not provide enough context
for disambiguation when choosing the proper translation of query words, and it does not
exploit domain-specific semantic constraints or corpus statistics to resolve translation
ambiguity.
• In the Interlingua translation approach, the source language, i.e. the text to be translated,
is transformed into an Interlingua, an abstract language-independent representation,
from which the target language is then generated. This approach is useful when there are
no resources for direct translation, but it performs worse than direct translation.
Translation techniques in CLIR are categorized into direct and indirect translation [7]. Direct
translation uses a Machine-Readable Dictionary (MRD), parallel corpora, a machine translation
algorithm, or a combination of these.
• In dictionary-based translation, the query words are translated into the target language
using an MRD [8]. MRDs are electronic versions of printed dictionaries, and may be general
dictionaries, specific domain dictionaries, or a combination of both. This approach has been
widely adopted in CLIR because bilingual dictionaries are widely available.
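The dictionary look-up just described can be sketched in a few lines; the tiny bilingual dictionary below and its entries are hypothetical placeholders, not the MRD used in this work:

```python
# Hypothetical Amharic -> Arabic machine-readable dictionary (MRD) entries.
BILINGUAL_DICT = {
    "መጽሐፍ": ["كتاب"],           # "book"
    "ትምህርት": ["تعليم", "درس"],  # "education" / "lesson"
}

def translate_query(amharic_terms):
    """Translate each Amharic query term via the MRD; terms not found
    are kept as-is, which is one simple way to surface OOV words."""
    translated = []
    for term in amharic_terms:
        translated.extend(BILINGUAL_DICT.get(term, [term]))
    return translated
```

A term with several dictionary senses expands into several target-language terms, which is exactly where the disambiguation problem discussed above arises.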
• Parallel corpora contain a set of documents and their translations in one or more other
languages. These paired documents can be used to identify the most likely translations of
terms between the languages in the corpus.
• A Machine Translation (MT) system can also be used to translate the documents in the
corpora into the language of a user’s query, which can be done offline in advance or
online [9].
Indirect translation is a common solution when resources supporting direct translation are
absent. It can be applied through a transitive or a dual translation system. In transitive
translation, an intermediary (pivot) language, placed between the source query and the target
document collection, is used to enable comparison with the target document collection. In dual
translation systems, both the query and the document representations are translated into the
intermediate language [10].
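Transitive translation through a pivot language amounts to composing two bilingual dictionaries; the entries below (Amharic to English to Arabic) are illustrative placeholders:

```python
# Hypothetical dictionaries: source -> pivot and pivot -> target.
AM_EN = {"ውሃ": ["water"]}   # Amharic -> English (pivot)
EN_AR = {"water": ["ماء"]}  # English -> Arabic (target)

def pivot_translate(term):
    """Translate an Amharic term into Arabic via the English pivot.
    Returns an empty list when either look-up fails."""
    results = []
    for en in AM_EN.get(term, []):
        results.extend(EN_AR.get(en, []))
    return results
```

Note that each pivot hop multiplies the ambiguity: every pivot-language sense fans out into its own set of target translations, which is why transitive translation is usually a fallback rather than a first choice.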
In all the above-mentioned cases, a key element is the mechanism for mapping between
languages. This translation knowledge can be encoded in different forms: as a data structure of
query- and document-language term correspondences in an MRD, or as an algorithm, such as a
machine translation or machine transliteration system [11]. While all of these forms are
effective, the latter require a substantial investment of time and resources to develop and are not
widely or readily available for many language pairs.
CLIR is becoming a promising field of research which bridges the gap between different
languages, and hence between people who speak different languages and come from different
cultures. As CLIR is still in its infancy, work on many language pairs is only beginning.
Amharic-Arabic is one such language pair that needs to be explored for CLIR.
According to the 2007 census, Amharic speakers comprise 26.9% of Ethiopia’s population.
Amharic is also spoken by many people in Israel, Egypt and Sweden [1]. Arabic is a natural
language spoken by 250 million people in 21 countries as a first language, and in Islamic
countries as a second language [8]. Ethiopia is one of the countries in which more than 33.3% of
the population follow Islam, and they use the Arabic language for religious teaching and for
communication. The Arabic and Amharic languages belong to the Semitic family of
languages [12], in which words are formed by modifying the root itself internally and not simply
by concatenating affixes to word roots. Amharic and Arabic are morphologically very rich
languages.
The current Amharic writing system consists of a core of thirty-three characters (ፊደል, fidel),
each of which occurs in a basic form and in six other forms known as orders [1]. The non-basic
forms are derived from the basic forms by more-or-less regular modifications; thus, there are 231
different characters. The seven orders represent syllable combinations consisting of a consonant
and a following vowel. According to Abebayehu [13], this characteristic makes the Amharic
writing system a syllabic writing system: a character or symbol represents a phoneme, which is a
combination of a consonant and a vowel. These are written in a unique script that is now
supported in Unicode (U+1200 - U+137F) [14].
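Because the fidel occupy a single contiguous Unicode block, detecting Amharic text reduces to a code-point range check; a minimal sketch:

```python
def is_ethiopic(ch):
    """True if ch lies in the Unicode Ethiopic block (U+1200 - U+137F)."""
    return 0x1200 <= ord(ch) <= 0x137F

def contains_amharic(text):
    """True if any character of the text is written in the Ethiopic script."""
    return any(is_ethiopic(ch) for ch in text)
```

A CLIR front end can use such a check to route a mixed-script query to the appropriate language-specific processing pipeline.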
The Arabic alphabet consists of 28 characters, or 29 characters if the Hamza is considered a
separate character. It is written from right to left, like Persian and Hebrew, unlike many
international languages. Three of the Arabic characters appear in different shapes, as
follows [15][16]:
• Hamza (ء) is sometimes written ا, إ or أ (alif)
• Ta marbouta (ة), like t in English, is found at the end of a word and may be written
without its two dots (ه = ha)
• Alif maqsura (ى) is the character ya (ي) written without dots.
The above three characters pose some difficulties in setting up a CLIR system. Some Arabic
language resources ignore the Hamza and the dots above ta marbouta in order to unify the input
and output for these characters. Arabic also has a whole series of non-alphabetic signs, added
above or below the consonant letters, that make the reading of a word less ambiguous.
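Resources that unify these variant letter shapes typically do so with a character-level normalization map; the following sketch illustrates the idea (the exact mapping a given resource applies may differ):

```python
# Illustrative normalization map for the three variant-shape characters:
# hamza-carrying alif forms -> bare alif, ta marbouta -> ha, alif maqsura -> ya.
NORMALIZE_MAP = {
    "\u0623": "\u0627",  # أ -> ا
    "\u0625": "\u0627",  # إ -> ا
    "\u0629": "\u0647",  # ة -> ه
    "\u0649": "\u064A",  # ى -> ي
}

def normalize_arabic(text):
    """Replace variant letter shapes so that orthographic variants of
    the same word compare equal during indexing and query matching."""
    return "".join(NORMALIZE_MAP.get(ch, ch) for ch in text)
```

Applying the same map to both the indexed documents and the translated query keeps the two sides consistent, which is the whole point of the normalization step.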
Both the Arabic and Amharic languages pose translation challenges for many reasons [17][18].
Arabic sentences are usually long, and punctuation has little or no effect on the interpretation
of the text. Contextual analysis is important in both Arabic and Amharic in order to understand the
exact meaning of some words. For example, in Amharic, the word "ገና" can mean either the
Christmas holiday or waiting for something to happen. Characters are sometimes stretched in
justified text, which hinders exact matching of the same word. In Arabic, synonyms are very
common. For example, "year" has three synonyms in Arabic, عام، حول، سنة, and all are widely used
in everyday communication. Another challenge in Arabic is the absence of diacritization
(sometimes called vocalization). Diacritics are symbols written above or below the
letters to indicate the proper pronunciation as well as to serve disambiguation
purposes. The absence of diacritics in Arabic texts poses a real challenge for Arabic natural
language processing, as well as for translation, leading to high ambiguity. Though diacritics
are extremely important for readability and understanding, they do not appear in most
printed media in Arabic regions, nor on Arabic Internet web sites. They are visible in religious
texts such as the Quran, which is fully diacritized in order to prevent misinterpretation.
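Because most running text omits diacritics while the Quran is fully diacritized, a CLIR system typically strips them before matching. A minimal sketch (the range U+064B-U+0652 covers the common tashkeel marks):

```python
import re

# Arabic short-vowel and gemination marks (tashkeel) occupy U+064B-U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove tashkeel so vocalized and unvocalized spellings compare equal."""
    return DIACRITICS.sub("", text)

# The fully vocalized "بِسْمِ" reduces to the bare consonants "بسم".
```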
Ethiopia has good socio-economic relationships with Arabic countries, and they communicate
using the Arabic and Amharic languages. For example, reports sent between Ethiopia and Arabic
countries need to be written in both languages, and most of the new and translated religious books
are written in both languages by Muslim scholars. As with English, a large amount of
unstructured documents is available on the net in the Arabic and Amharic languages. However, IR
tools and techniques are mostly English-language oriented, and there are currently several
attempts to develop IR tools for the Arabic and Amharic languages. Many Internet users who are
non-native Arabic speakers can read and understand Arabic documents but feel
uncomfortable formulating queries in Arabic, either because of their limited
vocabulary in Arabic or because of possible misuse of Arabic words. Different attempts
have been made to develop CLIR systems for the Amharic-French [19] and Afan Oromo-English [3]
language pairs. Nevertheless, no CLIR system exists for the Amharic-Arabic language pair.
The development of standard corpora and tools is essential in order to test the performance of the
newly developed CLIR system [20].
The aim of this research work is to develop a prototype dictionary-based Amharic-Arabic
CLIR system that enables Amharic and Arabic language users to retrieve documents in both
languages, and to examine the ability of the proposed system. We employ a query translation
strategy, which is more efficient than a document translation strategy, because document
translation incurs the overhead cost of translating all documents, especially when new
documents are added frequently and not all of the documents are of interest to the users [21].
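The query translation strategy can be sketched as a term-by-term dictionary lookup; the two dictionary entries below are purely illustrative, not taken from the system's actual bilingual dictionary:

```python
# Hypothetical bilingual dictionary entries (illustrative only).
AM_AR_DICT = {
    "ቀን": ["يوم"],       # "day"
    "ባለቤት": ["مالك"],    # "owner"
}

def translate_query(amharic_terms):
    """Replace each Amharic term with its Arabic translation(s);
    out-of-vocabulary terms pass through untranslated."""
    arabic_terms = []
    for term in amharic_terms:
        arabic_terms.extend(AM_AR_DICT.get(term, [term]))
    return arabic_terms
```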
The remainder of this paper is organized as follows: the review of related works is presented in
Section 2 and the proposed CLIR method in Section 3. Section 4 gives the experimental setup and
the results, and the paper concludes in Section 5.
2. RELATED WORKS
Several researchers have studied CLIR for different language pairs. However, less
work is reported on Amharic and Arabic paired with other languages. Some of the
prominent works are discussed below.
Atelach Alemu Argaw et al. [19] present a dictionary-based approach to translate Amharic
queries into French bags-of-words in the Amharic-French bilingual track at CLEF 2005, using two
search engines: SICS and Lucene. Non-content-bearing words were removed both before and
after the dictionary lookup. TF/IDF values supplemented by a heuristic function were used to
remove stop words from the Amharic queries, and two French stop word lists were used to
remove stop words from the French translations. From the experiments, they found that the SICS
search engine performed better than Lucene. Aljlayl et al. [1] empirically evaluated the use of an
MT-based approach for query translation in an Arabic-English CLIR system using TREC-7 and
TREC-9 topics and collections. The effect of query length on the performance of MT was also
investigated, to explore how much context is actually required for successful MT processing; a
well-formed source query enables the MT system to provide its best accuracy. Fasika Tesfaye
[20] employed a corpus-based approach which makes use of phrasal query translation for
Amharic-English CLIR. The experiments yielded a recall of 24.8% for translated
Amharic queries, 46.3% for Amharic queries, and 43.6% for the baseline English queries.
Eyob Nigussie [7] developed a corpus-based Afaan Oromo-Amharic CLIR system to
enable Afaan Oromo speakers to retrieve Amharic information using Afaan Oromo queries.
Documents including news articles, the Bible, legal documents, and proclamations from the customs
authority were used as a parallel corpus. Two experiments were conducted: allowing only one
possible translation for each Afaan Oromo query term, and allowing all possible translations.
The first experiment returned a maximum average precision of 81% for the monolingual
(Afaan Oromo) query run and 45% for the bilingual (translated Amharic) query run. The second
experiment showed better recall and precision than the first, 60%
for the bilingual query run, while the result for the monolingual query run remained the same.
Mequannint et al. [22] designed a model for an Amharic-English search engine and developed a
bilingual Web search engine based on the model that enables Web users to find the
information they need in the Amharic and English languages. They identified different language-
dependent query pre-processing components for query translation and developed a bidirectional
dictionary-based translation system, which incorporates a transliteration component to handle
proper names, which are often missing in bilingual lexicons. They used an Amharic search engine
and an open-source English search engine (Nutch) for Web document crawling, indexing,
searching, ranking, and retrieving. The experimental results showed that the Amharic-English
cross-lingual retrieval engine achieved 74.12% of the performance of its corresponding English
monolingual retrieval engine, and the English-Amharic cross-lingual retrieval engine achieved
78.82% of that of its corresponding Amharic monolingual retrieval engine.
In CLIR, the semantic level of words is crucial, and solving the problem of word sense
disambiguation will enhance the effectiveness of CLIR systems. Andres Duque et al. [23] studied
how to choose the best dictionary for Cross-Lingual Word Sense Disambiguation (CLWSD). They
compared different dictionaries in two frameworks: analysing the
potential results of an ideal system using those dictionaries, and analysing the results obtained
when particular bilingual dictionaries provide the candidate translations to the
unsupervised CLWSD system Co-occurrence Graph. They also developed a hybrid
system combining the results provided by a probabilistic dictionary with those obtained with a
Most Frequent Sense (MFS) approach. They focused only on English-Spanish cross-
lingual disambiguation; the hybrid approach outperforms the results obtained by other
unsupervised systems.
As Arabic is a relatively widely researched Semitic language with a number of properties in
common with Amharic, recent computational linguistic research [1],[19],[24]
conducted on the Amharic and Arabic languages recommends customizing and reusing the
tools developed for these languages. While the above researchers have attempted to develop and
evaluate Amharic and Arabic paired with other languages separately, no research has
paired these two languages together.
3. METHODOLOGY
In this work, an attempt has been made to design a dictionary based Amharic-Arabic CLIR
system, which has indexing and searching tasks. Inverted file indexing structure is used to
organize documents to speed up searching. The probabilistic model that attempts to simulate the
uncertainty nature of an IR system guides the searching process. Amharic and Arabic documents
are pre-processed separately by performing tokenization, normalization, stop word removal,
punctuation removal and stemming. Figure 3.1 shows the general architecture of the system,
which is adapted from C. Peters et al. [25]. A bilingual dictionary, which contains a list of
Amharic words and their Arabic translations, is constructed manually and is used to translate
Amharic queries into Arabic queries.
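The inverted-file structure mentioned above maps each index term to the documents that contain it; a minimal sketch over already pre-processed token lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

# Two toy pre-processed documents:
index = build_inverted_index({1: ["day", "judgment"], 2: ["day"]})
# index["day"] -> {1, 2}; index["judgment"] -> {1}
```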
The binary independence probabilistic information retrieval model is adopted to search for
relevant documents in the Amharic-Arabic parallel corpus. Probabilistic information retrieval
estimates the probability that a document di will be judged relevant by the user
with respect to a query q, expressed as P(R|q, di), where R is the set of relevant
documents. Typically, in the probabilistic model, the documents are divided into
relevant and irrelevant sets based on the query [26]. However, whether a given document is
relevant or irrelevant with respect to the user's query is initially unknown, so the probabilistic
model needs to guess the relevance at the beginning of the search process. The user then observes
the first retrieved documents and gives feedback to the system by marking relevant documents as
relevant and irrelevant documents as irrelevant. With relevance feedback collected from a few
documents, the model can then estimate the probability of relevance for the
remaining documents in the collection. This process is applied iteratively to improve the
performance of the system, retrieving more and more relevant documents that satisfy the
user's need.
Figure 3.1 Dictionary based Amharic-Arabic CLIR system architecture
The assumptions made to handle the uncertain nature of the probabilistic model are:
• p(ki|R) is constant for all index terms ki (usually equal to 0.5)
• The distribution of index terms among the non-relevant documents can be approximated
by the distribution of index terms among all the documents in the collection.
These two assumptions give the initial retrieval status value

sim(dj, q) ∝ Σ ki ∈ q∩dj log((N − ni) / ni)

where N is the total number of documents in the collection and ni is the number of documents
which contain the index term ki.
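Under these assumptions, the initial (no-feedback) score reduces to summing log((N − ni)/ni) over the query terms present in a document; a sketch using the inverted-index document counts:

```python
import math

def bim_score(query_terms, doc_terms, index, N):
    """Binary-independence initial score: sum of log((N - n_i) / n_i)
    over query terms that occur in the document."""
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            n_i = len(index.get(term, ()))
            if 0 < n_i < N:          # guard against log of zero or a negative
                score += math.log((N - n_i) / n_i)
    return score
```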
4. EXPERIMENTATION AND EVALUATION
The Holy Quran, available through the Tanzil Quran navigator website [27], includes 114 chapters,
each containing a minimum of 3 to a maximum of 286 verses, in both the Arabic and Amharic
languages. In this work, subject to the availability of verses, we downloaded up to 10 verses
from each chapter in Arabic and the corresponding verses in Amharic.
Even though a complete evaluation requires assessing both the effectiveness and the efficiency
of the system, only the effectiveness of the IR system is taken into consideration to determine the
performance of the system on the translated queries. Precision and recall are used to measure the
effectiveness of the designed IR system.
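Precision and recall are computed per query from the retrieved and relevant document sets:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 documents retrieved, 2 of them among the 4 relevant ones:
# precision_recall([1, 2, 3], [2, 3, 5, 7]) -> (2/3, 0.5)
```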
We used Amharic queries for the retrieval of documents in both the Arabic and Amharic languages.
In addition to retrieving Amharic documents, the Amharic query is translated into Arabic for
retrieving Arabic documents. We used 14 simple queries to test the performance of the system,
and the results obtained are shown in Table 4.1. The performance of the system on relevant
retrieved Arabic documents is much better than that on Amharic documents (i.e., 83.89%
precision for Amharic against 52.02% precision for Arabic).
When the system is tested with queries containing words that are out of the vocabulary of the
dictionary, its precision decreases and its recall increases, especially for Arabic documents. For
example, if we add the word "ለኾነው" (to become), which does not appear in the
dictionary or is not translated correctly, to the first query "የፍርዱ ቀን ባለቤት ለኾነው" (Owner of the
Day of Judgment), the word "ለኾነው" is used directly for searching. Therefore, the number of non-
relevant Amharic documents increases, strongly decreasing the performance of the system. The main
hindrance to system performance is incorrect translation due to unnormalized Arabic words,
specifically diacritics, which the system cannot map to the dictionary words.
Table 4.1 Performance of the proposed system
5. CONCLUSION
Multilingual information is required for the countries that have multiple languages and it is vital
as the users of the internet throughout the world are ever increasing. We have developed a
prototype of dictionary based Amharic-Arabic CLIR system that enables Amharic and Arabic
language users to retrieve both language documents and to examine the ability of the proposed
system. The effectiveness of our proposed system was evaluated and the performance of the
system on Arabic relevant retrieved documents was much better than that of Amharic documents.
The main challenges with dictionary-based CLIR are untranslatable words due to the limitations of
the general Amharic-Arabic dictionary, the processing of inflected words, phrase identification and
translation, and lexical ambiguity in the Amharic and Arabic languages.
Even though this research has vital significance for retrieving the required information from Amharic-
Arabic documents, some issues need to be investigated further to develop an efficient and effective
CLIR system. In particular, the dictionary-based approach requires an exhaustive and detailed
mapping of concepts between the two languages, which is very difficult to build.
REFERENCES
[1] M. Aljlayl, O. Frieder, and D. Grossman, “On Arabic-English cross-language information retrieval: A
machine translation approach,” in Information Technology: Coding and Computing, 2002.
Proceedings. International Conference on, 2002, pp. 2–7.
[2] K. Sourabh, “An Extensive Literature Review on CLIR and MT activities in India,” Int. J. Sci. Eng.
Res., 2013.
[3] D. Bekele, “Afaan Oromo Oromo-English Cross-Lingual Information Retrieval (Clir),” AAU, 2011.
[4] D. Kelly, “Methods for evaluating interactive information retrieval systems with users,” Found.
Trends Inf. Retr., vol. 3, no. 1—2, pp. 1–224, 2009.
[5] J. Cardeñosa, C. Gallardo, and A. Toni, “Multilingual Cross Language Information Retrieval A new
approach.”
[6] M. Abusalah, J. Tait, and M. Oakes, “Literature Review of Cross Language Information Retrieval,”
Comput. Hum., pp. 175–177, 2005.
[7] E. Nigussie, “Afaan Oromo--Amharic Cross Lingual Information Retrieval,” AAU, 2013.
[8] T. Hedlund, “Dictionary-based cross-language information retrieval: principles, system design and
evaluation,” in SIGIR Forum, 2004, vol. 38, no. 1, p. 76.
[9] M. R. Warrier and M. S. S. Govilkar, “A Survey on Various CLIR Techniques.”
[10] D. Zhou, M. Truran, T. Brailsford, V. Wade, and H. Ashman, “Translation techniques in cross-
language information retrieval,” ACM Comput. Surv., vol. 45, no. 1, p. 1, 2012.
[11] G.-A. Levow, D. W. Oard, and P. Resnik, “Dictionary-based techniques for cross-language
information retrieval,” Inf. Process. Manag., vol. 41, no. 3, pp. 523–547, 2005.
[12] A. D. Rubin, “The Subgrouping of the Semitic Languages,” Linguist. Lang. Compass, vol. 2, no. 1,
pp. 79–102, 2008.
[13] S. ABEBAYEHU, “Amharic-English Script Identification in Real-Life Document Images,” aau,
2012.
[14] B. Ayalew, “The submorphemic structure of Amharic: toward a phonosemantic analysis,” University
of Illinois at Urbana-Champaign, 2013.
[15] R. Tsarfaty, “Syntax and Parsing of Semitic Languages,” in Natural Language Processing of Semitic
Languages, Springer, 2014, pp. 67–128.
[16] H. Ishkewy, H. Harb, and H. Farahat, “Azhary: An arabic lexical ontology,” arXiv Prepr.
arXiv1411.1999, 2014.
[17] T. Hailemeskel, “Amharic Text Retrieval: An Experiment Using Latent Semantic Indexing (LSI) with
Singular Value Decomposition (SVD),” M. Sc. Thesis, Addis Ababa University, Addis Ababa, 2003.
[18] F. Ahmed and A. Nurnberger, “Arabic/English word translation disambiguation approach based on
naïve Bayesian classifier,” in Computer Science and Information Technology, 2008. IMCSIT
2008. International Multiconference on, 2008, pp. 331–338.
[19] A. A. Argaw, L. Asker, J. Karlgren, M. Sahlgren, and R. Cöster, “Dictionary-based Amharic-French
information retrieval,” CEUR Workshop Proc., vol. 1171, 2005.
[20] F. Tesfaye, “Phrasal Translation for Amharic English Cross Language Information Retrieval (Clir),”
AAU, 2010.
[21] M. Adriani, “Using statistical term similarity for sense disambiguation in cross-language information
retrieval,” Inf. Retr. Boston., vol. 2, no. 1, pp. 71–82, 2000.
[22] M. Munye and S. Atnafu, “Amharic-English bilingual web search engine,” in Proceedings of the
International Conference on Management of Emergent Digital EcoSystems, 2012, pp. 32–39.
[23] A. Duque, J. Martinez-Romo, and L. Araujo, “Choosing the best dictionary for Cross-Lingual Word
Sense Disambiguation,” Knowledge-Based Syst., vol. 81, pp. 65–75, 2015.
[24] S. A. L. S. F. Adafre, “Machine Translation for Amharic: Where we are,” Strateg. Dev. Mach. Transl.
Minor. Lang., p. 47.
[25] C. Peters, M. Braschler, and P. Clough, Multilingual information retrieval: From research to practice.
Springer Science & Business Media, 2012.
[26] F. Dahak, M. Boughanem, and A. Balla, “A probabilistic model to exploit user expectations in XML
information retrieval,” Inf. Process. Manag., 2016.
[27] Tanzil Quran Navigator, “http://tanzil.net/#trans/am.sadiq.”
AUTHORS
Ibrahim Gashaw Kassa has been a Ph.D. candidate at Mangalore University,
Karnataka State, India, since 2016. He graduated in 2006 in Information Systems
from Addis Ababa University, Ethiopia. In 2014, he obtained his master’s degree
in Information Technology from the University of Gondar, Ethiopia, where he
served as a lecturer from 2009 to May 2016. His research interest
is in Cross Language Information Retrieval.
Dr. H L Shashirekha is an Associate Professor in the Department of Computer
Science, Mangalore University, Mangalore, Karnataka State, India. She
completed her M.Sc. in Computer Science in 1992 and Ph.D. in 2010 from the
University of Mysore. She is a member of the Board of Studies and Board of
Examiners (PG) in Computer Science, Mangalore University. She has presented
several papers at International Conferences and published several papers in
International Journals and Conference Proceedings. Her area of research includes
Text Mining and Natural Language Processing.