The MEDAR survey collected responses from 57 players involved in human language technologies and language resources for Arabic. The responses came from a variety of countries, with the highest numbers coming from Egypt (11), Morocco (10), and West Bank & Gaza Strip (9). The majority of respondents (36) answered on behalf of institutions, while 17 answered as independent experts. The survey gathered information on the respondents' profiles, language resources, needs for resources, and market information to understand the current state of language technologies for Arabic.
Development and evaluation of a web based question answering system for arabi...ijnlc
Question Answering (QA) systems are gaining great importance due to the increasing amount of web
content and the high demand for digital information that regular information retrieval techniques cannot
satisfy. A question answering system enables users to have a natural language dialog with the machine,
which is required for virtually all emerging online service systems on the Internet. The need for such
systems is higher in the context of the Arabic language. This is because of the scarcity of Arabic QA
systems, which can be attributed to the great challenges they present to the research community, including
the particularities of Arabic, such as short vowels, absence of capital letters, complex morphology, etc. In
this paper, we report the design and implementation of an Arabic web-based question answering
system, which we called “JAWEB”, the Arabic word for the verb “answer”. Unlike all Arabic question answering
systems, JAWEB is a web-based application, so it can be accessed at any time and from
anywhere. Evaluating JAWEB showed that it gives the correct answer with 100% recall and 80% precision
on average. When compared to ask.com, the well-established web-based QA system, JAWEB provided 15-20%
higher recall. These promising results give clear evidence that JAWEB has great potential as a QA
platform and is much needed by Arabic-speaking Internet users across the world.
A Review on the Cross and Multilingual Information Retrievaldannyijwest
In this paper we explore some of the most important areas of information retrieval, in particular Cross-lingual
Information Retrieval (CLIR) and Multilingual Information Retrieval (MLIR). CLIR deals with
asking questions in one language and retrieving documents in a different language. MLIR deals with asking
questions in one or more languages and retrieving documents in one or more different languages. With an
increasingly globalized economy, the ability to find information in other languages is becoming a necessity.
We also present the evaluation initiatives of the information retrieval domain. Finally, we present an
overall review of the research work in Indian and foreign languages.
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVALcsandit
The demand for multilingual information is becoming more perceptible as the number of internet users throughout the world grows, and it creates the problem of retrieving documents in one language by specifying a query in another language. This increasing demand can be addressed by designing automatic tools that accept a query in one language and retrieve the relevant documents in other languages. We have developed a prototype Amharic-Arabic Cross Language
Information Retrieval System by applying a dictionary-based approach that enables users to enter a query in Amharic and retrieve the relevant documents, in both Amharic and Arabic, from an Amharic-Arabic corpus.
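The dictionary-based approach described above can be illustrated with a minimal sketch. This is not the authors' system: the dictionary entries, document texts, and the function names `translate_query` and `rank` are hypothetical placeholders, and the scoring is a plain term-count match chosen only to keep the example short.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary. The keys and values are placeholder
# tokens, not real Amharic or Arabic entries from the paper.
BILINGUAL_DICT = {
    "src_water": ["tgt_water"],
    "src_book": ["tgt_book", "tgt_writing"],
}


def translate_query(query, dictionary):
    """Replace each source term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged (assumption).
    """
    translated = []
    for term in query.split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def rank(documents, query_terms):
    """Score each document by how often the query terms occur in it."""
    scores = []
    for doc_id, text in documents.items():
        counts = Counter(text.split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


documents = {
    "d1": "tgt_book tgt_book tgt_water",
    "d2": "tgt_writing",
    "d3": "unrelated",
}
terms = translate_query("src_book src_water", BILINGUAL_DICT)
ranking = rank(documents, terms)
```

To retrieve documents in both languages, as the abstract describes, the original query terms could be appended to the translated ones before ranking.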
LIT (Lexicon of the Italian Television) is a project conceived by the Accademia della Crusca, the leading research institution on the Italian language, in collaboration with CLIEO (Center for theoretical and historical Linguistics: Italian, European and Oriental languages). The project aims to study the frequencies of the Italian lexicon used in television content and targets the specific sector of web applications for linguistic research. The corpus of transcriptions consists of approximately 170 hours of random television recordings transmitted by the national broadcaster RAI (Italian Radio Television) during the year 2006.
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTSijnlc
The evolution of information technology has led to the collection of large amounts of data, the volume of
which has increased to the extent that the data produced in the last two years is greater than all the data ever
recorded in human history. This has necessitated the use of machines to understand, interpret and apply data
without manual involvement. A lot of these texts are available in transliterated code-mixed form, which, due
to its complexity, is very difficult to analyze. The work already performed in this area is progressing at
a great pace, and this work hopes to push it further. The designed system
classifies transliterated (Romanized) Hindi and Marathi text documents automatically using
supervised learning methods (k-nearest neighbors (KNN), Naïve Bayes and Support Vector Machine (SVM)) and ontology based
classification; the results are compared in order to decide which methodology is better suited to
handling these documents. As we will see, the plain machine learning algorithms perform just as
well as, or in many cases much better than, the more analytical approach.
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALIJCI JOURNAL
Nowadays, the number of Web users accessing information over the Internet is increasing day by day. A huge
amount of information on the Internet is available in different languages and can be accessed by anybody at any
time. Information Retrieval (IR) deals with finding useful information in a large collection of
unstructured, structured and semi-structured data. Information Retrieval can be classified into different
classes such as monolingual information retrieval, cross language information retrieval (CLIR) and multilingual
information retrieval (MLIR). In the current scenario, the diversity of information and language
barriers are serious issues for communication and cultural exchange across the world. To overcome such
barriers, cross language information retrieval systems are nowadays in strong demand. CLIR refers
to information retrieval activities in which the query and the documents may appear in different languages.
This paper gives an overview of the new application areas of CLIR and reviews the approaches used in
CLIR research for query and document translation. Further, based on the available literature, a
number of challenges and issues in CLIR are identified and discussed.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
This article introduces a methodology for analyzing sentiment in Arabic text using a global foreign lexical
source. Our method transfers the resources available in another language, such as SentiWordNet in
English, to Arabic, a language with limited resources. The knowledge taken from the external
resource is injected into the feature model while the machine-learning-based classifier is trained. The
first step of our method is to build the bag-of-words (BOW) model of the Arabic text. The second step
calculates the polarity score using machine translation and the English SentiWordNet. The scores
for each text are added to the model in three pairs for objective, positive, and negative. The last step of
our method involves training the ML classifier on that model to predict the sentiment of the Arabic text.
Our method improves performance over the BOW baseline model in most cases. In
addition, it seems a viable approach to sentiment analysis of Arabic text where
available resources are limited.
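The first three steps above (BOW model, polarity scoring through translation, appending the scores to the features) might look roughly like the following sketch. It rests on several assumptions: a lookup table stands in for machine translation, a small hand-made dictionary stands in for SentiWordNet, scores are averaged over matched tokens, and all names (`bow_vector`, `polarity_scores`, the token strings) are hypothetical. The only lexicon property relied on is that SentiWordNet assigns each entry positive and negative scores, with objectivity equal to 1 minus their sum.

```python
def bow_vector(tokens, vocabulary):
    """Bag-of-words counts of the text over a fixed vocabulary."""
    return [tokens.count(word) for word in vocabulary]


def polarity_scores(tokens, translate, lexicon):
    """Average (objective, positive, negative) scores over translatable tokens.

    translate: source token -> English word (stand-in for machine translation).
    lexicon:   English word -> (positive, negative) scores in [0, 1].
    """
    pos = neg = obj = 0.0
    matched = 0
    for token in tokens:
        entry = lexicon.get(translate.get(token))
        if entry is None:
            continue  # token has no translation or no lexicon entry
        p, n = entry
        pos += p
        neg += n
        obj += 1.0 - p - n  # objectivity is what remains after pos and neg
        matched += 1
    if matched == 0:
        return [0.0, 0.0, 0.0]
    return [obj / matched, pos / matched, neg / matched]


# Placeholder tokens standing in for Arabic words (assumption).
tokens = ["w1", "w2", "w3"]
vocabulary = ["w1", "w2", "w3", "w4"]
translate = {"w1": "good", "w2": "bad"}
lexicon = {"good": (0.75, 0.0), "bad": (0.0, 0.5)}

# Feature vector = BOW counts followed by the three polarity scores.
features = bow_vector(tokens, vocabulary) + polarity_scores(tokens, translate, lexicon)
```

The resulting `features` list is what would be passed to a classifier in the last step; the choice of classifier is left open here.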
Hybrid approaches for automatic vowelization of arabic textsijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is
made up of two modules. In the first one, a morphological analysis of the words of the text is performed using the
open source morphological analyzer AlKhalil Morpho Sys. The outputs for each word, analyzed out of context,
are its different possible vowelizations. The integration of this analyzer in our vowelization system required
the addition of a lexical database containing the most frequent words in the Arabic language. Using a
statistical approach based on two hidden Markov models (HMM), the second module aims to eliminate the
ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the
vowelized words are the hidden states. The observed states of the second HMM are identical to those of the
first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our
system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho
Sys. Our approach opens an important way to improve the performance of automatic vowelization of
Arabic texts for other uses in automatic natural language processing.
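The path selection step can be sketched as a Viterbi search over a lattice of candidates. This is an illustrative sketch under stated assumptions, not the authors' implementation: each position holds the candidate vowelizations that a morphological analyzer would propose for one word, the probabilities are invented toy values, the state names are placeholders, and the function name `viterbi` and the `floor` smoothing constant are hypothetical. Because every candidate already corresponds to its observed word, emission probabilities are left out.

```python
import math


def viterbi(candidates, start_prob, trans_prob, floor=1e-6):
    """Return the most probable sequence of hidden states.

    candidates: non-empty list; candidates[i] is the list of possible hidden
                states (e.g. vowelized forms) for the i-th observed word.
    start_prob: state -> probability of starting a sentence with that state.
    trans_prob: (previous_state, state) -> transition probability.
    floor:      probability used for unseen events (assumed smoothing).
    """
    # best[i][state] = (log probability of best path ending here, backpointer)
    best = [{s: (math.log(start_prob.get(s, floor)), None) for s in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for state in candidates[i]:
            layer[state] = max(
                (best[i - 1][prev][0] + math.log(trans_prob.get((prev, state), floor)), prev)
                for prev in candidates[i - 1]
            )
        best.append(layer)

    # Backtrack from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(candidates) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]


# Toy lattice: two words, two candidate vowelizations each (placeholders).
candidates = [["a1", "a2"], ["b1", "b2"]]
start_prob = {"a1": 0.6, "a2": 0.4}
trans_prob = {
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.9, ("a2", "b2"): 0.05,
}
best_path = viterbi(candidates, start_prob, trans_prob)
```

In this toy lattice the locally likelier first candidate `a1` loses to `a2`, because the transition from `a2` to `b1` makes that full path more probable.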
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
Many automatic translation works have addressed major European language pairs by taking advantage of large scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair because of the scarcity of parallel data. Moreover, there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation available on Tanzile. Experiments are carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM and GRU based NMT models and the Google Translation system are compared, and the LSTM based OpenNMT is found to outperform the GRU based OpenNMT and the Google Translation system, with BLEU scores of 12%, 11%, and 6% respectively.
USING OBJECTIVE WORDS IN THE REVIEWS TO IMPROVE THE COLLOQUIAL ARABIC SENTIME...ijnlc
One of the main difficulties in sentiment analysis of the Arabic language is the presence of
colloquialisms. In this paper, we examine the effect of using objective words in conjunction with sentimental
words on sentiment classification for colloquial Arabic reviews, specifically Jordanian colloquial
reviews. The reviews often include both sentimental and objective words; however, most existing
sentiment analysis models ignore the objective words because they are considered useless. In this work, we
created two lexicons: the first includes the colloquial sentimental words and compound phrases, while the
other contains the objective words associated with values of sentiment tendency based on a particular
estimation method. We used these lexicons to extract sentiment features that serve as training input to
Support Vector Machines (SVM) to classify the sentiment polarity of the reviews. The review dataset was
collected manually from the JEERAN website. The results of the experiments show that the proposed
approach improves polarity classification in comparison to two baseline models, with an accuracy of 95.6%.
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language, which is the second most spoken language in Africa and does not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector operates essentially on Mijinguini’s dictionary and the characteristics of the Hausa alphabet. After a brief review of spell checking and spell correcting techniques and the state of the art in Hausa language processing, we opted for the trie and hash table data structures to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The spell corrector has been implemented in a special editor developed for that purpose (LyTexEditor) and also as an extension (add-on) for OpenOffice.org. The performance of the two data structures was compared.
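Detection by dictionary lookup and correction by edit distance can be sketched as follows. This is a minimal illustration, not the corrector described above: a Python `set` plays the role of the hash table, the trie variant and the Hausa-specific alphabet handling are omitted, the three-word lexicon is a toy sample rather than Mijinguini's dictionary, and the names `edit_distance`, `suggest`, and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2):
    """Return the word itself if it is known, else close lexicon entries."""
    if word in lexicon:  # hash-based membership test
        return [word]
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


# Toy lexicon for illustration only.
lexicon = {"gida", "giwa", "ruwa"}
suggestions = suggest("gidda", lexicon)
```

Scanning the whole lexicon costs one distance computation per entry; a trie allows the search to abandon prefixes that already exceed `max_distance`, which is one reason to compare the two data structures.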
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Development of Bi-Directional English To Yoruba Translator for Real-Time Mobi...CSCJournals
Machine translation (MT) is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. Translating between English and Yoruba comes with some computational complexities, such as syntactic and grammatical differences in the language pair. This paper explores a multi-layer hybridized language translation approach, which combines the corpus-based and rule-based approaches to machine translation to generate its outputs. A parallel corpus was built with texts in English and Yoruba and stored in a MySQL database. One hundred and forty seven computational rules were manually formulated and also stored in the MySQL database for generating sentences in both languages. A di-bilingual dictionary was developed: one dictionary stores words in English with their corresponding Yoruba counterparts and their equivalent parts of speech, while the other stores words in Yoruba with their corresponding English counterparts and their equivalent parts of speech. A real time mobile chatting interface was developed for users’ interactions with each other and with the system. The research model was implemented using PHP for server-side scripting, JSON for data interchange and the Java programming language for user interfaces accessible on users’ mobile phones. The Java code was written in the Android Studio 3.0 Integrated Development Environment. Two hundred and eleven sentences from Contemporary English Grammar were used for system testing, and the result shows 95% accuracy compared with Google Translate.
Customer sentiment analysis for Arabic social media using a novel ensemble m...IJECEIAES
Arabic’s complex morphology, orthography, and dialects make sentiment analysis difficult. This activity makes it harder to extract text attributes from short conversations to evaluate tone. Analyzing and judging a person’s emotional state is complex. Due to these issues, interpreting sentiments accurately and identifying polarity may take much work. Sentiment analysis extracts subjective information from text. This research evaluates machine learning (ML) techniques for understanding Arabic emotions. Sentiment analysis (SA) uses a support vector machine (SVM), AdaBoost classifier (AC), maximum entropy (ME), k-nearest neighbors (KNN), decision tree (DT), random forest (RF), logistic regression (LR), and naive Bayes (NB). A model for the ensemble-based sentiment was developed. Ensemble classifiers (ECs) with 10-fold cross-validation out-performed other machine learning classifiers in accuracy (A), specificity (S), precision (P), F1 score (FS), and sensitivity (S).
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
DEVELOPMENT OF WEB APPLICATION FOR PACKAGING DESIGNijma
The majority of One Tambon One Product (OTOP) entrepreneurs desired a new packaging design that attracts the attention of consumers. The aims of this research were to 1) determine the packaging demands of entrepreneurs, 2) develop a conceptual framework for web applications, 3) create web applications, and 4) ascertain entrepreneurs' satisfaction with the use of web applications in packaging design. The population and sample were drawn from the central region's entrepreneurs and customers. Purposive sampling was used to choose 400 entrepreneurs and customers in Saraburi province. The main finding was that entrepreneurs required packaging that is easy to carry and a web application that is easy to use. According to expert opinion, the developed web application was rated at a high level overall, and entrepreneurs' satisfaction with the web application for designing packaging was also high. The benefit of this research is that entrepreneurs have a web application for designing packaging at a lower cost.
PATHS state of the art monitoring reportpathsproject
This document provides an update to an Initial State of the Art Monitoring report delivered by the project. The report covers the areas of Educational Informatics, Information Retrieval and Semantic Similarity relatedness.
Speech Recognition Application for the Speech Impaired using the Android-base...TELKOMNIKA JOURNAL
Those who are speech impaired (tunawicara in the Indonesian language) suffer from
abnormalities in their delivery (articulation) of the language as well as in their voice in normal speech, resulting
in difficulty in communicating verbally within their environment. Therefore, an application is required that
can help and facilitate conversations. In this research, the authors have developed a
speech recognition application that can recognise the speech of the speech impaired and translate it into
text, with input in the form of sound detected on a smartphone. The application uses the Google Cloud Speech
Application Programming Interface (API) to convert audio to text; the API is also user-friendly.
The Google Cloud Speech API integrates with Google Cloud Storage for data storage.
Although speech-to-text recognition has been widely researched, this research tries to develop
speech recognition specifically for the speech of the speech impaired, and performs a likelihood calculation to
assess the influence of tone, pronunciation, and speech speed on speech recognition. The test was conducted by
speaking the digits 1 through 10. The experimental results showed that the recognition rate for the
speech impaired is about 80%, while the recognition rate for normal speech is 100%.
Quantitative And Qualitative Evaluation Of F/Oss Volunteer Participation In D...ijseajournal
Free/Open Source Software (F/OSS) is an incredible and innovative opportunity of software development
in the area of software engineering. An F/OSS project evolves by receiving submissions from various
sources to address different aspects of the project like bug identification, feature request, support request,
translation request, source code, documentation etc. The present paper delves into a multi-case study of
F/OSS projects to evaluate volunteer participation in defect management quantitatively as well as
qualitatively. The relevant defect data has been retrieved from a research collaboratory. It is found that
generally a small core team is surrounded by a large community of volunteers participating in defects. It is
observed that defect reporting is a widely dispersed activity mostly contributed by volunteers external to
core team making occasional contribution while defect resolution is concentrated among a few individuals
mainly from core team making regular contribution.
Hybrid approaches for automatic vowelization of arabic textsijnlc
Hybrid approaches for automatic vowelization of Arabic texts are presented in this article. The process is
made up of two modules. In the first one, a morphological analysis of the words of the text is performed using the
open source morphological analyzer AlKhalil Morpho Sys. The outputs for each word, analyzed out of context,
are its different possible vowelizations. Integrating this analyzer into our vowelization system required
the addition of a lexical database containing the most frequent words in the Arabic language. Using a
statistical approach based on two hidden Markov models (HMMs), the second module aims to eliminate the
ambiguities. For the first HMM, the unvowelized Arabic words are the observed states and the
vowelized words are the hidden states. The observed states of the second HMM are identical to those of the
first, but the hidden states are the lists of possible diacritics of the word without its Arabic letters. Our
system uses the Viterbi algorithm to select the optimal path among the solutions proposed by AlKhalil Morpho
Sys. Our approach opens an important way to improve the performance of automatic vowelization of
Arabic texts for other uses in automatic natural language processing.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
Much work on automatic translation has addressed major European language pairs by taking advantage of large-scale parallel corpora, but very little research has been conducted on the Amharic-Arabic language pair because of the scarcity of parallel data. Moreover, there is no benchmark parallel Amharic-Arabic text corpus available for the machine translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent Amharic translation available on Tanzile. Experiments are carried out on two Neural Machine Translation (NMT) models, based on Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), using an attention-based encoder-decoder architecture adapted from the open-source OpenNMT system. The LSTM- and GRU-based NMT models and the Google Translation system are compared, and the LSTM-based OpenNMT is found to outperform the GRU-based OpenNMT and the Google Translation system, with BLEU scores of 12%, 11%, and 6% respectively.
USING OBJECTIVE WORDS IN THE REVIEWS TO IMPROVE THE COLLOQUIAL ARABIC SENTIME...ijnlc
One of the main difficulties in sentiment analysis of the Arabic language is the presence of
colloquialisms. In this paper, we examine the effect of using objective words in conjunction with sentimental
words on sentiment classification for colloquial Arabic reviews, specifically Jordanian colloquial
reviews. The reviews often include both sentimental and objective words; however, most existing
sentiment analysis models ignore the objective words because they are considered useless. In this work, we
created two lexicons: the first includes the colloquial sentimental words and compound phrases, while the
other contains the objective words associated with values of sentiment tendency based on a particular
estimation method. We used these lexicons to extract sentiment features that serve as training input to
Support Vector Machines (SVM) to classify the sentiment polarity of the reviews. The reviews dataset was
collected manually from the JEERAN website. The results of the experiments show that the proposed
approach improves polarity classification in comparison to two baseline models, with an accuracy of 95.6%.
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language, which is the second most spoken language in Africa and does not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector operates essentially on Mijinguini’s dictionary and the characteristics of the Hausa alphabet. After a brief review of spell checking and spell correcting techniques and the state of the art in Hausa language processing, we opted for the trie and hash table data structures to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The spell corrector has been implemented in a special editor developed for that purpose (LyTexEditor) and also as an extension (add-on) for OpenOffice.org. A comparison was made of the performance of the two data structures used.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Development of Bi-Directional English To Yoruba Translator for Real-Time Mobi...CSCJournals
Machine translation (MT) is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. Translating between English and Yoruba comes with some computational complexities, such as syntactic and grammatical differences in the language pair. This paper explores a multi-layer hybridized language translation approach, which combines the corpus-based and rule-based approaches of machine translation to generate its outputs. A parallel corpus was built with texts from English and Yoruba and stored in a MySQL database. One hundred and forty-seven computational rules were manually formulated and also stored in the MySQL database for generating sentences in both languages. Two bilingual dictionaries were developed: one stores words in English with their corresponding Yoruba counterparts and their equivalent parts of speech, while the other stores words in Yoruba with their corresponding English counterparts and their equivalent parts of speech. A real-time mobile chatting interface was developed for users’ interactions with one another and the system. The research model was implemented using PHP for server-side scripting, JSON for data interchange and the Java programming language for user interfaces accessible on users’ mobile phones. The Java code was written in the Android Studio 3.0 Integrated Development Environment. Two hundred and eleven sentences from Contemporary English Grammar were used for system testing, and the result shows 95% accuracy compared with Google Translate.
Customer sentiment analysis for Arabic social media using a novel ensemble m...IJECEIAES
Arabic’s complex morphology, orthography, and dialects make sentiment analysis difficult. This activity makes it harder to extract text attributes from short conversations to evaluate tone. Analyzing and judging a person’s emotional state is complex. Due to these issues, interpreting sentiments accurately and identifying polarity may take much work. Sentiment analysis extracts subjective information from text. This research evaluates machine learning (ML) techniques for understanding Arabic emotions. Sentiment analysis (SA) uses a support vector machine (SVM), AdaBoost classifier (AC), maximum entropy (ME), k-nearest neighbors (KNN), decision tree (DT), random forest (RF), logistic regression (LR), and naive Bayes (NB). A model for the ensemble-based sentiment was developed. Ensemble classifiers (ECs) with 10-fold cross-validation out-performed other machine learning classifiers in accuracy (A), specificity (S), precision (P), F1 score (FS), and sensitivity (S).
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Computational Engineering Research(IJCER)ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
DEVELOPMENT OF WEB APPLICATION FOR PACKAGING DESIGNijma
The majority of One Tambon One Product (OTOP) entrepreneurs wanted a new packaging design that attracts the attention of consumers. The aims of this research were to 1) determine the packaging demands of entrepreneurs, 2) develop a conceptual framework for web applications, 3) create web applications, and 4) ascertain entrepreneurs' satisfaction with the use of web applications in packaging design. The population and sample were drawn from the central region's entrepreneurs and customers. Purposive sampling was used to choose 400 entrepreneurs and customers in Saraburi province. The main result was that entrepreneurs required packaging that is easy to carry and a web application that is easy to use. In the opinion of experts, the overall result of the web application development was rated at a high level, and satisfaction with the web application that helps entrepreneurs design packaging was also high. The benefit of the research is that entrepreneurs now have a web application for designing packaging at lower cost.
PATHS state of the art monitoring reportpathsproject
This document provides an update to an Initial State of the Art Monitoring report delivered by the project. The report covers the areas of Educational Informatics, Information Retrieval and Semantic Similarity relatedness.
Slides for online briefing on the OER Rapid Innovation Call released in November 2011: http://bit.ly/rNQsW3
Bid deadline 27th January 2012. Amber Thomas, JISC.
An unsupervised approach to develop ir system the case of urduijaia
Web search engines are among the best gifts of Information and Communication Technologies to mankind.
Without search engines it would have been almost impossible to access efficiently the
information available on the web today. They play a vital role in the accessibility and usability of
internet-based information systems. As the number of internet users increases day by day, so does the amount of
information available on the web. But access to information is not uniform across all
language communities. Apart from English and the European languages, which account for 60% of the
information available on the web, a wide range of information is also available on the internet in
other languages. In the past few years the amount of information available in Indian languages
has also increased. Apart from English and a few European languages, there are no tools and techniques
available for the efficient retrieval of this information. Especially in the case of
the Indian languages, research is still at a preliminary stage, and there are not enough tools
and techniques available for efficient information retrieval.
Indian languages are very resource-poor in terms of IR test data collections,
so our main focus was on developing the data set for Urdu IR and the training and testing data for the
stemmer.
We have developed a language-independent system to facilitate efficient retrieval of information available
in the Urdu language, which can be used for other languages as well. The system achieves a precision of 0.63 and
a recall of 0.8. As a first step, we developed an unsupervised stemmer for the Urdu
language [1], as stemming is very important in information retrieval.
Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio...IJECEIAES
The heavy involvement of Arabic internet users has resulted in the spread of data written in the Arabic language and created a vast research area in natural language processing (NLP). Sentiment analysis is a growing field of research that is of great importance given its high potential for decision-making and for predicting upcoming actions using texts produced in social networks. The Arabic used on microblogging websites, especially Twitter, is highly informal. It complies with neither standards nor spelling rules, making it quite challenging for automatic machine-learning techniques. In this paper, we propose a new approach based on AutoML methods to improve the efficiency of the sentiment classification process for dialectal Arabic. This approach was validated through benchmark testing on three different datasets that represent three vernacular forms of Arabic. The obtained results show that the presented framework achieves significantly higher accuracy than similar works in the literature.
Introduction to Building Ontologies Using Protégé Hend Al-Khalifa
Ontology is considered one of the most important components of the Semantic Web and the cornerstone of how it works. It is defined as a way of representing the concepts around us by linking them through meaningful relationships, which supports a broader understanding of the different concepts. In this way we can bring the computer to a level of understanding and comprehension of meaning close to that of a human.
To build an ontology manually, supporting tools are needed, and one of the most widely used is Protégé. In this short lesson we explain, step by step, how to build an Arabic-language ontology for a particular domain using Protégé.
Note that this lesson is a simplified introduction to ontology engineering using the aforementioned program and does not go into many of the details of the fundamentals of ontology engineering. God willing, this lesson may become the nucleus of a complete tutorial guide in Arabic (updated periodically) introducing the field of ontology engineering and its various applications.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide some fields in Odoo, commonly by using the “invisible” attribute in the field definition. This slide will show how to make a field invisible in Odoo 17.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Normal Labour/ Stages of Labour/ Mechanism of LabourWasim Ak
Normal labor is also termed spontaneous labor, defined as the natural physiological process through which the fetus, placenta, and membranes are expelled from the uterus through the birth canal at term (37 to 42 weeks
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
1. Executive Summary
This is the pre-final report on the state of the art of Human Language Technologies (HLT)
for Arabic, as drafted in December 2008. It follows the work carried out within NEMLAR
(www.nemlar.org; for more details see
http://www.medar.info/The_Nemlar_Project/Publications/NEMLAR-REPORT-SURVEY-FINAL_web.pdf).
This document aims at describing the work done with respect to surveying existing institutions and
experts involved in the development of Arabic Language Resources carried out in 2008 (since the
first report of NEMLAR). It surveys the activities and projects, existing language resources and tools,
important language resources and tools, as well as the experts. The Summary of the findings (facts &
figures) is given herein.
The survey was launched in April 2008, and all partners were encouraged to fill in the questionnaire for
their institution and to have it filled in by their partners. As of today, 57 questionnaires have been filled in.
Some of them have been entirely completed (37), while some 17 questionnaires are missing some of the
answers but are still considered for their usefulness. A number of countries are well represented (e.g.
11 responses from Egypt, 10 from Morocco). A large number of players are listed for the first time
(e.g. a Syrian and a Turkish lab, compared to the NEMLAR report). We feel that the Internet-based
questionnaire was easier to use than the approach used in the past (Word files sent by email).
An important part of the survey concerns the technologies our respondents consider important for Arabic,
and they listed a large set. Many of them corroborate our own findings, placing MT, CLIR/MLIR, and
ASR at the top. They also listed a number of crucial resources that should be better specified and
defined by MEDAR in the framework of its updating of the BLARK.
In addition to the survey, the ELDA team also collected information about MT and CLIR/MLIR tools
and products that address Arabic as one of their languages. This information is part of this report.
2. Introduction to the MEDAR Survey
This survey was carried out within the MEDAR project and aims to provide an overview and an
analysis of the situation with respect to language technology for Arabic in the region. Although
MEDAR focuses on tools related to machine translation and information retrieval, the ultimate goal is
to build an accurate knowledge base of the language technology players, projects (ongoing activities),
products etc.
The survey was conducted by MEDAR partners using an online web questionnaire covering all
Mediterranean countries participating in the project, resulting in a knowledge base with details of all
universities, research institutions and companies, as well as ongoing projects and existing products
related to tools and Language Resources (LRs), in particular for MT, information retrieval and
indexing. The partners, as far as possible, attempted to contact the players to collect information about
existing Arabic LRs and tools for Arabic.
In addition to the objective of updating the directory of players, resources, and tools, the survey aims
at identifying for the technologies mentioned above (MT, CLIR/MLIR) what is already available, and
where there are gaps, or tools or resources that have to be updated and improved in order to fit the
specifications.
Consequently, this work will provide a substantial part of the necessary basis for detailed work on
specifying, updating, or creating language resources and tools for MT and CLIR/MLIR with
the Arabic language as one of the components.
MEDAR MEDAR /Survey/ Arabic LR Date: 08/04/09 Page 3/40
3. The MEDAR Online Survey and the questionnaire structure
In order to ensure a larger number of replies to the MEDAR survey, we opted for an online
questionnaire using a web-based survey tool called LimeSurvey
(http://docs.limesurvey.org/), an open source tool that makes it easy to set up user-friendly
surveys and to collect the information in various formats that are easy to analyze and
exploit. The tool also supports asking a question and continuing the questionnaire according to the
answer received (an easy "tree" interface).
The tool was easy to customize, so the respondents were presented with questions group by group.
Responses were date stamped, and IP addresses were logged (and the referrer URL saved) for future
exploitation.
Participants could reply to the survey over more than one visit if they wished, since the tool saves partially
finished surveys.
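The date stamping and partial-save behaviour described above can be illustrated with a minimal sketch. It is not the survey tool's implementation or API: the function names `save_progress` and `is_complete`, the in-memory `store`, and the example question keys are all hypothetical, chosen only to show how a partially finished response might be kept and resumed on a later visit.

```python
from datetime import datetime, timezone


def save_progress(store, respondent_id, answers, ip_address):
    """Merge new answers into the stored record and stamp the save time.

    `store` is a plain dict standing in for the survey database (assumption).
    """
    record = store.get(respondent_id, {"answers": {}, "ip": ip_address})
    record["answers"].update(answers)          # keep answers from earlier visits
    record["saved_at"] = datetime.now(timezone.utc).isoformat()  # date stamp
    store[respondent_id] = record
    return record


def is_complete(record, required_questions):
    """A response counts as complete once every required question is answered."""
    return all(q in record["answers"] for q in required_questions)


# Two visits by the same (hypothetical) respondent.
store = {}
required = ["name", "email"]
save_progress(store, "r1", {"name": "A. Respondent"}, "192.0.2.1")
first_visit_complete = is_complete(store["r1"], required)
save_progress(store, "r1", {"email": "a@example.org"}, "192.0.2.1")
second_visit_complete = is_complete(store["r1"], required)
```

After the first call the record is stored but incomplete; the second call adds the missing answer without discarding the first one, which is the essence of letting participants reply over more than one visit.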
The major challenge was to ensure that filling in the questionnaire would not take more than 5 minutes. The
questionnaire was set up on the basis of 6 groups of questions and an introduction.
This is the introduction to the questionnaire and the questions (for more details please refer to
MEDAR report D2.1):
MEDAR Survey
MEDAR & NEMLAR
A Network for Euro-Mediterranean LAnguage Resource
and
Human Language Technology development and support
A Follow-up of a FIRST SURVEY ON HLT Experts
AND LANGUAGE RESOURCES
Conducted within the NEMLAR project in 2003
Dear Colleague,
Language Resources (LRs) are recognized as a central component of the linguistic infrastructure,
necessary for the development of Human Language Technologies (HLT), and therefore for industrial
development. Other purposes may be served by the availability of LRs such as content industry,
cultural heritage safeguarding, etc. The availability of adequate LRs for as many languages as
possible and, in particular, of multilingual LRs, is a pre-requisite for the development of a truly
multilingual Information Society.
The issue of HLT based on Arabic language is getting more and more prominent; the lack, on the one
hand, of useful resources, and, on the other hand, of real-world publicly available applications,
highlights the need for improving R&D in this area and for promoting the use of HLT among the
potential partners, in particular to safeguard some of the cultural heritages of this geographical area.
In many areas and business sectors, large companies produce their own resources for the languages
for which some business can be made, and often no resources are built for the less “lucrative”
languages.
In order to overcome this handicap, the NEMLAR project (February 2003 - July 2005) and its follow-up
MEDAR (February 2008 - August 2010) would like to ensure that the Arabic language obtains the
necessary funds to produce the required resources and tools, and to make them as widely available as for
many other major languages.
In order to have a better picture of the Arabic HLT scene, the MEDAR project would like to update the
data collected during the NEMLAR project (Final Survey), which helped promote Arabic HLT in
the academic and industrial worlds as well as vis-à-vis the potential funding agencies.
The goal of this new survey is to collect information about the existing institutions and Language
Resources, and to describe the needs for language resources, etc. This task is being implemented in
three phases.
The first phase aims to revise and update the data collected within NEMLAR (general information
about Language Resources & Tools for HLT within the members of the NEMLAR network who
contributed to the first report). This is the purpose of this first survey.
The second phase is to go beyond this first list and the basic information, contacting new institutions
recommended by the partners, and also detailing the descriptions of what has been identified in the
first phase, (players, products, Language Resources, needs and requirements).
The final phase will aim at drafting a comprehensive report that may serve as the basis for a work
plan about the needs for multilingual resources targeting customization of Machine Translation
(including speech to speech translation), Cross-lingual information Retrieval, and other speech
recognition tools. The ultimate goal is to commission some work to produce a Basic Kit that would
support such customization.
There are 63 questions but most of them are easy to address (Yes / No questions).
The structure of the questionnaire with the 6 groups is briefly described below:
Group 1 is the contact information (name, email, etc.)
The final question of this group is:
*You are answering this survey as an
Choose one of the following answers
Independent expert/Entrepreneur
Institution
Group 2: Information about your institution and its language technology
If you are not answering for an institution, please go back to the previous page using the
"<<prev" button on this page and select "Independent expert/entrepreneur" at the last question.
This group 2 consists of 16 questions including
"Number of employees (directly or indirectly) involved in language technologies"
"Your institution's main activity (you may choose more than one choice)"
"Is your institution involved in Language Technologies"
"If "Speech technologies", please specify"
"If "Written technologies", please specify"
"What are your institution's main products and/or services (please list)"
"Do they include the Arabic language",
etc.
Group 3: Information about your language resources
Please select as many boxes as appropriate, list the languages whenever appropriate (e.g. the
ones containing Arabic) and add details about nature, size, etc. whenever appropriate and
possible (e.g. for a corpus of business documents, you may state that it consists of 2 million words;
for an Arabic-English dictionary, 50,000 entries, etc.)
This group (certainly the most important for our work on MEDAR) is subdivided into almost 20
questions, organized into a hierarchical "tree" like 3.1:
Language Resources type
Check any that apply
Speech Resources
Written Resources
Multimedia/multimodal Resources
Other:
If the respondent chooses only one type (e.g. "Written Resources"), then only questions related to that
item are presented in the following sections of the questionnaire (see details in the annexes).
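The branching described above can be sketched in a few lines of Python. This is only an illustration, not the survey tool's actual logic: the `FOLLOW_UPS` mapping and the `follow_up_questions` function are hypothetical names, and the follow-up labels are taken from the question titles reported later in this document.

```python
# Hypothetical mapping from a resource type ticked in question 3.1
# to the follow-up question shown to the respondent.
FOLLOW_UPS = {
    "Speech Resources": 'If "Speech Resources" please select',
    "Written Resources": 'If "Written Resources", please select',
    "Multimedia/multimodal Resources": 'If "Multimedia/multimodal Resources", please select',
}


def follow_up_questions(selected_types):
    """Return the follow-up questions for the resource types ticked in 3.1.

    Types without a dedicated follow-up (e.g. "Other") are skipped.
    """
    return [FOLLOW_UPS[t] for t in selected_types if t in FOLLOW_UPS]


# A respondent who ticks only "Written Resources" sees only that branch.
print(follow_up_questions(["Written Resources"]))
```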
Group 4: Information and input about the Market:
This group with 5 questions aimed at getting some hints about the market targeted by the respondents.
The first question (4.1) is:
Are your products and/or services distributed and/or offered to the:
Check any that apply
Domestic market
Arabic world
International market
Question 4.2 is about partnerships between the respondent institution and other players, in order to
identify potential cooperation with organizations in the Arab world. A question was asked about the
names of such partners.
The following question (4.3) also bears directly on our project, as it asks about the financial plans
of our experts:
Do you purchase or plan to purchase Language Resources? (Euro/year)
Only numbers may be entered in these fields
How much do you spend for data acquisition :
How much do you spend for data production :
Group 5 is about the needs for LRs:
This is the crucial group for the future tasks of the project. It requests information about the
expectations of the surveyed expert if he/she had to decide (the answers are free text):
5.1. Which Language Resources should be available?
5.2 For which applications?
5.3. Which design and structure of Language Resources in general would you prefer?
Group 6 is a request for more contacts, so that we can circulate the questionnaire widely.
4. The MEDAR survey: summary of the figures and facts
Thanks to the involvement of all MEDAR partners we managed to obtain 54 responses for this survey
(37 full responses and 17 responses that were not completely filled out but still provide good
information). The results given below comprise the detailed number of responses to each question, and
the percentages are computed on the basis of the 54 responses.
4.1. Identification of the respondent:
The 57 respondents are identifiable by name, first name, etc. The types of positions reported are listed
below in alphabetical order to highlight the quality of the respondents and thus the quality of the
responses obtained:
Position Nb of respondents
Assistant professor 4
Associate Professor 2
Associate Research Scientist 1
CEO 6
Consultant of Human Language Technologies 1
Dean 2
Deputy Director 2
Director 3
Founder & Chief Scientist 1
General Manager 3
Head of Department 1
Human Language Technologies Group Manager 1
IT Instructor 1
Lab Technician 1
Laboratory Head 1
Lecturer of English & Arabic 1
Linguist 1
PhD student 2
Professor 9
Project Manager 1
Research Assistant 2
Researcher 3
Student 1
Teacher researcher 1
The countries from which the replies originated are given by this diagram:
The details per country are:
Answer Count Percentage
Egypt (EGY) 11 20.37%
Morocco (MAR) 10 18.52%
West Bank & Gaza Strip (PSE) 9 16.67%
Jordan (JOR) 3 5.56%
Czech Republic (CZE) 2 3.70%
France (FRA) 2 3.70%
Lebanon (LBN) 2 3.70%
Saudi Arabia (SAU) 2 3.70%
Spain (ESP) 2 3.70%
United States of America (USA) 2 3.70%
Algeria (DZA) 1 1.85%
Israel (ISR) 1 1.85%
Japan (JPN) 1 1.85%
Syrian Arab Republic (SYR) 1 1.85%
Turkey (TUR) 1 1.85%
No answer 4 7.41%
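The percentages in the table above can be reproduced from the counts. The sketch below assumes a base of 54 responses (37 full plus 17 partial), which matches the reported figures, for example 11/54 for Egypt. The names `TOTAL_RESPONSES`, `country_counts` and `percentage` are illustrative only, and only a few rows of the table are copied in.

```python
# Assumption: all percentages in the report use the 54 collected responses as base.
TOTAL_RESPONSES = 54

# A few rows copied from the country table above.
country_counts = {
    "Egypt (EGY)": 11,
    "Morocco (MAR)": 10,
    "West Bank & Gaza Strip (PSE)": 9,
    "Jordan (JOR)": 3,
    "No answer": 4,
}


def percentage(count, total=TOTAL_RESPONSES):
    """Share of all responses, formatted with two decimals as in the tables."""
    return f"{100 * count / total:.2f}%"


for answer, count in country_counts.items():
    print(answer, count, percentage(count))
```

The same function applies to the other tables of this section that report a "Percentage" column.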
This item has to be interpreted in light of the other items (topics of interest, position, etc.) to balance
the fact that some countries are over-represented: some experts are interested in language technologies
and assume they would incorporate some in their own business, but they do not claim to be active
players.
4.2. Profile of the Respondent:
The profiles of the respondents were collected so that we can distinguish independent experts
from institutions, and gauge their involvement in HLT & LRs.
No answer 1 1.85%
Independent expert/Entrepreneur (1) 17 31.48%
Institution (2) 36 66.67%
Non completed 0 0
In addition to the individuals that replied without mentioning explicitly their institution, the following
ones were listed:
ACS TechnoCenter
AlKhawarizmy Language Software
Arab Academy for Banking and Financial Sciences
ARABIC TEXTWARE
Arabize
Cairo Microsoft Innovation Center in Egypt (CMIC)
COLTEC
Columbia University Center for Computational Learning Systems
CRSTDLA (Scientific & technical Research Center for Arabic
Language Development)
ELDA
ENSIAS
European Trading & Technology ( eurotec )
Faculty of Computers and Information, Cairo University
France Telecom R&D ORANGE Labs
GIS Int.
Higher Institute for Applied Sciences and Technology (HIAST)
IBM (IBM Egypt)
Indiana University
Insan Center
Institute for the Studies and Researches on Arabization (Institut
d'Etudes et Recherches pour l'Arabisation)
Isra' Software & Computer Co. Ltd
IT College / Birzeit University
King Abdulaziz City for Science and Technology
Laboratoire de Recherche en Informatique et Telecommunications
Faculte des Sciences
Lebanese University
ManarahNet Modern Software Co.
Millennium Technology
RDI
TALP Research Center - Universitat Politecnica de Catalunya
The CJK Dictionary Institute
Unit for Learning Innovation- Birzeit University
University Cadi Ayyad - Faculty of Sciences, Marrakech, Morocco
Those who indicated the type of institution they work for listed the following:
Type of institution Answer Count
Company & for profit organisation (1) 15
University (2) 12
Public Research center (3) 7
Public organisation (4) 0
An important question concerned the total number of employees versus the number of those involved in
HLT:
Number of employees Answer Count
Less than 10 (1) 5
10-49 (2) 9
50-99 (3) 8
Over 100 (4) 9
When focusing on the HLT & LRs sectors we obtained the following answers:
Number of employees (directly or indirectly) involved in language technologies
Answer Count
Less than 10 (1) 18
10-49 (2) 11
50-99 (3) 1
Over 100 (4) 1
The main activity of the institution (respondents could choose more than one choice):
Answer Count Percentage
Software developer (1) 12 22.22%
Teaching/training organisation (e.g. university) (2) 16 29.63%
HLT Product Vendor (3) 4 7.41%
Culture/Museum (4) 0 0
Technology Transfer institution (5) 9 16.67%
Minority language organisation (6) 0 0
Content provider (7) 3 5.56%
Interpreting/Translating/Localisation (8) 6 11.11%
Telecommunications (9) 2 3.70%
E-commerce (10) 2 3.70%
Banking/Insurance (11) 0 0
Other 9 16.67%
Browsing through the "Other" responses, we have:
Center of Excellence for Data Mining and Computer Modeling (DMCM)
lexical databases development
Technology Development
Doing and supporting research within Saudi Arabia
Research activities
Database
legal databases
computers maintenance
It is important to stress the fact that a large number of key sectors (e-content,
Translation/interpretation, software integrator/developer) are represented.
4.3. Involvement of the players in HLT & LRs:
When asked about their involvement in HLT and LRs, we obtained:
Is your institution involved in Language Technologies
Answer Count Percentage
No answer 14 25.93%
Yes (Y) 22 40.74%
No (N) 5 9.26%
Non completed 13 24.07%
Those who responded positively to the question on their involvement in HLT indicated the following
areas (more than one answer):
Involvement in HLT and related sectors
Answer Count Percentage
Language learning (1) 6 11.11%
Language Resources production (2) 13 24.07%
Speech technologies (3) 11 20.37%
Written technologies (4) 14 25.93%
Search and knowledge mining (5) 13 24.07%
Translation automation (6) 7 12.96%
Other (Language Resources) 1 1.85%
The technologies, as they were mentioned (including duplicates), are listed herein:
LRs for speech technology development:
Text to Speech
ASR, TTS, Speech Verification for assisting in the self learning of spoken
language
Speech synthesis, speech recognition using open source
TTS, Speech Recognition
Speech Recognition and TTS
Automatic Speech Recognition, Text to Speech, Speaker Verification
Speech synthesis and speech recognition
Speech recognition, speech synthesis, machine translation
LRs for language (text) technology development
Morphological analysis, PoS tagging, phonetic transcription, lexical semantics,
text search, text mining, - Arabic omni font written OCR
Linguistic Processing
Morphology, syntax
OCR
Arabic Spell/Grammar Checking, Arabic-to-European language transliteration,
semantic analysis
Writing Scorer, Auther Verification, Mining, Translation
Online handwriting recognition
Spell Checker, Morphological analyzer
Natural Langiuage Processing, Machine Translation, Knowledge
Representation, Information extraction and question-answer systems
Semantics
Research
Infra structure for Arabic language esp. morphological, PoS taging, and
Semantic analysis layers serving for sharpening IR and TM over gigantic
Arabic content.
The main products and sectors of activity as indicated by the different players are listed herein
(without rephrasing or summarizing them):
Language resources ; Technology Evaluation Services
Arabic Lexical Semantic Database - News Tracking System (www.Alzoa.com) - Tutor
for Arabic Hand Writing - Tutor for Arabic pronunciation (research in progress)
Arabic ASR (Labeeb) - Arabic TTS (ArabTalk) - Arabic Speech verification (Hafss) -
Arabic Morphological Analyzer (Arab Morpho) - Arabic PoS Tagger (Arab Tagger) -
Arabic Phonetic Transcriptor (Arab Diac) - Arabic Text Search Engine (Swift) - Arabic
Lexical Semantic Analyzer (ALSA) - Arabic omni font written OCR (Clever Page)
Arabic speech synthesis using diphones and "demi-syllable" Arabic speech recognition
using sphinx4
Araterm CD (multilingual terminological Database) - Aragen CD (general language
Database) - terminological lexicons (Banque de données terminologiques Dictionnaires
electroniques
Teaching/training/search/ learning
Works - (reviews) : linguistic Research
Arabic Speech Recognition Systems - Arabic TTS - A2E and E2A Machine Translation
Systems - Information Retrieval and Extraction
Arabic Speech recognition Arabic text to speech system Speech Databases , Arabic
name Romanization system
Accounting Programs Clinics Programs Employee Programs Auto Respond For
Telephone Software Web Site Development
Research activities
Name variant databases (Arabic, Japanese, Chinese) Phonological database (Japanese)
Place name databases (Japanese, Chinese) Orthographic databases (Japanese, Chinese)
Legal databases legal studies and researches
ICT private sector activity
e- enabled curricula Training Evaluation Research Networking
PhD program in Computational Linguistics
Information Retrieval (IR), Collaborative Content Services (CCS), Digital Content
Services (DCS): DCS is a set of Services built to take advantage of the recent increase
in digitized content and books, Research on Information Extraction, Natural Language
Processing and Information Retrieval.
4.4. Multilinguality issues
Another important issue is the monolingual vs. multilingual aspect of the products offered by the
respondents:
Are your products and or services:
Answer Count
Monolingual (1) 13
Multilingual (2) 22
When asked whether these include the Arabic language, respondents replied:
Yes (Y) 28 51.85%
No (N) 0 0
Non completed 13 24.07%
No answer 13 24.07%
4.5. Information about the respondent's LRs:
To the question on the Language Resources type we received the following answers:
Answer Id Count Percentage
Speech Resources (1) 19 35.19%
Written Resources (2) 28 51.85%
Multimedia/multimodal Resources (3) 10 18.52%
Other (e.g. biometric data) 2 3.70%
When asked for details about the types of LRs used by the respondent along modalities e.g. speech,
text lexica, text corpora, other modalities, we received the following answers (more than one answer):
If "Speech Resources" please select
Answer Count Percentage
Broadcast news & conversational speech (1) 13 24.07%
Fixed telephone (2) 7 12.96%
Mobile telephone (3) 9 16.67%
Micro/desktop speech (4) 8 14.81%
In-car recording (5) 7 12.96%
Read newspaper texts (6) 9 16.67%
Pronunciation/phonetic lexica (7) 8 14.81%
Other 2 3.70%
If "Written Resources", please select
Answer Count
Lexical databases (1) 20
Terminology and specialized dictionaries (2) 11
Text Corpora (3) 20
If "Lexical databases", please select
Answer Count
Monolingual lexical databases (1) 17
Multilingual lexical databases (2) 9
Onomastica (proper and geographical name lexical) (3) 4
If "Terminology and specialized dictionaries"
Answer Count
Monolingual terminology databases (1) 7
Multilingual terminology databases (2) 8
If "Text Corpora", please select
Answer Count
Monolingual text corpora (1) 14
Multilingual and parallel text corpora (2) 8
Multilingual and Aligned text corpora (3) 5
If "Multimedia/multimodal Resources", please select
Answer Count
Face (1) 6
Image (2) 8
Video (3) 9
Finger prints (4) 3
Other 1
Regarding the sources of the LRs used by the respondent:
Answer Count
that are produced internally ? (1) 25
that are produced by specific contracted vendors ? (2) 8
that are distributed by data centres ? (3) 15
Other 3
Those who replied to our question regarding production of resources were asked about the tools they
use to design and produce LRs. The following ones were mentioned:
Speech recording platforms
Lexical Semantic Database: data are acquired from linguistic experts then
represented and organized electronically using exiting Data Base Management
Systems.
Other resources are collected and organized by locally developed tools
Internal tools for written Arabic text annotation (Fassieh); morphogical, phonetic,
PoS tagging, semantic, ...
Internal tools for segmenting speech. (Some of them are built on HTK)
Cool Edit for the preparatory stages and the monitoring of speech signals.
MS Office & SQL server are suite is also used at some phases of the production of
both kinds of resources.
Arabic speech recognition using open source tools
Scanner hp Printer hp 1100
Matlab
Audio recording and editing public tools
HTK HMM
IBM Internal tools
TrEd - TrEd is a fully customizable and programmable graphical editor and viewer
for tree-like structures (http://ufal.mff.cuni.cz/~pajas/tred/index.html)
Off the shelf tools esp. MS-Office suite esp. MS-Word and MS-Access.
Plain text editors like MS-Windows's NotePad.
MatLab, Praat, Delphi, Java, PHP/MySql
lexicon, structure analysis
Internal text annotation tool.
Gwave
MS-Office tools. Translation memories.
When asked about the standards & best practices they follow to design and produce LRs, the replies
are:
Do you follow specific standards?
Answer Count
None (1) 8
Internal specifications (2) 25
External specifications (3) 5
4.6. Some Market figures
The survey also asked questions about the respondents' visions on their needs and on the market.
When asked how often they review their needs for LRs:
How often do you re-evaluate your Language Resources needs and seek
available databases?
Answer Count
Monthly (1) 1
Once per quarter (2) 5
Once per semester (3) 3
Once per year (4) 15
Once every 1-2 years (5) 5
Never (6) 7
Other 2
Regarding the important issue of distribution, we got:
Would you be willing to make your resources available to others according to a negotiated
distribution agreement?
Answer Count Percentage
No answer 31 57.41%
Yes (Y) 19 35.19%
No (N) 1 1.85%
Non completed 3 5.56%
Those who answered positively to the question on distribution were very specific about whom they
would agree to supply their data to:
If "Yes", whom would you be ready to license your Language Resources to?
Answer Count
End-users (1) 10
Tool developers (2) 13
Researchers (3) 12
Of those who answered No to the question on distribution, only one respondent gave a reason,
described as "strategic".
Is/was your institution involved in any Language Resources project?
Answer Count Percentage
No answer 22 40.74%
Yes (Y) 12 22.22%
No (N) 17 31.48%
Non completed 3 5.56%
If "Yes", is/was it a project aiming at
Answer Count
Language resources production for your own use, please specify the causes, aims etc: (1) 6
Language resources packaging for others, please specify: (2) 4
Other, please specify: (3) 4
Are your products and/or services distributed and/or offered to the
Answer Count
Domestic market (1) 17
Arabic world (2) 14
International market (3) 18
Regarding the plans for purchasing LRs, the budgets seem to be steady over the next few years:
How big is your expected purchasing budget for Language Resources? (Euro/year)
Calculation Now 3-5 Years
Nb of respondents 9 10
Average 10100 12950
Maximum 40000 50000
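To put a figure on how the expected budgets evolve, the relative change between the "Now" and "3-5 Years" columns can be computed as in the sketch below. It uses only the averages and maxima reported in the table; the function name `relative_change` is illustrative, and the comparison is indicative only since the two columns are based on different numbers of respondents (9 and 10).

```python
# Figures copied from the budget table above (Euro/year).
average_now, average_later = 10100, 12950
maximum_now, maximum_later = 40000, 50000


def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before


growth = relative_change(average_now, average_later)
print(f"Average budget change: {growth:.1f}%")
print(f"Maximum budget change: {relative_change(maximum_now, maximum_later):.1f}%")
```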
When asked about the LRs market, respondents gave replies that can hardly be exploited.
The replies were:
- In Egypt (Few tens of Millions of USD; there are many producers with few purchasers)
- In the Arab World (few milliards of USD) (!! Probably billions?)
- Talking about Arabic 1000000
- Growing rapidly and will improve, since the whole world is communicating fast through
the internet.
- International LR's market is a tens-of-Billion of USD's market. This market in the middle-
east is very tiny compared to such numbers (perhaps just few millions of USD's)
- We estimate to need more multilingual aligned parallel corpora
- No reliable estimation is available at my reach
- No accurate idea. But my impressions about any market of Arabic LRs are to be quite
small.
5. The most important question was about the needs for LRs:
We list herein the replies categorized into speech, lexica (inc. Wordnet), corpora,
multimedia/multimodal, and tools:
Speech:
• Arabic conversational speech
• Multi-speaker colloquial/formal Arabic speech DB for speaker independent
small vocabulary ASR (office environment Speech + revised phonetic
transcription); 25,000 sentences over > 350 or speakers.
• Male and female speakers concatenative Arabic TTS data bases; (3,000
sentences over 4 hours clear speech + Electric Glottogram (EGG) signal +
revised phonetic transcription + revised phonetic segmentation).
• corpus for acoustic models corpus for language models
• Speech Resources for Arabic dialects and the Amazigh language
• Speech recordings in cars
Lexica, wordnets, …
• wordnets, Full scale Arabic WordNet –
• Arabic Verb Classes à la Beth Levin
• Validated comprehensive Arabic lexicon.
• Validated lexical semantics of Arabic multi-domain large text corpus (>
500K words) along with a standard formalism. (Arabic Lexical Semantics
set and hierarchy).
• Arabic (general language) Terminological resources
• morphosyntaxic-lexicons for Arabic words
• Lexicon, ontologies,
• Thesauri.
• Arabic proper names dictionaries.
Corpora: monolingual, bi/multilingual, various annotations
• text corpora special jargon
• Segmented Arabic Hand written corpus
• Parallel Corpora
• Idiomatic Databases and Corpora
• Validated morphologically analyzed Arabic multi-domain large text corpus
(> 500K words) along with a standard formalism (Morphological model).
• Validated POS tagged Arabic multi-domain large text corpus (> 500K
words) along with a standard formalism (Arabic POS tags set and tags
vector model).
• Validated phonetically transcribed Arabic multi-domain large text corpus
(> 500K words) along with a standard formalism (Arabic Phonetic
Grammar).
• Parallel corpora for the language pairs (Arabic-other languages)
• Parallel bi- and multilingual corpora
• Arabic Resources needed to learn the basic information of the language
• Arabic- English (in an ideal world, millions of sentences for every
language and every dialect, annotated on all levels)
• Written LRs consisting of gigantic corpora labeled as per proper names,
terminologies, ...,
• Validated labeled printed Arabic font-written text images corpus.
• Evaluation corpora
Multimedia/multimodal:
• Multimodal LRs consisting of AudioVisuals + content-rich textual material
(emails, blogs, Wikis, Forums).
• multimedia
Tools:
• Basic tools for the (text and speech) processing of the least processed
languages (Amazigh language and other languages spoken in Africa)
• A baseline of the NLP infrastructure LRs with Phonetically,
morphologically, syntactically, semantically, proper nouns/named-entities
Other
• Arabic printed omni font database
• Arabic grammar
And the corresponding applications (not prioritized yet though MT, CLIR/MLIR and ASR are
mentioned many times):
Arabic language learning for non native speakers
Information Retrieval + Text Mining.
Applications using the semantic level
ASR (speech recognition)
Bilingual News Tracking
CALL
Computer software
Discourse analysis with dictionary use
Document Management Systems (DMS). With a special focus on Arabic
within either monolingual or multilingual applications.
E-learning
Handwriting Recognition
IR and text search engines, web and search engine applications
Knowledge Mining
Language labs
Linguistics developing parsers
Morphology,
MT,
Omni font written OCR.
Part-of-speech tagging.
Question/Answering,
Screen Readers for the blind or visually impaired people.
SD-retrieval,
Spell checkers
Speech Verification.
Text mining.
Tools for (CCS) Collaborative Content Services
TTS.
Tutorial/e-learning of written/spoken languages.
Tutors for Teaching Handwriting
Voice Car navigation systems
A global platform was also mentioned as: For all NLP applications esp. Text Mining & IR,
Text-to-Speech, OCR, MT & MAT, and Language Learning.
6. Analysis of the online survey
From the survey we can at least extract information regarding the key players, the "hot" topics and
applications, and the needed resources. Such findings can reinforce our assumptions as stated in the
technical annex of the project:
6.1. A short list of key applications as reported within the survey:
The key technologies seem to be MT, ASR and CLIR/MLIR. Others were also frequently listed
(including some applications based on a combination of technologies) among which:
Information Retrieval + Text Mining.
ASR (speech recognition)
Bilingual News Tracking
IR and text search engines, web and search engine applications
Part-of-speech tagging and Parsers
Spoken Document Retrieval
But as the MEDAR project focuses on multilingual tools, we will concentrate on MT and
CLIR/MLIR.
6.2. A short list of key resources as indicated by the respondents:
The short list per type of resources:
Speech:
• Arabic conversational speech
• Speech Resources for Arabic dialects and the Amazigh language
• Speech recordings in cars
Lexica, wordnets, …
• wordnets, Full scale Arabic WordNet
• Validated comprehensive Arabic lexicon.
• Validated lexical semantics of Arabic multi-domain large text corpus (>
500K words) along with a standard formalism. (Arabic Lexical Semantics
set and hierarchy).
Corpora: monolingual, bi/multilingual, various annotations
• Text corpora special jargon
• Parallel Corpora
• Validated morphologically analyzed Arabic multi-domain large text corpus
(> 500K words) along with a standard formalism (Morphological model).
• Parallel corpora for the language pairs (Arabic-other languages)
• Arabic- English (in an ideal world, millions of sentences for every
language and every dialect, annotated on all levels)
• Written LRs consisting of gigantic corpora labeled as per proper names,
terminologies, ...,
7. Additional findings by ELDA and the MEDAR partners
Following the analysis of the questionnaire, ELDA is working on a specific interview form to collect
more detailed information from Partners and those who replied to the survey regarding MT and
CLIR/MLIR tools and systems that could be made available.
While doing this ELDA has collected information from various sources such as LREC proceedings,
European funded projects, Evaluation campaigns workshops (e.g. NIST/MT, Evalda-CESTA, etc.).
7.1. MT findings
The first result of this information collection is given in the following matrix, which lists all identified
MT systems that include Arabic either as a source or a target language:
Source of the "product"  Product name  Arabic->English  Arabic->French  Arabic->Spanish  English->Arabic  French->Arabic  Spanish->Arabic
Commercial Ajeeb x x
Commercial Al Misbar x
Commercial Al Mutarjim Al Arabey x x
Commercial Al-Wafi x x
Commercial Ambassador x x
Commercial Angusman’s Translator x
Commercial An-Nakel El-Arabi x x x x
Commercial Applied Language Solutions x x
Commercial BBN Technologies x
Commercial Golden al-Wafi x x
Commercial Google x x
Commercial IBM x
Commercial Interpret x
Commercial Johaina x
Commercial Language Weaver SMTS x x x x x x
Commercial LEC x x
Commercial LEC Passport Premium x x x x x x
Commercial LEC Translate DotNet x x x x x x
Commercial Maximum Edge x x
Commercial MITRE Corporation x
Commercial MutarjimNet x x
Commercial Sakhr Enterprise Translation x x
Commercial Systran x x
Commercial Tarjim x x
Commercial Transclick x x
Commercial Translate-Net x x x x
Commercial Translution x x
Commercial TranSphere x x
Commercial WebTrans x x
Commercial Windows Live Translator x x
Academic Fitchburg State College x
Academic Johns Hopkins University x
Academic Queen Mary University of London x
Academic RWTH University of Aachen x
Academic Technical University of Catalonia (UPC) x
Academic U.S. Army Research Laboratory x
Academic Université du Maine x
Academic University of Cambridge x
Academic University of Edinburgh x
Academic University of Maryland x
Academic University of Southern California, Information Science Institute x
A number of tools are also considered important by the community; these were also identified and are
listed herein:
Type Name
Morphological Analyzers: ArabMorpho
Xerox Arabic Morphological Analyser
Raramorph
Buckwalter Arabic Morphological Analyser
Sebawai
Morphological Analyser (CRL, New Mexico State University)
Stemmer: Al-Stem
Light10
Larkey
POS Tagger: ArabTagger
MorphTagger
Stanford Log-linear Part-Of-Speech Tagger
Brill's POS tagger for Arabic
Parser: Stanford Arabic Parser
Statistical Machine Translation Toolkit: Egypt
Syntactic Analyzer: Syntactic Analyser (Cimos)
Finally, ELDA identified a number of LRs that could be used within the project activities to
train the selected tools or to better tune them to Arabic and the given domains:
Type Name
Dictionaries: Al-Misbar
Al-Wafi Quick Dictionary
ATA-NTS
Babylon-Pro
FreeDict
LingvoSoft
Pan Images
Partner
PocketTran
TranslationBooth
WordPoint
Xpro7
ArabDictions
Sakhr Multilingual Dictionary
Parallel Corpora UN Bidirectional Multilingual (En, Fr, Ar, Ru, Zh)
UNESCO
Hebrew-Arabic-English corpus (Agava Institute)
EGYPT Gizza Toolkit Quran Parallel Corpus (Ar-En)
CLARA (Corpus Linguae Arabicae) (Ar-Cz)
Bilingual aligned corpora (Ar-It, ILC)
Umaah Arabic English Parallel News Text
Arabic-English Parallel Translation (LDC)
10k words AFP Arabic Newswire corpus translated into English (LDC)
Euradic (Ar-Fr)
E-A Parallel Corpus (University of Kuwait)
Bilingual Corpora Multiple Translation Arabic (LDC)
TDT4 Multilanguage Corpus (LDC)
Evaluation corpora Arcade II Evaluation Package
CESTA Evaluation Package
7.2. CLIR/MLIR findings
Regarding the CLIR and/or MLIR, ELDA has identified the following tools and resources. A
deep analysis of CLEF & TREC campaigns will be conducted to obtain more data.
Tools: Product
Text Search Engine Swift
Google
Yahoo
4Arabs
Ayna
Arabo
Yamli
MSN
Exalead
IDRISI (Sakhr)
URSA
MG System
Araby
Question Answering AQAS
Resources:
Monolingual Corpora Agence France Presse (LDC, ELRA)
Al-Hayat Arabic Corpus (ELRA)
An-Nahar Arabic corpus (ELRA)
Leuven Corpus
Nijmegen Corpus
DINAR corpus
General Scientific Arabic Corpus
Classical Arabic Corpus
SOTETEL
Corpus of Contemporary Arabic
Treebank Penn Arabic Treebank
Among the work to be done after this information collection is the selection of a short list of tool kits
and resources. This will be crucial as it impacts the whole project. After that the partners will have to
conduct all the tasks described above (customization, production of adequate LRs for training and
testing, etc.).
8. List of players
This is a directory of players involved in Arabic Human Language Technology activities and projects.
The information was collected from projects such as Oriental, NEMLAR, CESTA and MEDAR, as
well as from contributions to workshops and conferences on Arabic.
This first draft is a list of institutions and individual experts. A later version will elaborate on
their profiles and sectors of activity.
8.1. Institutions
Al-Ahlya Amman University – Faculty of Information Technology, Jordan
ACS TechnoCenter, Morocco
AlKhawarizmy, Egypt
AMRA Information Technology, West Bank & Gaza Strip
Arabic Textware, Jordan
Arabize, Egypt
Bank of Jordan, Jordan
Birzeit University – Birzeit Information Technology Unit (BIT) & Arabic Department, West Bank & Gaza Strip
Cairo Microsoft Innovation Center in Egypt (CMIC)
Catholic University Leuven (KUL), Belgium
CEA-LIST/DTSI/SRSI/Laboratoire d’ingénierie de l’information multimédia multilingue, France
Cimos, France
CJK Dictionary Institute, Japan
CNRS – Centre National de la Recherche Scientifique - Délégation Rhône-Alpes, Site Vallée du Rhône, France
Coltec, Egypt
Columbia University Center for Computational Learning Systems, USA
CRSTDLA (Scientific & technical Research Center for Arabic Language Development), Algeria
DEEC-FECU – Department of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, Egypt
ELDA, France
MLTC, Morocco
ENSIAS – University of Mohammed V Soussi - École Nationale Supérieure d'Informatique et d'Analyse des Systèmes, Morocco
ESLE – The Egyptian Society of Language Engineering, Egypt
European Trading & Technology (eurotec), West Bank & Gaza Strip
Faculty of Computers and Information, Cairo University, Egypt
FCIS – Faculty of Computer & Information Sciences, Egypt
France Telecom R&D ORANGE Labs, France
GIS Int., Israel
Hariri Canadian Academy of Sciences and Technologies, Canada
Higher Institute for Applied Sciences and Technology (HIAST), Syria
IBM Egypt’s branch, Egypt
ILSP – Institute for Language and Speech Processing, Greece
IMAGINET, Egypt
Indiana University, USA
Insan Center, Saudi Arabia
'Isra Software and Computer Co., West Bank & Gaza Strip
Institute for the Study and Research on Arabisation, Morocco
Istituto di Linguistica Computazionale – CNR – Italy
IT College / Birzeit University, West Bank & Gaza Strip
Jinny Paging company, Lebanon
King Abdulaziz City for Science and Technology, Saudi Arabia
Laboratoire de Recherche en Informatique et Télécommunications, Faculté des Sciences, Morocco
LibanCell company, Lebanon
Lyon2 – Université Lumière Lyon 2 Faculté des Langues, France
ManarahNet Modern Software Co., West Bank & Gaza Strip
Millenium Software S.A.L., Lebanon
Millennium Technology, Israel
RDI – The Engineering company for computer systems development, Egypt
Sakhr, Kuwait (& Egypt)
SOTETEL – Information Technology – Société Tunisienne d'Entreprises de Télécommunications, Tunisia
Systran, France
TALP Research Center - Universitat Politecnica de Catalunya, Spain
The Arab Academy for Sciences and Technology, Egypt
The Egyptian Society for the Arabisation of Science, Egypt
University of Maryland, College Park, United States
UOB – University of Balamand – Department of Computer Engineering, Lebanon
SDU – University of Southern Denmark, Denmark
Unit for Learning Innovation- Birzeit University, West Bank & Gaza Strip
University Cadi Ayyad - Faculty of Sciences, Morocco
Xerox Research Centre Europe, USA
8.2. Individuals
Ramzi Abbès, France
Abdelhamid El Jihad, Morocco
Abdelmajid Benhamadou, Tunisia
Muhammad Afeefi, Egypt
AlAli Kanan, West Bank & Gaza Strip
Fawaz Al-Anzi, Kuwait
Nashat Al-Aqtash, West Bank & Gaza Strip
Rami Al-Hajj Mohamad, Lebanon
Saleh Arar, West Bank & Gaza Strip
Ken Beesley, USA
Zied Ben Tahar, Tunisia
Aderrahim Benabbou, Morocco
Yassine Benajiba, Spain
Mohammed Benkhalifa, Morocco
Viktor Bielicky, Czech Republic
Christian Boitet, France
Malek Boualem, France
Karim Bouzoubaa, Morocco
Chris Brew, USA
Tim Buckwalter, USA
María J. Castro-Bleda, Spain
Violetta Cavalli-Sforza, USA (/Morocco)
Achraf Chalabi, Egypt
Noureddine Chenfour, Morocco
Gerard Chollet, France
Khalid Choukri, France
Christopher Cieri, USA
Daoud Maher Daoud, Jordan (formerly France)
Fathi Debili, France
Mona Diab, USA
Joseph Dichy, France
Everhard Ditters, Netherlands
Said El Hassani, Morocco
Bouyakhf El Houssine, Morocco
Khaled Elghamry, Egypt
Mohamed El-Mahallawy, Egypt
Salwa Elramly, Egypt
Ossama Emam, Egypt
Mohammed Erradi, Morocco
Salvador España-Boquera, Spain
Aly Fahmy, Egypt
Mohamed Waleed Fakhr, Egypt
Ali Farghaly, USA
Abdelkader Fassi-Fehri, Morocco (& UK)
Nagy Fatehy, Egypt
José A. R. Fonollosa, Spain
Jean-Luc Gauvain, France
Wasel Ghanem, West Bank & Gaza Strip
Antoine Ghaoui, Lebanon
Gregory Grefenstette, France
Ahmed Guessoum, UAE
Nizar Habash, USA
Lamia Hadrich Belguith, Tunisia
Jan Hajic, Czech Republic
Sonia Halimi, Switzerland
Salwa Hamada, Egypt
Isan Hamayel, West Bank & Gaza Strip
Abdelfattah Hamdani, Morocco
Mohamed Hassoun, France
Ihab Jabari, West Bank & Gaza Strip
Haddar Kais, France
Reem Kanjawi-Faraj, USA
Iveta Kourilova, Czech Republic
Jakub Kracmar, Czech Republic
Abouenour Lahcen, Morocco
Azzeddine Lazrek, Morocco
Mohamed Maamouri, USA
Bente Maegaard, Denmark
Abdel. Messaoudi, France
Outahajala Mohamed, Morocco
Emad Mohamed, USA
Chafic Mokbel, Lebanon
Abdelhak Mouradi, Morocco
Fiyad Odeh, West Bank & Gaza Strip
Martine Petrod, Denmark
Ghassan Qadan, West Bank & Gaza Strip
Tajj-eddine Rachidi, Morocco
Ahmed Ragheb, Egypt
Owen Rambow, USA
Mohsen Rashwan, Egypt
Horacio Rodríguez, Spain
Paul Roochnick, USA
Mike Rosner, Malta
Paolo Rosso, Spain
Salim Roukos, USA
Jean Senellart, France
Khaled Shaalan, UAE
Mohammed Shtayyah, West Bank & Gaza Strip
Otakar Smrz, Czech Republic
Abdelhadi Soudi, Morocco
Emna Souissi, Tunisia
Dekai Wu, Hong Kong
Mustafa Yaseen, Jordan
Tawfiq Yazidy, Morocco
Francisco Zamora-Martínez, Spain
Rached Zantout, Canada
9. Contributions to fulfilling the remaining "gaps" as defined by MEDAR
This survey has focused essentially on identifying the players, LRs and tools. The LRs and tools
are those that could be part of the BLARK for MT and CLIR/MLIR. The survey has identified a large
set of requested resources and a few available ones. The next important task is to list the LRs and
tools identified during this survey phase and to draw conclusions about which items are usable and
which are not. MEDAR will also prioritize these items according to the BLARK as defined by
NEMLAR, in terms of both importance and availability.
Although the BLARK concept was introduced to support pre-competitive activities by
researchers, developers, integrators, educators, etc., and not as a direct basis for commercial
applications, it is important to pave the way for several levels of systems with varying performance
and different requirements, provided this can be achieved with available resources and open source
systems. Our primary target is to specify and try to fulfill the requirements of pre-competitive R&D
activities that may indirectly lead to commercial products or services.
10. Appendix A: The NEMLAR REPORT
This report is available at:
http://www.medar.info/The_Nemlar_Project/Publications/NEMLAR-REPORT-SURVEY-FINAL_web.pdf