Most natural language processing systems use stemmer as a separate module in their architecture. Specially, it is very significant for developing, machine translator, speech recognizer and search engines. In this thesis work, a stemming system for Afan Oromo is presented. This system takes as input a word and removes its affixes according to a rule based algorithm. The result of the study is a prototype context sensitive iterative stemmer for Afan Oromo. Error counting technique was employed to evaluate the performance of this stemmer. The errors were analyzed and classified into two different categories: under stemming and over stemming errors. For testing purpose corpus which is collected from different public Afaan Oromo newspapers and bulletins is used. Newspapers, bulletins and public magazines are considered as consisting different issues of the community: social, economical, technological and political issues. This will reduce the probability of making the corpus biased toward some specific words that do not appear in everyday life. to make the testing set balanced. According to the evaluation of the experiments, it can be concluded that an overall accuracy of the stemmer is encouraging which shows stemming can be performed with low error rates in high inflected languages such as Afan Oromo.
An implementation of apertium based assamese morphological analyzerijnlc
Morphological Analysis is an important branch of linguistics for any Natural Language Processing Technology. Morphology studies the word structure and formation of word of a language. In current scenario of NLP research, morphological analysis techniques have become more popular day by day. For processing any language, morphology of the word should be first analyzed. Assamese language contains very complex morphological structure. In our work we have used Apertium based Finite-State-Transducers for developing morphological analyzer for Assamese Language with some limited domain and we get 72.7% accuracy
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...kevig
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appears in lines of anime or game characters. To overcome this challenge, we propose segmenting lines of Japanese anime or game characters using subword units that were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character and show that they are linguistic speech patterns that are specific for each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language which is the second most spoken language in Africa and do not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector designed operates essentially on Mijinguini’s dictionary and characteristics of the Hausa alphabet. After a brief review on spell checking and spell correcting techniques and the state of art in the Hausa language processing, we opted for the data structures trie and hash table to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The implementation of the spell corrector has been made on a special editor developed for that purpose (LyTexEditor) but also as an extension (add-on) for OpenOffice.org. A comparison was made on the performance of the two data structures used.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
An implementation of apertium based assamese morphological analyzerijnlc
Morphological Analysis is an important branch of linguistics for any Natural Language Processing Technology. Morphology studies the word structure and formation of word of a language. In current scenario of NLP research, morphological analysis techniques have become more popular day by day. For processing any language, morphology of the word should be first analyzed. Assamese language contains very complex morphological structure. In our work we have used Apertium based Finite-State-Transducers for developing morphological analyzer for Assamese Language with some limited domain and we get 72.7% accuracy
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...kevig
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appears in lines of anime or game characters. To overcome this challenge, we propose segmenting lines of Japanese anime or game characters using subword units that were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character and show that they are linguistic speech patterns that are specific for each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
In this article, a spell corrector has been designed for the Hausa language which is the second most spoken language in Africa and do not yet have processing tools. This study is a contribution to the automatic processing of the Hausa language. We used existing techniques for other languages and adapted them to the special case of the Hausa language. The corrector designed operates essentially on Mijinguini’s dictionary and characteristics of the Hausa alphabet. After a brief review on spell checking and spell correcting techniques and the state of art in the Hausa language processing, we opted for the data structures trie and hash table to represent the dictionary. The edit distance and the specificities of the Hausa alphabet have been used to detect and correct spelling errors. The implementation of the spell corrector has been made on a special editor developed for that purpose (LyTexEditor) but also as an extension (add-on) for OpenOffice.org. A comparison was made on the performance of the two data structures used.
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATIONkevig
Phonetic typing using the English alphabet has become widely popular nowadays for social media and chat services. As a result, a text containing various English and Bangla words and phrases has become increasingly common. Existing transliteration tools display poor performance for such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetic typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm superiority of THT as it significantly outperforms the benchmark transliteration tool.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Phonetic Dictionary for Natural Language Processing: KannadaIJERA Editor
India has 22 officially recognized languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Clearly, India owns the language diversity problem. In the age of Internet, the multiplicity of languages makes it even more necessary to have sophisticated Systems for Natural Language Process. In this paper we are developing the phonetic dictionary for natural language processing particularly for Kannada. Phonetics is the scientific study of speech sounds. Acoustic phonetics studies the physical properties of sounds and provides a language to distinguish one sound from another in quality and quantity. Kannada language is one of the major Dravidian languages of India. The language uses forty nine phonemic letters, divided into three groups: Swaragalu (thirteen letters); Yogavaahakagalu (two letters); and Vyanjanagalu (thirty-four letters), similar to the vowels and consonants of English, respectively.
Transliteration by orthography or phonology for hindi and marathi to english ...ijnlc
e-Governance and Web based online commercial multilingual applications has given utmost importance to
the task of translation and transliteration. The Named Entities and Technical Terms occur in the source
language of translation are called out of vocabulary words as they are not available in the multilingual
corpus or dictionary used to support translation process. These Named Entities and Technical Terms need
to be transliterated from source language to target language without losing their phonetic properties. The
fundamental problem in India is that there is no set of rules available to write the spellings in English for
Indian languages according to the linguistics. People are writing different spellings for the same name at
different places. This fact certainly affects the Top-1 accuracy of the transliteration and in turn the
translation process. Major issue noticed by us is the transliteration of named entities consisting three
syllables or three phonetic units in Hindi and Marathi languages where people use mixed approach to
write the spelling either by orthographical approach or by phonological approach. In this paper authors
have provided their opinion through experimentation about appropriateness of either approach.
Speech is the important mode of communication and is the current research
topic. The concentration is mostly focused on synthesis and analyzing part.Apart of
synthesizing, text to speech system is developed.Speech synthesis is an artificial production
of human speech.A text to speech system (TTS) is to convert an arbitrary text into speech.In
India different languages have been spoken each being the mother tongue of tens of millions
of people.In this paper,the text to speech system is primarily developed for Telugu, a
Dravidian language predominantly spoken in Indian state of Andhra Pradesh.The
important qualities expected from this system are naturalness and intelligibility.Telugu TTS
can be developed using other synthesis methods like articulatory synthesis,formant synthesis
and concatenative synthesis.This paper describes a development of a Telugu text to speech
system using concatenative synthesis method on mobile based system OMAP 3530 (ARM
Cortex A-8 core) in Linux.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
Manipuri is both a minority and morphologically rich language with genetic features similar to Tibeto Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language
Processing (NLP) is a useful research field of computer science that deals with processing of a large amount of natural language corpus. The NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech
(POS) Tagging, Machine Translation (MT), Word Net, Word Sense Disambiguation (WSD) etc. In this paper, we present a study on the advancements in NLP applications for Manipuri language, at the same time presenting a comparison table of the approaches and techniques adopted and the results obtained of each of the applications followed by a detail discussion of each work.
Implementation of Marathi Language Speech Databases for Large Dictionaryiosrjce
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag mentions the word’s
usage in the sentence. Usually, these tags indicate syntactic classification like noun or verb, and sometimes
include additional information, with case markers (number, gender etc) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches usually followed in Parts of Speech Tagging. Those are Rule based
Approach and Stochastic Approach. Rule based Approach use predefined handwritten rules. This is the
oldest approach and it use lexicon or dictionary for reference. Stochastic Approach use probabilistic and
statistical information to assign tag to words. It use large corpus, so that Time complexity and Space
complexity is high whereas Rule base approach has less complexity for both Time and Space. Stochastic
Approach is the widely used one nowadays because of its accuracy.
Malayalam is a Dravidian family of languages, inflectional with suffixes with the root word forms. The
currently used Algorithms are efficient Machine Learning Algorithms but these are not built for
Malayalam. So it affects the accuracy of the result of Malayalam POS Tagging.
My proposed Approach use Dictionary entries along with adjacent tag information. This algorithm use
Multithreaded Technology. Here tagging done with the probability of the occurrence of the sentence
structure along with the dictionary entry.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEkevig
Manipuri is both a minority and morphologically rich language with genetic features similar to Tibeto Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and ismonosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with processing of a large amount of natural language corpus. The NLP applications encompass E-Dictionary, Morphological
Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), Word Net, Word Sense Disambiguation (WSD) etc. In this paper, we present a study on the advancements in NLP applications for Manipuri language, at the same time presenting a comparison table of the approaches and techniques adopted and the results obtained of each of the applications followed by a detail discussion of each work.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ...ijma
In this work a new Bangla speech corpus along with proper transcriptions has been developed; also
various acoustic feature extraction methods have been investigated using Long Short-Term Memory
(LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition
system. The acoustic features are usually a sequence of representative vectors that are extracted from
speech signals and the classes are either words or sub word units such as phonemes. The most commonly
used feature extraction method, known as linear predictive coding (LPC), has been used first in this work.
Then the other two popular methods, namely, the Mel frequency cepstral coefficients (MFCC) and
perceptual linear prediction (PLP) have also been applied. These methods are based on the models of the
human auditory system. A detailed review of the implementation of these methods have been described
first. Then the steps of the implementation have been elaborated for the development of an automatic
speech recognition system (ASR) for Bangla speech.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation. It conflates all the word variants to a common form called as stem. It plays significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering system, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing to be done. Stemming is one of the important building blocks for all these applications. This paper, presents an overview of various stemming techniques, evaluation criteria for stemmers and various existing stemmers for Indic languages.
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specie to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
Identification of prosodic features of punjabi for enhancing the pronunciatio...ijnlc
Voice browsing requires speech interface framework. Pronunciation Lexicon Specification (PLS) 1.0 is a recommendation of Voice Browser Working Group of W3C (World-Wide Web Consortium), a machine-readable specification of pronunciation information which can be used for speech technology development. This global PLS standard is applicable across European and Asian languages and this specification is extendable to all human languages. However, it currently does not cover morphological, syntactic and semantic information associated with pronunciations. In Indian languages, grammatical information is relatively encoded in its morphology, than syntax unlike English where the grammatical information is an integral part of syntax. In this paper, PLS 1.0 has been examined from the perspective of augmentation of prosodic features of Punjabi such as tone, germination etc.
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Phonetic Dictionary for Natural Language Processing: KannadaIJERA Editor
India has 22 officially recognized languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Clearly, India owns the language diversity problem. In the age of Internet, the multiplicity of languages makes it even more necessary to have sophisticated Systems for Natural Language Process. In this paper we are developing the phonetic dictionary for natural language processing particularly for Kannada. Phonetics is the scientific study of speech sounds. Acoustic phonetics studies the physical properties of sounds and provides a language to distinguish one sound from another in quality and quantity. Kannada language is one of the major Dravidian languages of India. The language uses forty nine phonemic letters, divided into three groups: Swaragalu (thirteen letters); Yogavaahakagalu (two letters); and Vyanjanagalu (thirty-four letters), similar to the vowels and consonants of English, respectively.
Transliteration by orthography or phonology for hindi and marathi to english ...ijnlc
e-Governance and Web based online commercial multilingual applications has given utmost importance to
the task of translation and transliteration. The Named Entities and Technical Terms occur in the source
language of translation are called out of vocabulary words as they are not available in the multilingual
corpus or dictionary used to support translation process. These Named Entities and Technical Terms need
to be transliterated from source language to target language without losing their phonetic properties. The
fundamental problem in India is that there is no set of rules available to write the spellings in English for
Indian languages according to the linguistics. People are writing different spellings for the same name at
different places. This fact certainly affects the Top-1 accuracy of the transliteration and in turn the
translation process. Major issue noticed by us is the transliteration of named entities consisting three
syllables or three phonetic units in Hindi and Marathi languages where people use mixed approach to
write the spelling either by orthographical approach or by phonological approach. In this paper authors
have provided their opinion through experimentation about appropriateness of either approach.
Speech is the important mode of communication and is the current research
topic. The concentration is mostly focused on synthesis and analyzing part.Apart of
synthesizing, text to speech system is developed.Speech synthesis is an artificial production
of human speech.A text to speech system (TTS) is to convert an arbitrary text into speech.In
India different languages have been spoken each being the mother tongue of tens of millions
of people.In this paper,the text to speech system is primarily developed for Telugu, a
Dravidian language predominantly spoken in Indian state of Andhra Pradesh.The
important qualities expected from this system are naturalness and intelligibility.Telugu TTS
can be developed using other synthesis methods like articulatory synthesis,formant synthesis
and concatenative synthesis.This paper describes a development of a Telugu text to speech
system using concatenative synthesis method on mobile based system OMAP 3530 (ARM
Cortex A-8 core) in Linux.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
Manipuri is both a minority and morphologically rich language with genetic features similar to Tibeto Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language
Processing (NLP) is a useful research field of computer science that deals with processing of a large amount of natural language corpus. The NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech
(POS) Tagging, Machine Translation (MT), Word Net, Word Sense Disambiguation (WSD) etc. In this paper, we present a study on the advancements in NLP applications for Manipuri language, at the same time presenting a comparison table of the approaches and techniques adopted and the results obtained of each of the applications followed by a detail discussion of each work.
Implementation of Marathi Language Speech Databases for Large Dictionaryiosrjce
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag mentions the word’s
usage in the sentence. Usually, these tags indicate syntactic classification like noun or verb, and sometimes
include additional information, with case markers (number, gender etc) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches usually followed in Parts of Speech Tagging. Those are Rule based
Approach and Stochastic Approach. Rule based Approach use predefined handwritten rules. This is the
oldest approach and it use lexicon or dictionary for reference. Stochastic Approach use probabilistic and
statistical information to assign tag to words. It use large corpus, so that Time complexity and Space
complexity is high whereas Rule base approach has less complexity for both Time and Space. Stochastic
Approach is the widely used one nowadays because of its accuracy.
Malayalam is a Dravidian family of languages, inflectional with suffixes with the root word forms. The
currently used Algorithms are efficient Machine Learning Algorithms but these are not built for
Malayalam. So it affects the accuracy of the result of Malayalam POS Tagging.
My proposed Approach use Dictionary entries along with adjacent tag information. This algorithm use
Multithreaded Technology. Here tagging done with the probability of the occurrence of the sentence
structure along with the dictionary entry.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEkevig
Manipuri is both a minority and morphologically rich language with genetic features similar to Tibeto Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and ismonosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with processing of a large amount of natural language corpus. The NLP applications encompass E-Dictionary, Morphological
Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), Word Net, Word Sense Disambiguation (WSD) etc. In this paper, we present a study on the advancements in NLP applications for Manipuri language, at the same time presenting a comparison table of the approaches and techniques adopted and the results obtained of each of the applications followed by a detail discussion of each work.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ...ijma
In this work a new Bangla speech corpus along with proper transcriptions has been developed; also
various acoustic feature extraction methods have been investigated using Long Short-Term Memory
(LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition
system. The acoustic features are usually a sequence of representative vectors that are extracted from
speech signals and the classes are either words or sub word units such as phonemes. The most commonly
used feature extraction method, known as linear predictive coding (LPC), has been used first in this work.
Then the other two popular methods, namely, the Mel frequency cepstral coefficients (MFCC) and
perceptual linear prediction (PLP) have also been applied. These methods are based on the models of the
human auditory system. A detailed review of the implementation of these methods have been described
first. Then the steps of the implementation have been elaborated for the development of an automatic
speech recognition system (ASR) for Bangla speech.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation. It conflates all the word variants to a common form called as stem. It plays significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering system, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing to be done. Stemming is one of the important building blocks for all these applications. This paper, presents an overview of various stemming techniques, evaluation criteria for stemmers and various existing stemmers for Indic languages.
Grapheme-To-Phoneme Tools for the Marathi Speech SynthesisIJERA Editor
We describe in detail a Grapheme-to-Phoneme (G2P) converter required for the development of a good quality
Marathi Text-to-Speech (TTS) system. The Festival and Festvox framework is chosen for developing the
Marathi TTS system. Since Festival does not provide complete language processing support specie to various
languages, it needs to be augmented to facilitate the development of TTS systems in certain new languages.
Because of this, a generic G2P converter has been developed. In the customized Marathi G2P converter, we
have handled schwa deletion and compound word extraction. In the experiments carried out to test the Marathi
G2P on a text segment of 2485 words, 91.47% word phonetisation accuracy is obtained. This Marathi G2P has
been used for phonetising large text corpora which in turn is used in designing an inventory of phonetically rich
sentences. The sentences ensured a good coverage of the phonetically valid di-phones using only 1.3% of the
complete text corpora.
Identification of prosodic features of punjabi for enhancing the pronunciatio...ijnlc
Voice browsing requires speech interface framework. Pronunciation Lexicon Specification (PLS) 1.0 is a recommendation of Voice Browser Working Group of W3C (World-Wide Web Consortium), a machine-readable specification of pronunciation information which can be used for speech technology development. This global PLS standard is applicable across European and Asian languages and this specification is extendable to all human languages. However, it currently does not cover morphological, syntactic and semantic information associated with pronunciations. In Indian languages, grammatical information is relatively encoded in its morphology, than syntax unlike English where the grammatical information is an integral part of syntax. In this paper, PLS 1.0 has been examined from the perspective of augmentation of prosodic features of Punjabi such as tone, germination etc.
Designing A Rule Based Stemming Algorithm for Kambaata Language TextCSCJournals
Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single pass, context-sensitive, and longest-matching designed by adapting rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology and word formations mostly relying on suffixation; even though its word formation involves infixation, compounding and reduplication as well.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation.
The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
Designing A Rule Based Stemming Algorithm for Kambaata Language TextCSCJournals
Stemming is the process of reducing inflectional and derivational variants of a word to its stem. It has substantial importance in several natural language processing applications. In this research, a rule based stemming algorithm that conflates Kambaata word variants has been designed for the first time. The algorithm is a single pass, context-sensitive, and longest-matching designed by adapting rule-based stemming approach. Several studies agree that Kambaata is a strictly suffixing language with a rich morphology and word formations mostly relying on suffixation; even though its word formation involves infixation, compounding and reduplication as well.
The output of this study is a context-sensitive, longest-match stemming algorithm for Kambaata words. To evaluate the stemmer's effectiveness, error counting method was applied. A test set of 2425 distinct words was used to evaluate the stemmer. The output from the stemmer indicates that out of 2425 words, 2349 words (96.87%) were stemmed correctly, 63 words (2.60%) were over stemmed and 13 words (0.54%) were under stemmed. What is more, a dictionary reduction of 65.86% has also been achieved during evaluation.
The main factor for errors in stemming Kambaata words is the language's rich and complex morphology. Hence a number of errors can be corrected by exploring more rules. However, it is difficult to avoid the errors completely due to complex morphology that makes use of concatenated suffixes, irregularities through infixation, compounding, blending, and reduplication of affixes.
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMSIJMIT JOURNAL
In the context of Information Retrieval, Arabic stemming algorithms have become a most research area of
information retrieval. Many researchers have developed algorithms to solve the problem of stemming.
Each researcher proposed his own methodology and measurements to test the performance and compute
the accuracy of his algorithm. Thus, nobody can make accurate comparisons between these algorithms.
Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper.
Then, the main Arabic language characteristics that are necessary to be mentioned before discussing
Arabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabic
stemming algorithm is still one of the most information retrieval challenges. This paper aims to compare
the most of the commonly used light stemmers in terms of affixes lists, algorithms, main ideas, and
information retrieval performance. The results show that the light10 stemmer outperformed the other
stemmers. Finally, recommendations for future research regarding the development of a standard Arabic
stemmer were presented.
In this paper, a novel hierarchical Persian Stemming approach based on the Part-Of-Speech (POS) of the
word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic
finite automata (DFA) in its different levels of hierarchy for removing the prefixes and suffixes of the
words. We had two intentions in using hash tables in our method. The first one is that the DFA don’t
support some special words, so hash table can partly solve the addressed problem. And the second goal is
to speed up the implemented stemmer with omitting the time that DFA need. Because of the hierarchical
organization, this method is fast and flexible enough. Our experiments on test sets from Hamshahri
Collection and Security News from ICTna.ir Site show that our method has the average accuracy of
95.37% which is even improved in using the method on a test set with common topics.
This paper presents an unsupervised approach for the development of a stemmer (For the case of Urdu &
Marathi language). Especially, during last few years, a wide range of information in Indian regional
languages has been made available on web in the form of e-data. But the access to these data repositories
is very low because the efficient search engines/retrieval systems supporting these languages are very
limited. Hence automatic information processing and retrieval is become an urgent requirement. To train
the system training dataset, taken from CRULP [22] and Marathi corpus [23] are used. For generating
suffix rules two different approaches, namely, frequency based stripping and length based stripping have
been proposed. The evaluation has been made on 1200 words extracted from the Emille corpus. The
experiment results shows that in the case of Urdu language the frequency based suffix generation approach gives the maximum accuracy of 85.36% whereas Length based suffix stripping algorithm gives maximum accuracy of 79.76%. In the case of Marathi language the systems gives 63.5% accuracy in the case of frequency based stripping and achieves maximum accuracy of 82.5% in the case of length based suffix stripping algorithm.
A Word Stemming Algorithm for Hausa Languageiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
An expert system for automatic reading of a text written in standard arabicijnlc
In this work we present our expert system of Automatic reading or speech synthesis based on a text
written in Standard Arabic, our work is carried out in two great stages: the creation of the sound data
base, and the transformation of the written text into speech (Text To Speech TTS). This transformation is
done firstly by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text with
the aim of transforming it into his corresponding phonetics sequence, and secondly by the generation of
the voice signal which corresponds to the chain transcribed. We spread out the different of conception of
the system, as well as the results obtained compared to others works studied to realize TTS based on
Standard Arabic.
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
The origin of route of text mining is the process of stemming. It is usually used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM) including Text Categorization (TC), Text Summarization (TS). Establish a stemmer effective for the language of Gujarati has been always a search domain hot since the Gujarati has a very different structure and difficult that the other language due to the rich morphology.
Arabic morphology encapsulates many valuable features such as word’s root. Arabic roots are beingutilized for many tasks; the process of extracting a word’s root is referred to as stemming. Stemming is anessential part of most Natural Language Processing tasks, especially for derivative languages such asArabic. However, stemming is faced with the problem of ambiguity, where two or more roots could beextracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model. It captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate itseffectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with a at least 9.4%improvement over other stemmers.
STANDARD ARABIC VERBS INFLECTIONS USING NOOJ PLATFORMijnlc
This article describes the morphological analysis of a standard Arabic natural language processing, as a
part of an electronic dictionary-constricting phase. A fully 3-lettered inflected verbs model are formalized
based on a linguistic classification, using NOOJ platform, the classification gives certain representative
verbs that will considered as lemmas, this verbs form our dictionary entries, they are also conjugated
according to our inflection paradigm relying on certain specific morphological properties. This dictionary
will be considered as an Arabic resource, which will help NLP applications and NOOJ platform to analyse
sophisticated Arabic corpora.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. Performance of an ASR system mainly depends on the availability of large corpus of speech. The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires large speech corpus with sentence or phoneme level transcription of the speech utterances. The transcriptions must also include different speech order so that the recognizer can build models for all the sounds present. But, for Telugu language, because of its complex nature, a very large, well annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands and millions of word forms. A significant part of grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases including several words (that is, tokens) in English would be mapped on to a single word in Telugu.Telugu language is phonetic in nature in addition to rich in morphology. That is why the speech technology developed for English cannot be applied to Telugu language. This paper highlights the work carried out in an attempt to build a voice enabled text editor with capability of automatic term suggestion. Main claim of the paper is the recognition enhancement process developed by us for suitability of highly inflecting, rich morphological languages. This method results in increased speech recognition accuracy with very much reduction in corpus size. It also adapts Telugu words to the database dynamically, resulting in growth of the corpus.
Deterministic Finite State Automaton of Arabic Verb System: A Morphological S...CSCJournals
Finite State Morphology serves as an important tool for investigators of natural language processing. Morphological Analysis forms an essential preprocessing step in natural language processing. This paper discusses the morphological analysis and processing of verb forms in Arabic. It focuses on the inflected verb forms and discusses the perfective, imperfective and imperatives. The deterministic finite state morphological parser for the verb forms can deal with Morphological and orthographic features of Arabic and the morphological processes which are involved in Arabic verb formation and conjugation. We use this model to generate and add all the necessary information (prefix, suffix, stem, etc.) to each morpheme of the words; so we need subtags for each morpheme. Using Finite State tool to build the computational lexicon that are usually structured with a list of the stems and affixes of the language together with a representation that tells us how words can be structured together and how the network of all forms can be represented.
Using automated lexical resources in arabic sentence subjectivityijaia
A common point in almost any work on Sentiment analysis is the need to identify which elements of
language (words) contribute to express the subjectivity in text. Collecting of these elements (sentiment
words) regardless the context with their polarities (positive/negative) is called sentiment lexical resources
or subjective lexicon. In this paper, we investigate the method for generating Sentiment Arabic lexical
Semantic Database by using lexicon based approach. Also, we study the prior polarity effects of each word
using our Sentiment Arabic Lexical Semantic Database on the sentence-level subjectivity and multiple
machine learning algorithms. The experiments were conducted on MPQA corpus containing subjective and
objective sentences of Arabic language, and we were able to achieve 76.1 % classification accuracy.
XMODEL: An XML-based Morphological Analyzer for Arabic LanguageWaqas Tariq
Morphological analysis is an essential stage in language engineering applications. For the Arabic language, this stage is not easy to develop because the Arabic language has some particularities such as the phenomena of agglutination and a lot of morphological ambiguity phenomenon. These reasons make the design of the morphological analyzer for Arabic somewhat difficult and require lots of other tools and treatments. The volume of the lexicon is another big problem of the morphological analysis of the Arabic Language which affects directly the process of the analyzing. In this paper we present a Morphological Analyzer for Modern Standard Arabic based on Arabic Morphological Automaton technique and using a new and innovative language (XMODEL) to represent the Arabic morphological knowledge in an optimal way. Both the Arabic Morphological Analyzer and Arabic Morphological Automaton are implemented in Java language and used XML technology. Buckwalter Arabic Morphological Analyzer and Xerox Arabic Finite State Morphology are two of the best known morphological analyzers for Modern Standard Arabic and they are also available and documented. Our Morphological Analyzer can be exploited by Natural Language Processing (NLP) applications such as machine translation, orthographical correction, information retrieval and both syntactic and semantic analyzers. At the end, an evaluation of Xerox and our system is done.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...gerogepatton
Many automatic translation works have been addressed between major European language pairs, by taking advantage of large scale parallel corpora, but very few research works are conducted on the Amharic-Arabic language pair due to its parallel data scarcity. However, there is no benchmark parallel Amharic-Arabic text corpora available for Machine Translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation of Amharic language text corpora available on Tanzile. Experiments are carried out on Two Long ShortTerm Memory (LSTM) and Gated Recurrent Units (GRU) based Neural Machine Translation (NMT) using Attention-based Encoder-Decoder architecture which is adapted from the open-source OpenNMT system. LSTM and GRU based NMT models and Google Translation system are compared and found that LSTM based OpenNMT outperforms GRU based OpenNMT and Google Translation system, with a BLEU score of 12%, 11%, and 6% respectively.
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...gerogepatton
Many automatic translation works have been addressed between major European language pairs, by
taking advantage of large scale parallel corpora, but very few research works are conducted on the
Amharic-Arabic language pair due to its parallel data scarcity. However, there is no benchmark parallel
Amharic-Arabic text corpora available for Machine Translation task. Therefore, a small parallel Quranic
text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation
of Amharic language text corpora available on Tanzile. Experiments are carried out on Two Long ShortTerm Memory (LSTM) and Gated Recurrent Units (GRU) based Neural Machine Translation (NMT) using
Attention-based Encoder-Decoder architecture which is adapted from the open-source OpenNMT system.
LSTM and GRU based NMT models and Google Translation system are compared and found that LSTM
based OpenNMT outperforms GRU based OpenNMT and Google Translation system, with a BLEU score
of 12%, 11%, and 6% respectively.
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
Many automatic translation works have been addressed between major European language pairs, by taking advantage of large scale parallel corpora, but very few research works are conducted on the Amharic-Arabic language pair due to its parallel data scarcity. However, there is no benchmark parallel Amharic-Arabic text corpora available for Machine Translation task. Therefore, a small parallel Quranic text corpus is constructed by modifying the existing monolingual Arabic text and its equivalent translation of Amharic language text corpora available on Tanzile. Experiments are carried out on Two Long ShortTerm Memory (LSTM) and Gated Recurrent Units (GRU) based Neural Machine Translation (NMT) using Attention-based Encoder-Decoder architecture which is adapted from the open-source OpenNMT system. LSTM and GRU based NMT models and Google Translation system are compared and found that LSTM based OpenNMT outperforms GRU based OpenNMT and Google Translation system, with a BLEU score of 12%, 11%, and 6% respectively
Similar to Designing a Rule Based Stemmer for Afaan Oromo Text (20)
The Use of Java Swing’s Components to Develop a WidgetWaqas Tariq
Widget is a kind of application provides a single service such as a map, news feed, simple clock, battery-life indicators, etc. This kind of interactive software object has been developed to facilitate user interface (UI) design. A user interface (UI) function may be implemented using different widgets with the same function. In this article, we present the widget as a platform that is generally used in various applications, such as in desktop, web browser, and mobile phone. We also describe a visual menu of Java Swing’s components that will be used to establish widget. It will assume that we have successfully compiled and run a program that uses Swing components.
3D Human Hand Posture Reconstruction Using a Single 2D ImageWaqas Tariq
Passive sensing of the 3D geometric posture of the human hand has been studied extensively over the past decade. However, these research efforts have been hampered by the computational complexity caused by inverse kinematics and 3D reconstruction. In this paper, our objective focuses on 3D hand posture estimation based on a single 2D image with aim of robotic applications. We introduce the human hand model with 27 degrees of freedom (DOFs) and analyze some of its constraints to reduce the DOFs without any significant degradation of performance. A novel algorithm to estimate the 3D hand posture from eight 2D projected feature points is proposed. Experimental results using real images confirm that our algorithm gives good estimates of the 3D hand pose. Keywords: 3D hand posture estimation; Model-based approach; Gesture recognition; human- computer interface; machine vision.
Camera as Mouse and Keyboard for Handicap Person with Troubleshooting Ability...Waqas Tariq
Camera mouse has been widely used for handicap person to interact with computer. The utmost important of the use of camera mouse is must be able to replace all roles of typical mouse and keyboard. It must be able to provide all mouse click events and keyboard functions (include all shortcut keys) when it is used by handicap person. Also, the use of camera mouse must allow users troubleshooting by themselves. Moreover, it must be able to eliminate neck fatigue effect when it is used during long period. In this paper, we propose camera mouse system with timer as left click event and blinking as right click event. Also, we modify original screen keyboard layout by add two additional buttons (button “drag/ drop” is used to do drag and drop of mouse events and another button is used to call task manager (for troubleshooting)) and change behavior of CTRL, ALT, SHIFT, and CAPS LOCK keys in order to provide shortcut keys of keyboard. Also, we develop recovery method which allows users go from camera and then come back again in order to eliminate neck fatigue effect. The experiments which involve several users have been done in our laboratory. The results show that the use of our camera mouse able to allow users do typing, left and right click events, drag and drop events, and troubleshooting without hand. By implement this system, handicap person can use computer more comfortable and reduce the dryness of eyes.
A Proposed Web Accessibility Framework for the Arab DisabledWaqas Tariq
The Web is providing unprecedented access to information and interaction for people with disabilities. This paper presents a Web accessibility framework which offers the ease of the Web accessing for the disabled Arab users and facilitates their lifelong learning as well. The proposed framework system provides the disabled Arab user with an easy means of access using their mother language so they don’t have to overcome the barrier of learning the target-spoken language. This framework is based on analyzing the web page meta-language, extracting its content and reformulating it in a suitable format for the disabled users. The basic objective of this framework is supporting the equal rights of the Arab disabled people for their access to the education and training with non disabled people. Key Words : Arabic Moon code, Arabic Sign Language, Deaf, Deaf-blind, E-learning Interactivity, Moon code, Web accessibility , Web framework , Web System, WWW.
Real Time Blinking Detection Based on Gabor FilterWaqas Tariq
New method of blinking detection is proposed. The utmost important of blinking detections method is robust against different users, noise, and also change of eye shape. In this paper, we propose blinking detections method by measuring of distance between two arcs of eye (upper part and lower part). We detect eye arcs by apply Gabor filter onto eye image. As we know that Gabor filter has advantage on image processing application since it able to extract spatial localized spectral features, such line, arch, and other shape are more easily detected. After two of eye arcs are detected, we measure the distance between both by using connected labeling method. The open eye is marked by the distance between two arcs is more than threshold and otherwise, the closed eye is marked by the distance less than threshold. The experiment result shows that our proposed method robust enough against different users, noise, and eye shape changes with perfectly accuracy.
Computer Input with Human Eyes-Only Using Two Purkinje Images Which Works in ...Waqas Tariq
A method for computer input with human eyes-only using two Purkinje images which works in a real time basis without calibration is proposed. Experimental results shows that cornea curvature can be estimated by using two light sources derived Purkinje images so that no calibration for reducing person-to-person difference of cornea curvature. It is found that the proposed system allows usersf movements of 30 degrees in roll direction and 15 degrees in pitch direction utilizing detected face attitude which is derived from the face plane consisting three feature points on the face, two eyes and nose or mouth. Also it is found that the proposed system does work in a real time basis.
Toward a More Robust Usability concept with Perceived Enjoyment in the contex...Waqas Tariq
Mobile multimedia service is relatively new but has quickly dominated people¡¯s lives, especially among young people. To explain this popularity, this study applies and modifies the Technology Acceptance Model (TAM) to propose a research model and conduct an empirical study. The goal of study is to examine the role of Perceived Enjoyment (PE) and what determinants can contribute to PE in the context of using mobile multimedia service. The result indicates that PE is influencing on Perceived Usefulness (PU) and Perceived Ease of Use (PEOU) and directly Behavior Intention (BI). Aesthetics and flow are key determinants to explain Perceived Enjoyment (PE) in mobile multimedia usage.
Collaborative Learning of Organisational KnolwedgeWaqas Tariq
This paper presents recent research into methods used in Australian Indigenous Knowledge sharing and looks at how these can support the creation of suitable collaborative envi- ronments for timely organisational learning. The protocols and practices as used today and in the past by Indigenous communities are presented and discussed in relation to their relevance to a personalised system of knowledge sharing in modern organisational cultures. This research focuses on user models, knowledge acquisition and integration of data for constructivist learning in a networked repository of or- ganisational knowledge. The data collected in the repository is searched to provide collections of up-to-date and relevant material for training in a work environment. The aim is to improve knowledge collection and sharing in a team envi- ronment. This knowledge can then be collated into a story or workflow that represents the present knowledge in the organisation.
Our research aims to propose a global approach for specification, design and verification of context awareness Human Computer Interface (HCI). This is a Model Based Design approach (MBD). This methodology describes the ubiquitous environment by ontologies. OWL is the standard used for this purpose. The specification and modeling of Human-Computer Interaction are based on Petri nets (PN). This raises the question of representation of Petri nets with XML. We use for this purpose, the standard of modeling PNML. In this paper, we propose an extension of this standard for specification, generation and verification of HCI. This extension is a methodological approach for the construction of PNML with Petri nets. The design principle uses the concept of composition of elementary structures of Petri nets as PNML Modular. The objective is to obtain a valid interface through verification of properties of elementary Petri nets represented with PNML.
Development of Sign Signal Translation System Based on Altera’s FPGA DE2 BoardWaqas Tariq
The main aim of this paper is to build a system that is capable of detecting and recognizing the hand gesture in an image captured by using a camera. The system is built based on Altera’s FPGA DE2 board, which contains a Nios II soft core processor. Image processing techniques and a simple but effective algorithm are implemented to achieve this purpose. Image processing techniques are used to smooth the image in order to ease the subsequent processes in translating the hand sign signal. The algorithm is built for translating the numerical hand sign signal and the result are displayed on the seven segment display. Altera’s Quartus II, SOPC Builder and Nios II EDS software are used to construct the system. By using SOPC Builder, the related components on the DE2 board can be interconnected easily and orderly compared to traditional method that requires lengthy source code and time consuming. Quartus II is used to compile and download the design to the DE2 board. Then, under Nios II EDS, C programming language is used to code the hand sign translation algorithm. Being able to recognize the hand sign signal from images can helps human in controlling a robot and other applications which require only a simple set of instructions provided a CMOS sensor is included in the system.
An overview on Advanced Research Works on Brain-Computer InterfaceWaqas Tariq
A brain–computer interface (BCI) is a proficient result in the research field of human- computer synergy, where direct articulation between brain and an external device occurs resulting in augmenting, assisting and repairing human cognitive. Advanced works like generating brain-computer interface switch technologies for intermittent (or asynchronous) control in natural environments or developing brain-computer interface by Fuzzy logic Systems or by implementing wavelet theory to drive its efficacies are still going on and some useful results has also been found out. The requirements to develop this brain machine interface is also growing day by day i.e. like neuropsychological rehabilitation, emotion control, etc. An overview on the control theory and some advanced works on the field of brain machine interface are shown in this paper.
Exploring the Relationship Between Mobile Phone and Senior Citizens: A Malays...Waqas Tariq
There is growing ageing phenomena with the rise of ageing population throughout the world. According to the World Health Organization (2002), the growing ageing population indicates 694 million, or 223% is expected for people aged 60 and over, since 1970 and 2025.The growth is especially significant in some advanced countries such as North America, Japan, Italy, Germany, United Kingdom and so forth. This growing older adult population has significantly impact the social-culture, lifestyle, healthcare system, economy, infrastructure and government policy of a nation. However, there are limited research studies on the perception and usage of a mobile phone and its service for senior citizens in a developing nation like Malaysia. This paper explores the relationship between mobile phones and senior citizens in Malaysia from the perspective of a developing country. We conducted an exploratory study using contextual interviews with 5 senior citizens of how they perceive their mobile phones. This paper reveals 4 interesting themes from this preliminary study, in addition to the findings of the desirable mobile requirements for local senior citizens with respect of health, safety and communication purposes. The findings of this study bring interesting insight to local telecommunication industries as a whole, and will also serve as groundwork for more in-depth study in the future.
Principles of Good Screen Design in WebsitesWaqas Tariq
Visual techniques for proper arrangement of the elements on the user screen have helped the designers to make the screen look good and attractive. Several visual techniques emphasize the arrangement and ordering of the screen elements based on particular criteria for best appearance of the screen. This paper investigates few significant visual techniques in various web user interfaces and showcases the results for better understanding and their presence.
Virtual teams are used more and more by companies and other organizations to receive benefits. They are a great way to enable teamwork in situations where people are not sitting in the same physical place at the same time. As companies seek to increase the use of virtual teams, a need exists to explore the context of these teams, the virtuality of a team and software that may help these teams working virtualy. Virtual teams have the same basic principles as traditional teams, but there is one big difference. This difference is the way the team members communicate. Instead of using the dynamics of in-office face-to-face exchange, they now rely on special communication channels enabled by modern technologies, such as e-mails, faxes, phone calls and teleconferences, virtual meetings etc. This is why this paper is focused on the issues regarding virtual teams, and how these teams are created and progressing in Albania.
Cognitive Approach Towards the Maintenance of Web-Sites Through Quality Evalu...Waqas Tariq
It is a well established fact that the Web-Applications require frequent maintenance because of cutting– edge business competitions. The authors have worked on quality evaluation of web-site of Indian ecommerce domain. As a result of that work they have made a quality-wise ranking of these sites. According to their work and also the survey done by various other groups Futurebazaar web-site is considered to be one of the best Indian e-shopping sites. In this research paper the authors are assessing the maintenance of the same site by incorporating the problems incurred during this evaluation. This exercise gives a real world maintainability problem of web-sites. This work will give a clear picture of all the quality metrics which are directly or indirectly related with the maintainability of the web-site.
USEFul: A Framework to Mainstream Web Site Usability through Automated Evalua...Waqas Tariq
A paradox has been observed whereby web site usability is proven to be an essential element in a web site, yet at the same time there exist an abundance of web pages with poor usability. This discrepancy is the result of limitations that are currently preventing web developers in the commercial sector from producing usable web sites. In this paper we propose a framework whose objective is to alleviate this problem by automating certain aspects of the usability evaluation process. Mainstreaming comes as a result of automation, therefore enabling a non-expert in the field of usability to conduct the evaluation. This results in reducing the costs associated with such evaluation. Additionally, the framework allows the flexibility of adding, modifying or deleting guidelines without altering the code that references them since the guidelines and the code are two separate components. A comparison of the evaluation results carried out using the framework against published evaluations of web sites carried out by web site usability professionals reveals that the framework is able to automatically identify the majority of usability violations. Due to the consistency with which it evaluates, it identified additional guideline-related violations that were not identified by the human evaluators.
Robot Arm Utilized Having Meal Support System Based on Computer Input by Huma...Waqas Tariq
A robot arm utilized having meal support system based on computer input by human eyes only is proposed. The proposed system is developed for handicap/disabled persons as well as elderly persons and tested with able persons with several shapes and size of eyes under a variety of illumination conditions. The test results with normal persons show the proposed system does work well for selection of the desired foods and for retrieve the foods as appropriate as usersf requirements. It is found that the proposed system is 21% much faster than the manually controlled robotics.
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is a task of removing ambiguity from a word, i.e. correct sense of word is identified from ambiguous sentences. This paper describes a model that uses Part of Speech tagger and three categories for word sense disambiguation (WSD). Human Computer Interaction is very needful to improve interactions between users and computers. For this, the Supervised and Unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding best suitable domain of word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word sense disambiguation
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
From the existing research it has been observed that many techniques and methodologies are available for performing every step of Automatic Speech Recognition (ASR) system, but the performance (Minimization of Word Error Recognition-WER and Maximization of Word Accuracy Rate- WAR) of the methodology is not dependent on the only technique applied in that method. The research work indicates that, performance mainly depends on the category of the noise, the level of the noise and the variable size of the window, frame, frame overlap etc is considered in the existing methods. The main aim of the work presented in this paper is to use variable size of parameters like window size, frame size and frame overlap percentage to observe the performance of algorithms for various categories of noise with different levels and also train the system for all size of parameters and category of real world noisy environment to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and Accuracy test by applying variable size of parameters. It is observed that, it is really very hard to evaluate test results and decide parameter size for ASR performance improvement for its resultant optimization. Hence, this study further suggests the feasible and optimum parameter size using Fuzzy Inference System (FIS) for enhancing resultant accuracy in adverse real world noisy environmental conditions. This work will be helpful to give discriminative training of ubiquitous ASR system for better Human Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, ubiquitous ASR system, Human Computer Interaction (HCI)
Interface on Usability Testing Indonesia Official Tourism WebsiteWaqas Tariq
Ministry of Tourism and Creative Economy of The Republic of Indonesia must meet the wide audience various needs and should reach people from all levels of society around the world to provide Indonesia tourism and travel information. This article will gives the details in the evolution of one important component of Indonesia Official Tourism Website as it has grown in functionality and usefulness over several years of use by a live, unrestricted community. We chose this website to see the website interface design and usability and to popularize Indonesia tourism and travel highlights. The analysis done by looking at the criteria specified for usability testing. Usability testing measures are the ease of use (effectiveness, efficiency, consistency and interface design), easy to learn, errors and syntax which is related to the human computer interaction. The purpose of this article is to test the usability level of the website, analyze the website interface design, and provide suggestions for improvements in Indonesia Official Tourism Website of analysis we have done before.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Designing a Rule Based Stemmer for Afaan Oromo Text
1. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 1
Designing a Rule Based Stemmer for Afaan Oromo Text
Debela Tesfaye dabookoo@yahoo.com
Faculty of Informatics/Department of
Information Science Addis Ababa
University Jimma, 378, Ethiopia
Ermias Abebe ermiasabe@gmail.com
Faculty of Informatics/Department of
Information Science Addis Ababa
University Addis Abeba, Ethiopia
Abstract
Most natural language processing systems use stemmer as a separate module in their
architecture. Specially, it is very significant for developing, machine translator, speech
recognizer and search engines. In this work, a stemming system for Afan Oromo is
presented. This system takes as input a word and removes its affixes according to a
rule based algorithm. The result of the study is a prototype context sensitive iterative
stemmer. Error counting technique was employed to evaluate the performance of this
stemmer. The errors were analyzed and classified into two different categories: under
stemming and over stemming errors. For testing purpose corpus which is collected from
different public Afaan Oromo newspapers and bulletins is used. Newspapers, bulletins
and public magazines are considered as consisting different issues of the community:
social, economical, technological and political issues. This will reduce the probability of
making the corpus biased toward some specific words that do not appear in everyday
life. According to the evaluation of the experiments, it can be concluded that an overall
accuracy of the stemmer is encouraging which shows stemming can be performed with
low error rates in high inflected languages such as Afan Oromo.
Keywords: Afan Oromo stemmer, Rule based Stemmer, Context sensitive stemmer
1. INTRODUCTION
We can find a variety of internet search engines with advanced search parameters for retrieving
documents in a document collection. During the development of search engines we can notice an
ongoing specialization on the searching features. These engines are becoming more and more
sophisticated trying to cover user’s demands to access specific information.
One of the attempts to make the search engines more effective in information retrieval was the usage of
word stemming. For grammatical reasons, documents are going to use different forms of a word, such as
organize, organizes, and organizing. Additionally, there are families of derivationally related words with
similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if
it would be useful for a search for one of these words to return documents that contain another word in
the set. Stemming makes families of derivationally related words with similar meanings represented
2. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 2
using single term. A stemming algorithm is a procedure that reduces all words with the same stem to a
common form by stripping of its derivational and inflectional suffixes [1]. Using Stemming, many
contemporary search engines associate words with prefixes and suffixes to their word stem, to make the
search broader in the meaning that it can ensure that the greatest number of relevant matches is included
in search results.
Afaan Oromo, one of the major languages that is widely spoken and used in Ethiopia, is morphologically
very productive; derivation, reduplication and compounding are also common [2]. Obviously, these
extensive inflectional and derivational features of the language are presenting various challenges for text
processing and information retrieval tasks in Afaan Oromo.
This paper describes the development and evaluation of a context sensitive rule based stemmer for Afan
Oromo. Most of the concepts are adopted from stemming algorithm developed by Porter [3]. We have
chosen to modify the stemming algorithm developed by Porter [3] because it is well known and is
frequently used in experimental IR systems.
2. STEMMING ALGORITHMS
A basic characteristic of stemming algorithm is whether it is context-free or context sensitive, which refers
to any attribute of the remaining stem.
2.1 Context-free
In context-free algorithm, no restriction is placed on the removal of a suffix and thus any ending, which
matches, is accepted for stripping.
2.1. Context sensitive
In context sensitive algorithms, however, various restrictions are placed on the usage of the suffix.
Therefore, such kind of algorithm requires the construction of suffix dictionary and the formation of a set
of rules defining the morphological context of the suffixes. The dictionary gives the exact suffix form, while
the rules define conditions to be tested in order to apply the rules.
One of the widely used context sensitive rule based stemmer is that of Porter [3]. The Porter stemmer has
five steps and applying rules within each step. Within each step, if a suffix rule matched to a word, then
the conditions attached to that rule are tested on what would be the resulting stem, if that suffix was
removed, in the way defined by the rule. For example such a condition may be, the number of vowel
characters, which are followed be a consonant character in the stem (Measure), must be greater than one
for the rule to be applied [3]. Within each phase there are various conventions to select rules, such as
selecting the rule from each rule group that applies to the longest suffix.
In context sensitive stemmers, If the set of rules defining the correct morphological context for the suffix is
satisfied, it is replaced by another string, either the null string (if the suffix is to be removed) or specified
replacement string (for example, to create nominal forms to adjectival forms). Both dictionary and rules
require careful analysis of vocabulary and language behavior, and are thus time-consuming to create.
However, generally such techniques are rewarded by high accuracy and speed, and simplicity in
implementation [4].
3. OVER VIEW OF AFAN OROMO
Afaan Oromo is one of the major African languages that is widely spoken and used in most parts of
Ethiopia and some parts of other neighbor countries like Kenya and Somalia. Currently, it is an official
3. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 3
language of Oromia state (which is the largest Regional State among the current Federal States in
Ethiopia). It is used by Oromo people, who are the largest ethnic group in Ethiopia, which amounts to
34.5% of the total population [5]. With regard to the writing system, Qubee (a Latin-based alphabet) has
been adopted and become the official script of Afaan Oromo since 1991.
Like a number of other African and Ethiopian languages, Afaan Oromo has a very rich morphology
(Oromoo, 1995). It has the basic features of agglutinative languages where all bound forms (morphemes)
are affixes. In agglutinative languages like Afaan Oromo most of the grammatical information is conveyed
through affixes (prefixes, infixes and suffixes) attached to the roots or stems. Both Afaan Oromo nouns
and adjectives are highly inflected for number and gender. In contrast to the English plural marker s (-es),
there are more than 12 major and very common plural markers in Afaan Oromo nouns (example: -oota, -
ooli, -wwan, -lee, -an, -een, -oo, etc.) [2]. Afaan Oromo verbs are also highly inflected for gender, person,
number and tenses. Moreover, possessions, cases and article markers are often indicated through affixes
in Afaan Oromo. Since Afaan Oromo is morphologically very productive, derivations and word formations
in the language involve a number of different linguistic features including affixation, reduplication and
compounding [2].
In this research, the morphological analysis of the language is organized in to 6 categories. The
categories are: nouns, pronouns and determinants, case and relational concepts, functional words, verb
and adverbs. Almost all Oromo nouns in a given text have person, number, gender and possession
markers which are concatenated and affixed to a stem or singular noun form. Like wise, determinants
have number, gender, adjectives, and quantifier markers similar to Afan Oromo nouns. Afaan Oromo
verbs are also highly inflected for gender, person, number, tenses, voice and transitivity. Furthermore,
prepositions, postpositions and article markers are often indicated through affixes in Afaan Oromo.
4. RELATED WORKS
Very limited works have been done in the past in the areas of stemming in relation to Afaan Oromo. One
of the stemmer is developed by Kekeba, Varma and Pingali[7] and it used to develop an Oromo-English
CLIR system that enable user to access and retrieve online information sources that are available in
English by using Afan Oromo queries.
The stemmer used a rule based suffix-stripping algorithms focusing on very common inflectional suffixes
of Oromo language. This light stemmer is designed to automatically remove frequent inflectional suffixes
attached to headwords (base-word forms) of Afaan Oromo. Some of the common suffixes that have been
considered in their light stemmer include gender (masculine, feminine), number (singular or plural), case
(nominative, dative), possession morphemes and other related morphological features in Afaan Oromo.
This kind of stemming techniques can have several shortcomings when applied to heavily inflection
language such as Afaan Oromo. These shortcomings can include:
• Improper removal of some affixes (part of a word might appear to be a prefix or suffix). Their
stemmer simply strips of any end of a word that matches one of the affixes in a list without any
condition to be tested. Afaan Oromoo suffixes are not quite different from non suffix endings.
Suffixes like -aa as in the case of maqaa,mucaa;-ee,-lee as in the case of killee, eelee, itilee,
mee, kee, ree; -an,-n in the case of aannan, ilkaan, Afaan, shan, kan; -tuu, -tu as in the case of
utuu, hatuu, nyaatu, baatu, kaatu are part of a root word as indicated in the above words that
should not be conflated. There fore most word endings can be part of a word and suffix as well.
So simple removal of endings can create invalid stems. But the above stemmer conflates the
words since the endings of the words matches one of the suffixes.
• The stemmer didn`t considered words formed by duplication of some characters at all. But Afaan
Oromo is rich in this kind of word formation. Most of the adjectives form the plural by reduplication
4. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 4
of the first syllable. For example words like, jajjabaa (stron-plural), gaggabaabaa (short-plural)
are formed from -jabaa and -gabaabaa by duplicating the first syllabus, respectively.
Another stemmer is developed by Wakshum[10] which use suffix table in combination with rules that
strips off suffix from a given word by looking up the longest match suffix in the suffix list[10]. List of
suffixes are compiled automatically by counting the most frequent endings. Other linguistically valid
suffixes are also included manually. The stemmer fined the longest suffixes that matchs the end of a
given word and remove. The problems he faced are similar with that of Kekeba, Varma and Pingali[7].
Some of these are: irregular formation of variants from root word, the challenge to increase the number
and complexity of the rules and words formed by duplication of some characters. The lists of the suffixes
are not linguistically valid. Out of 342 suffixes compiled by the stemmer only 70 are linguistically valid.
This is because the lists of suffixes are compiled statistically and requires no analysis of the language.
The suffixes are compiled by counting and sorting the most frequent endings. One great problem
occurred with this kind of compilation of suffixes is that during conflation frequently occurring endings
which are part of root word is considered as suffixes and removed.
Another problem is that, the compilation of stop words is also done statistically and frequently occurring
content bearing words are also included. For example,barannoo,barattoo,barnoota are varieties of the
root barat(to learn) ,duree(rich),fayyadam(to use),dhiyeess(to approach),barsiisu(to teach), barsiisa
(teacher), agarsiis (to show) and the like are include as stop word. And this indicates that frequently
occurring content bearing words of Afan Oromo are not considered by the stemmer. More than 96
content bearing words that occur frequently are included as stop word.
This necessitates the need for the detailed knowledge of the language to use such frameworks to
generate a stemmer that handles words formed by duplication of some characters and other under/over
stemming problems.
5. AFAN OROMO STEMMER
5.1. Introduction
It is not possible to apply the stemming algorithms developed for English or other languages like porter’s
[3] to Afan Oromo due to differences in the patterns of word formations and differences in their
morphologies. Some of the concepts from Porter stemmers are however adopted to develop a stemmer
for Afan Oromo. Specifically, Concepts about measure, arranging the rules in clusters ,analyzing word
formation based on the nature of their endings(for example words that attaches the suffixes –de must end
with b/g/d in Afan Oromo) are taken from porter algorithm.
Afan Oromo stemmer is based on a series of steps that each removes a certain type of affix byway of
substitution rules. These rules only apply when certain conditions hold, e.g. the resulting stem must have
a certain minimal length. Most rules have a condition based on the so-called measure. The measure is
the number of vowel-consonant sequences (where consecutive vowels or consonants are counted as
one) which are present in the resulting stem. This condition must prevent that letters which look like a
suffix but are just part of the stem will be removed. Other simple conditions on the stem are:
• Does the stem end with a vowel?
• Does the stem end with a consonant?
• Does the stem end with specific character?
• Does the 1st syllabus of the stem duplicated?
5. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 5
5.2. Extensions to Porter Stemmer`s Implementation
Most Afan Oromo words form repetition by duplicating some of the starting characters. Because this affix
exhibits certain pattern that can be recognized, the algorithm has been extended to handle them. The
original Porter stemmer [3] only treats suffixes.
5.3. Definitions of Afan Oromo Stemmer
Define Afan Oromo vowel as one of
a e i o u
Define Afan Oromo consonants as one of
b c d f g h j k l m n p
q r s t v w x y z `
Define a valid de, du, di, do, dan-ending as one of
b g d
R1 is the region after the first non-vowel following a vowel. If the word starts in vowel it is the region
before the next consonant.
R2 is the region after the first non-vowel following a vowel in R1
C1 is the firs character of a word.
5.4. Compilation of Stop Word List
As can be seen from the table, the stop word list consists of prepositions, conjunctions, articles, and
particles. The stop word list is collected and compiled based on information in A Grammatical sketch of
Written Oromo [6] and Caasluga Afaan Oromoo, Jildi I [2]. The list of linguistically valid Afan Oromo
prepositions, conjunctions, articles, particles are available on the above mentioned books.
TABLE 1: Afan Oromo stop word list examples
There are no any content-bearing words in the stop word list.
Number word English
1 kana This
2 sun That
3 ani Me?
4 inni He
5 isaan They
6 ise She
7 akka Like
8 Ana Me
9 fi and
6. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 6
5.5. The Test Set
A corpus is a collection of texts or speech stored in an electronic machine-readable format [11]. Balanced
corpus is needed to process natural language processing tasks like stemming. Balanced corpus is a
corpus that represents the words that are used in a language. As indicated in [11] texts collected from a
unique source, say from scientific magazines, will probably be biased toward some specific words that do
not appear in everyday life. Such types of corpuses are not balanced corpus so that they are not
appropriate for many natural languages processing in general and stemming in particular except for
special purposes. However, developing a balanced corpus is one of the difficult tasks in NLP research
because it requires collecting data from a wide range of sources: fiction, newspapers, technical, and
popular literatures. As a result it requires much time and human effort.
For this particular study, corpus was collected from different popular Afaan Oromo newspapers (Bariisaa,
Bakkalcha Oromiyaa and Oromiyaa) and bulletins (Qabee and Oromiyaa) to balance the corpus.
Newspapers, bulletins and public magazines are considered as consisting different issues of the
community: social, economical, technological and political issues. So that they are a potential source for
collecting balanced corpus for natural language processing tasks.This corpus is used for evaluating the
performance of the stemmer. The corpus consists of 159 sentences (the total of 1621 tokens).
5.6. Afan Oromo Stemmer Rule Clusters
Based on the information written on A Grammatical sketch of Written Oromo [6] and Caasluga Afaan
Oromoo, Jildi I [2] six rule clusters were created for Afan Oromo stemmer.
The rule-clusters are defined by similarity of their pattern in word formation, the level at which the affixes
occur in the word formation process (the most common order/sequence of Afaan Oromo suffixes (within a
given word) is: <stem> <derivational suffixes> <inflectional suffixes> <attached suffixes> [7]) and the
length of the affixes.
Thus, this stemmer removes (from the right end of a given word) first all the possible attached suffixes,
then inflectional suffixes and finally derivational suffixes step by step. This is done to reduce
computational time. Morphemes that are formed out of this sequence can also be removed though it
takes additional computational time. Complex affixes are thus removed in consecutive steps. For
example, baratootarratti (on the students) has four suffixes: -itti, -rra, -oota and -at. Therefore firs -tti, then
-rra, then -oota and finally -at is removed to get the root “bar-“.
In addition to the affix-rules, context sensitive conditions are also designed to cover some specific
phenomena. Examples of these conditions are, Endswith V/C, i.e. when remaining stem ends in a vowel
or consonant; Ends with B/G/D, i.e. when remaining stem ends in the characters B or G or D; Endswith
CC, i.e. when remaining stem ends in double consonant.
The affix-rules have the following general form:
Affix _ substitution measure-condition <additional conditions>
Where:
Affix is a valid Afan Oromo prefix or suffix
In Afan Oromo repetition (plural) is formed by duplicating the first
syllabus and it is also considered as prefix.
7. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 7
Substitution is a string which is substituted with a given affix to produce valid
stem.
Measure-condition is the number of vowel-consonant sequences (where
consecutive vowels or consonants are counted as one) which are present in the
resulting stem.
Additional conditions- additional conditions are also designed to cover some
specific phenomena. Examples of these conditions are, Endswith V/C,
The sixth cluster contains derivational suffixes whose measure is greater or equal to 1 and ends with
consonant.
e.g.
“eenya”
“offaa” 0 measure > =1 EndsWith c
“annoo”,
“ummaa”
Explanation for the sixth rule cluster:
If a word ends with one of the suffixes: “eenya”, “offaa” , “annoo”, “ ummaa” and measure> =1 and the
remaining stem ends with consonant, delete the suffix.
E.g. qabeenya(asset) ends with the suffix “eenya”,measure>=1 and the remaining stem qab- ends with
the consonant b. Therefore it is correctly stemmed to qab-.
The fifth cluster contains suffixes that are removed if measure is greater or equal to 1 or substituted with
the suffix -` if measure equal zero.
e.g.
“`aa”
“’e”,
“’u” 0 measure > =1
“’ee”,
“suu”
“sa”,
“sse”
“nya” ` measure > =0
Explanation for the fifth rule cluster:
If a word ends with one of the suffixes: “`aa”, “’e”, “’u”, “’ee”, “suu”, “sa”, “sse” ,“nya” and measure> =1,
delete the suffix. If measure =0, substitute the suffix with `( glottal).
The fourth cluster covers a special case
e.g.
“du”
“di” 0 measure > =1 EndsWith B/G/D
“dan” d measure =0
“wwan” 0 measure > =1 EndsWith VV
Explanation for the fourth rule cluster:
8. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 8
If a word ends with one of the suffixes: “du”, “di”, “dan” and measure> =1 and the remaining stem ends
with B/G/D, delete the suffix. If measure =0, substitute the suffix with d.
If a word ends with the suffix “wwan” and measure> =1 and the remaining stem ends with double
vowel delete the suffix.
The seventh cluster contain rules that conflate words formed by duplication of the first syllabus
R1 0 measure > =0 Startswith C R1=R2
0-1 measure > =0 Startswith C C1+R1 =R2
C1+` 0 measure > =0 Startswith V `+R1=R2
Explanation for the seventh rule cluster:
If R1 and R2 are the same delete R1.
If C1+R1 and R2 are the same delete R2.
If the word starts with vowel and if glota (`)+R1 equals R2 delete R2.
E.g. gaggabaaba(plural form of short),R1 is ga , R2 is gga and C1 is g. Therefore gaggabaaba is correctly
stemmed to gabaaba by removing R2, since C1+R1 equals R2.
To stem a word the, the stemmer first checks if a given word is in the stop list or not. If found in the list,
the word is excluded from further processing and nothing returned to the calling routine; stop and process
the next word if any. If the word is not in the stop list, the word is checked for any match in the rule
clusters. If a match is found, the respective action for that rule will be taken. As described earlier the rule
has conditions like measure (the number of vowel consonant sequence), ending of the remaining stem
with specific character, ending of the remaining stem with consonant, ending of the remaining stem with
short or long vowel, matching of the ending of the word with one of the suffixes and the detail for each
rule cluster is described in the privies section. The actions taken includes removing the suffixes,
substituting the suffix with another one, removing of the reduplicated characters in the case of words
formed by reduplication of some of the characters as described in the 7th rule cluster.
5.7. Evaluation of the Stemmer
In this report, error counting approach is adapted to evaluate the algorithm in terms of the number of
accurately conflated results. The number of correctly conflated words and incorrectly conflated ones are
counted for analysis. The output from the stemmer was then checked against the respective expected
valid stem. These errors were then described in terms of under stemming and over stemming:
• Over stemming
Over stemming occurs when too much of the term is removed.
• Under stemming
Under stemming occurs when too little of the term removed.
Evaluation of the effectiveness of stemming algorithms is necessary to reveal specific error patterns. This
information can subsequently be used to improve the algorithm where possible. Some error types,
however, are inherent to the suffix-stripping method and without the additional information provided by,
for instance, a dictionary, these errors cannot be avoided [8].
This stemmer is run on the test set of 5000 words which is assumed to be balanced. The literature from
which the rule of the stemmer developed is totally different from the test set. This was done deliberately in
order to predict the performance of the stemmer in the real world data.
9. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 9
The output from the stemmer indicates, Out of 5000 words 38 words (0.77%) were under stemmed and
220 words (4.39 %) were over stemmed. Totally this stemmer generates 258 words (5.16 %) stemming
error. As a result, the accuracy of the stemmer becomes 94.84%.
In terms of compression, i.e., reduction of dictionary size, percentage of compression is calculated using
the formula [9]:
C = 100 * (W - S)/W
Where,
C is the compression value (in percentage)
W is the number of the total words
S is a distinct stem after conflation
Accordingly,
Size of the data = 5000
Number of distinct stem after conflation = 1654
Hence, the percentage of compression for Afan Oromo text based on the evaluation text for this stemmer
becomes 100 * (5000- 3100) / 5000 = 38%.
Reasons for the under stemming and over stemming problems are:
1. It was difficult to come up with the complete rule because of the complexity of the language. More
conditions/rules are required based on more study of the morphology of the language.
2. Homographs are words which are spelled identically but nevertheless have a different meaning.
The algorithm does not have access to information about, for instance, word categories; the
different senses of these types of words are not distinguished.
3. The rule designed for the suffix -s is not general and conflates some terms incorrectly. Designing
general rule for Words ending in -s is challenging.
4. Some Compound words are not conflated correctly. This stemmer didn`t include any rule that
handles compound words. Even though there is no rule included to conflate compound words,
rules that are designed for non compound words can be applied and produce correct result for
most compound words. Examples of compound words that are conflated correctly are:
karadeemaa(passenger) is correctly conflated to “karadeem-”, biyyalafaa(world) is correctly
conflated to “biyyalaf-“. But manabaate(married(f)) is incorrectly conflated to “manab-“.
5. “ni” and “hin” are considered as prefixes in some literatures and independent terms in other
literatures. There fore, this stemmer didn`t include the rule that remove them when they appear
as prefix. Both cases are possible in the language.
6. CONSLUSION & FUTURE WORK
Stemming is important for highly inflected languages such as Afan Oromo for many applications that
require the stem of a word. In this work, a context sensitive rule based stemmer was developed that
attempts to determine the stem of a word according to linguistic rules. According to the evaluation of the
experiments, it can be concluded that an overall accuracy of about 94.84% is an encouraging figure. The
proposed method generates some errors. Indeed, it is possible to anticipate such considerable
10. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 10
contributions and positive effects of the stemmer since Afaan Oromo is one of the morphologically rich
and complex languages. These errors were analyzed and classified into two different categories (under
stemmed words and over stems). The error rate is about 4.27%. This shows that rule based stemmer can
be performed with low error rates in high inflected languages such as Afan Oromo.
A big step in the future improvement of the Afan Oromo stemmer can be a study on how the word
compounding and suffixes affect Afan Oromo words and their stems, and how one can include new rules
that do not affect the effectiveness of the stemming process.
Besides, further study is required to increase the effectiveness of the rule based stemmer with no or little
decrease in efficiency. Extracting other additional context sensitive conditions which is based on the study
of the morphology of the language can increase the accuracy of the stemmer.
Moreover, the stemmer has to be tested with large amount of texts to prove its real performance. To
succeed this we need to apply Afan Oromo stemmer in a web search engine, which retrieves information
from Afan Oromo texts. Then we can have a complete view of the stemming system and the returned
results after every search request. In this case we can do extended evaluation tests, we can measure the
precision and recall in various texts and we can estimate the errors distribution in the stemming results.
All the rules described in this work can be a base for this further research and it can support extended
stemming rules covering most of the terms in the Afan Oromo.
7. REFERENCES
[1] L. JB. “Development of a stemming algorithm”. Mechanical Translation and Computational
Linguistics, 11: 22-31,(1968)
[2] G. Q. A. Oromoo. “Caasluga Afaan Oromoo Jildi I”, Komishinii Aadaaf Turizmii Oromiyaa,
Finfinnee, Ethiopia, pp. 105-220 (1995)
[3] M. F Porter. “An algorithm for suffix stripping”. Program, 14(3):130–137,(1980)
[4] S. Jacques. “Stemming of French Words Based on Grammatical Categories.” Journal of
American Society for Information Science 44(1): 1-9, (1993)
[5] Census Report. “Ethiopia’s population now 76 million”. (2008) available at:
http://ethiopolitics.com/news
[6] C. G. Mewis.” A Grammatical sketch of Written Oromo”, Germany: Koln,pp. 25-99 (2001)
[7] K. K. Tune, V. Varma and P. Pingali. “Evaluation of Oromo-English Cross-Language Information
Retrieval”, Language Technologies Research Centre IIIT, Hyderabad India, (2007)
[8] W. Kraaij, R. Pohlmann.”Porter’s stemming algorithm for Dutch”. Bioinformatics, 25: 1412–1418,
(1997)
[9] L. Lessa. “development of stemming algorithm for wolaytta text”, Masters Thesis Addis Ababa
University, Fuculty of Informatics, Department of Information Science, (2003)
[10] W. Mekonen. “Development of stemming algorithm for Affan Oromo anguage text”, MSc thesis
faculty of informatics, Addis Ababa University, Addis Ababa,(2000)
11. Debela Tesfaye & Ermias Abebe
International Journal of Computational Linguistics (IJCL), Volume (1): Issue (2) 11
[11] S. Dandapat, S. Sarkar and A. Basu. “A Hybrid Model for Part-f-Speech Tagging and its
Application to Bengali”, Journal of world information society, 43(6):384–390, (2004)