WOLLEGA UNIVERSITY
INSTITUTE OF TECHNOLOGY
SCHOOL OF GRADUATE STUDIES
DEPARTMENT OF COMPUTER SCIENCE
PROGRAM: MSc (R)
WORD SEQUENCE PREDICTION FOR AFAAN OROMO USING
CONDITIONAL RANDOM FIELD APPROACH
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE
REQUIREMENT FOR THE DEGREE OF MASTER OF
SCIENCE IN COMPUTER SCIENCE
M.Sc. Thesis
BY:
Adugna Gemechu (ID: SGSGR/17/045)
Major Advisor: Dr. Getachew Mamo
Co-Advisor: Mr. Misganu Tuse
Sunday, 21 November 2021
Nekemte, Ethiopia.
WOLLEGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
P.O. Box: 395, Nekemte, Ethiopia.
__________________________________________________
APPROVAL SHEET FOR SUBMITTING FINAL THESIS
As members of the Board of Examiners of the final MSc thesis open defence, we certify that we
have read and evaluated the thesis prepared by Adugna Gemechu entitled “Word Sequence
Prediction Using Conditional Random Field”, and we recommend that the thesis be accepted as
fulfilling the thesis requirement for the Degree of Master of Science in Computer Science. It
complies with the regulations of the University and meets the accepted standards with respect to
originality and quality.
Dr. Kumera Samy _______________ _________________
Chairperson Signature Date
Mr. Tariku __________________ ________________
Internal Examiner Signature Date
Dr. Tekilu Urgessa ___________________ ________________
External Examiner Signature Date
Final Approval and Acceptance
Thesis Approved by________________________________ ______________________
_______________
Department PGC Signature Date
______________________________ ________________ _______________
Dean of College Signature Date
Certification of the Final Thesis
I hereby certify that all the corrections and recommendations suggested by the board of examiners
are incorporated into the final thesis entitled “Word Sequence Prediction Using Conditional
Random Field” by Adugna Gemechu.
______________________________ ________________ ________________
Dean of SGS Signature Date
STATEMENT OF THE AUTHOR
I, Mr. Adugna Gemechu, hereby declare and affirm that the thesis entitled “Word
Sequence Prediction Using Conditional Random Field” is my own work conducted under the
supervision of Getachew Mamo (PhD). I have followed all the ethical principles of scholarship in
the preparation, data collection, data analysis and completion of this thesis. All scholarly matter
that is included in the thesis has been given recognition through citation. I have adequately cited
and referenced all the original sources. I also declare that I have adhered to all principles of
academic honesty and integrity and I have not misrepresented, fabricated, or falsified any idea /
data / fact / source in my submission. This thesis is submitted in partial fulfilment of the
requirement for a degree from the Post Graduate Studies at Wollega University. I further declare
that this thesis has not been submitted to any other institution anywhere for the award of any
academic degree, diploma or certificate.
I understand that any violation of the above will be cause for disciplinary action by the University
and can also evoke penal action from the sources which have thus not been properly cited or from
whom proper permission has not been taken when needed.
Name: Adugna Gemechu Signature: _________________
Date:
School/Department: Computer Science
College: Engineering and Technology
Acknowledgements
Above all special thanks to God Almighty for the gift of life and His unending provision and
protection.
First, I would like to express my deepest appreciation to my advisor Dr. Getachew Mamo and
co-advisor Mr. Misganu Tuse for their prompt responses and direction, and for their continuous
support and professional guidance throughout my thesis work. I would also like to thank my
brother Mr. Alebachew Gemechu, P/Tefera Gonfa, EV/Fayera Adisu, EV/Fekede Bedasa,
Mr. Itana Fiqadu, and all other friends for their support and companionship.
Next to God, my deepest gratitude goes to my lovely wife Kume Dinsa for her memorable support
and inspiration, which has kept me going all the way, and to my family, who were always motivating
and supporting me to work hard. I would like to recognize all the people who assisted me throughout
the work, because without them I would not have come this far.
MAY THE LORD REWARD YOU ABUNDANTLY
Abstract
Word prediction is a popular machine learning task that consists of predicting the next word in a
sequence of words. The literature shows that word sequence prediction can play a great role in real-life
applications, including electronic data entry. Word prediction deals with guessing what word comes
next based on current information, and it is the main focus of this study. Even though Afaan Oromo is
spoken by a large population, little work has been done on word sequence prediction for it. Previous
work on word prediction shows that statistical methods alone are not enough for a highly inflected
language, which also needs syntactic information.
In this study, we developed an Afaan Oromo word sequence prediction model following the design
science research methodology, using the statistical Conditional Random Field approach. We used
225,352 words and 150,000 phrases to train the model, incorporating detailed parts of speech, stem/root
forms, and morphological features. The experiments ran the CRF model with window sizes of three,
five, and seven. We examined the efficacy of stem/root words, morphological features, and
part-of-speech tags in Afaan Oromo word sequence prediction.
Evaluation was performed using the developed model with keystroke savings (KSS) as the metric.
According to our tests, prediction with a CRF model using detailed part-of-speech tags yielded higher
KSS and performed slightly better than models without part-of-speech tags. Therefore, the statistical
approach with detailed POS tags and a window size of seven has good potential for word sequence
prediction for the Afaan Oromo language.
Keywords: Word sequence prediction, Stem/Root, Parts of Speech, CRF
Table of Contents
Acknowledgements............................................................................................................................... i
Abstract................................................................................................................................................ii
Table of Contents................................................................................................................................iii
List of Algorithms............................................................................................................................... vi
List of Tables ......................................................................................................................vii
List of Figures ....................................................................................................................viii
List of Appendices .............................................................................................................. ix
Acronyms and Abbreviations .............................................................................................................. x
CHAPTER ONE.................................................................................................................................. 1
1. INTRODUCTION........................................................................................................................... 1
1.1 Background of the study ............................................................................................................ 1
1.2 Statement of the problem ........................................................................................................... 4
1.3 Research Questions .................................................................................................................... 6
1.4 Objectives................................................................................................................................... 6
1.4.1 General Objectives .............................................................................................................. 6
1.4.2 Specific Objectives.............................................................................................................. 7
1.5 Scope and limitation of the study............................................................................................... 7
1.6 Significance of the Study ........................................................................................................... 7
1.7 Methodology .............................................................................................................................. 8
1.7.1 Introduction ......................................................................................................................... 8
1.7.2 Literature Review ................................................................................................................ 8
1.7.3 Data Collection.................................................................................................................... 8
1.7.4 Data preparation .................................................................................................................. 8
1.7.5 Development Techniques .................................................................................................... 9
1.7.5.1 Feature set:...................................................................................................................... 10
1.7.5.2 Condition Random Fields (CRF).................................................................................... 10
1.7.6 Training the CRF ............................................................................................... 12
1.7.7 Testing ............................................................................................................... 13
1.7.8 Development tools ............................................................................................. 13
1.7.9 Prototype Development ..................................................................................... 13
1.7.10 Evaluation ........................................................................................................ 13
CHAPTER TWO............................................................................................................................... 15
LITERATURE REVIEW.................................................................................................................. 15
2.1 Natural Language Processing................................................................................................... 15
2.2 Word Prediction ....................................................................................................................... 17
2.3 Historical Background.............................................................................................................. 17
2.4 Approaches to word sequence prediction................................................................................. 18
2.4.1 Statistical word prediction................................................................................................. 18
2.4.2 Knowledge based word prediction .................................................................................... 19
2.4.2.1 Syntactic knowledge for word prediction....................................................................... 19
2.4.2.2 Semantic prediction........................................................................................................ 19
2.4.2.3 Pragmatics prediction ..................................................................................................... 20
2.4.3 Heuristic word prediction.................................................................................................. 20
2.5 Word Prediction for Western Languages................................................................................. 20
2.5.1 Word Prediction for English ................................................................................................. 23
2.6. Word Prediction for Hebrew Language .................................................................................. 24
2.7. Word Prediction for Russian Language .................................................................................. 25
CHAPTER THREE ........................................................................................................................... 28
RELATED WORK ............................................................................................................ 28
3. Word Prediction for Ethiopian Language .................................................................................. 28
3.1. For Amharic ............................................................................................................................ 28
3.2. For Afaan Oromo .................................................................................................................... 29
3.3. Summary ................................................................................................................................. 30
CHAPTER FOUR ............................................................................................................................. 31
WORD SEQUENCE PREDICTION MODEL FOR AFAAN OROMO LANGUAGE .................. 31
4.1. Architecture of Afaan Oromo WSP Model............................................................................. 31
4.2. Morphological Analysis of Corpus ......................................................................................... 32
4.3. Building CRF Model............................................................................................................... 34
4.3.1. Root or Stem Words with Aspect......................................................................................... 34
4.3.2 Root or Stem Words with Voice ........................................................................................... 35
4.3.3 Root or Stem Words with Prefix........................................................................................... 36
4.3.4 Root or Stem Words with Prefix and Suffix ......................................................................... 36
4.3.5. Stem Words with Tense ....................................................................................................... 37
4.4. Morphological Analysis of User Input.................................................................................... 38
4.6. Surface Words......................................................................................................................... 39
CHAPTER FIVE ............................................................................................................................... 40
EVALUATION & DISCUSSION..................................................................................................... 40
5.1. Introduction............................................................................................................................. 40
5.2. Corpus Collection and Preparation ......................................................................................... 40
5.3. Implementation........................................................................................................................ 41
5.4 Test ............................................................................................................................ 42
5.4.1 Test Environments ............................................................................................. 42
5.4.2 Testing Procedure .............................................................................................. 42
5.5 Evaluation and Testing............................................................................................................. 44
5.5.1 Keystroke Savings................................................................................................................. 44
5.5.2 Test Results ........................................................................................................................... 45
5.6. Discussion ............................................................................................................................... 49
CHAPTER SIX.................................................................................................................................. 50
CONCLUSIONS & RECOMMENDATIONS.................................................................................. 50
6.1. Conclusions............................................................................................................................. 50
6.2. Recommendations................................................................................................................... 52
Reference........................................................................................................................................... 53
Appendix............................................................................................................................................ 56
ANNEXES......................................................................................................................................... 62
List of Algorithms
Algorithm 4.2: To Build a Tagged Corpus ................................................................. 33
Algorithm 4.3: To Construct Stem and Aspect CRF Model ....................................... 34
Algorithm 4.4: To Construct CRF Model for Stem and Voice ................................... 35
Algorithm 4.5: To Construct Stem with Prefix CRF Model ....................................... 36
Algorithm 4.6: To Construct CRF Model of Stem Form with Tense ......................... 37
Algorithm 4.7: To Produce Appropriate Surface Words ............................................ 39
List of Tables
Table 5.1: Test Result 1 .............................................................................................. 45
Table 5.2: Test Result 2 .............................................................................................. 46
Table 5.3: Test Result 3 .............................................................................................. 47
Table 5.4: Test Result 4 .............................................................................................. 47
Table 5.5: Test Result 5 .............................................................................................. 48
List of Figures
Figure 4.1: Architecture of AOWSP Model ............................................................... 32
Figure 5.1: User Interface of Word Sequence Prediction ........................................... 41
List of Appendices
Appendix A: Sample Corpus Used for Training ........................................................ 56
Appendix B: POST Used in AOWSP Model .............................................................. 58
Appendix D: Demonstration of Word-Based Model with Z=3 .................................. 59
Appendix E: Demonstration of Word-Based Model with Z=6 .................................. 59
Appendix F: Sample of Morphological Analysis ....................................................... 60
Annex
Annex 1: List of Suffixes with their Probability…………………………………………………………62
Acronyms and Abbreviations
AO : Afaan Oromo
AONTs : Afaan Oromo news texts
AOWSPS : Afaan Oromo Word Sequence Prediction System
AAC : Augmentative and Alternative Communication
WSP : Word Sequence Prediction
WP : Word prediction
CRF_WSPAO : Conditional Random Field Word Sequence Prediction for Afaan Oromo
CRF : Conditional Random Field
GB : Giga Byte
HMM : Hidden Markov Model
ML : Machine Learning
NL : Natural Language
NLP : Natural Language Processing
POS : Parts of Speech
DVD-RW : Rewritable Digital Versatile Disc
USB : Universal Serial Bus
KSS : Keystroke Savings
KS : Keystroke
POSBPM : POS_Based Prediction Model
POST : Part Of Speech Tag
CHAPTER ONE
1. INTRODUCTION
1.1 Background of the study
Natural language processing (NLP) is a theoretically motivated range of computational
techniques for analyzing and representing naturally occurring texts at one or more levels of
linguistic analysis for the purpose of achieving human-like language processing for a range of
tasks or applications (Wilks, 1996). In NLP, the contribution of the computer science discipline
is to develop algorithms that specify the internal representation of data and define the way
it can be efficiently processed in a computer (Bose, 2004).
The idea of giving computers the ability to process human language is as old as the idea of
computers themselves. Research in natural language processing has been going on for several
decades, dating back to the late 1940s (Wilks, 1996). During those times, it was a difficult
task, though not impossible, because it was based on human-made rules. Preparing rules
requires extensive involvement of talented linguistic experts and is time-consuming work.
In recent years, research in the area of NLP has been increasing rapidly.
These attributes shifted NLP from rule-based approaches to machine-learning approaches. The
existence of machine-learning approaches has resulted in an open and comfortable environment
that encourages the development of different NLP-based systems. NLP has thus become an
applied, rather than a purely theoretical, science.
A word prediction system facilitates the typing of text for users with physical or cognitive
disabilities. As the user enters each letter of the word, the system displays a list of most likely
completions of the partially typed word. As the user continues typing more letters, the system
updates the suggestion list accordingly. If the required word is in the list, the user can select it
with a single keystroke. Then, the system tries to predict the next word. It displays a list of
suggestions to the user, who can select the next intended word if it appears in the list.
Otherwise, the user can enter the first letter of the next word to restrict the suggestions. The
process continues until the completion of the text (Ghayoomi & Daroodi, 2006).
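To make this interaction concrete, the following is a minimal Python sketch of such a completion step over a frequency-ranked lexicon; the tiny word list and the suggestion count are illustrative assumptions, not the system developed in this thesis.

def suggest(prefix: str, lexicon: dict, k: int = 5):
    """Return up to k completions of `prefix`, most frequent first."""
    matches = [w for w in lexicon if w.startswith(prefix)]
    return sorted(matches, key=lexicon.get, reverse=True)[:k]

# Hypothetical frequency-ranked Afaan Oromo lexicon for demonstration.
lexicon = {"barataa": 120, "barumsa": 95, "barreessa": 60, "mana": 200}
print(suggest("bar", lexicon))  # ['barataa', 'barumsa', 'barreessa']

As the user types more letters, the prefix lengthens and the suggestion list narrows; selecting a suggestion replaces the remaining keystrokes for that word.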
For someone with physical disabilities, each keystroke is an effort; as a result, the prediction
system saves the user's energy by reducing his or her physical effort. Additionally, the system
assists the user in composing well-formed text, both qualitatively and quantitatively.
Moreover, the system helps to increase the user's concentration.
As human needs increase, the development of technology also increases; in this context, humans
began to use data entry techniques. One of the most prevalent and necessary techniques used
as an interface between human and machine is the data entry technique, which is helpful for
entering different kinds of data, such as text, voice, images, and movies, into the machine for
processing. Since word prediction facilitates data entry, the most commonly used data entry
techniques are presented in this section.
There are a number of data entry techniques, including speech, chorded keyboards,
handwriting recognition, various glove techniques, scanners, microphones, and digital cameras.
Keyboards and pointing devices are the most commonly used devices in human-computer
interaction. Because of their ease of implementation, higher speed, and lower error rate,
keyboards dominate text entry systems.
Prediction can be either character or word prediction, and prediction involves forecasting
what comes next based on some current information. Word prediction aims at easing word
insertion in textual software by guessing the next word, or by giving the user a list of possible
options for the next word.
The term natural language processing encompasses a broad set of techniques for automated
generation, manipulation and analysis of natural or human languages. Although most NLP
techniques inherit largely from linguistics and artificial intelligence, they are also influenced
by relatively newer areas such as machine learning, computational statistics and cognitive
science.
According to Shannon, human languages are highly redundant. These redundancies can be
captured in language models. The goal of language modelling is to capture and exploit the
restrictions imposed on the way words can be combined to form sentences; it describes how
words are arranged in natural language. Language modelling has many applications in natural
language processing, such as automatic speech recognition, statistical machine translation,
text summarization, and character and handwriting recognition. Word prediction, a natural
language processing problem that attempts to predict the most appropriate word in a given
context, uses language modelling to guess the next word given the previous words.
We learn the characteristics of the language from these prior studies, which provide hints on
how to design the system. In general, word prediction deals with predicting the correct word in
a sentence. It saves time and keystrokes and also reduces misspellings.
Afaan Oromo is one of the major languages that is widely spoken and used in Ethiopia.
Currently, it is an official language of Oromia regional state. It is used by Oromo people, who
are the largest ethnic group in Ethiopia, amounting to 34.5% of the total population
according to the 2007 census. In addition, the language is also spoken in Kenya and Somalia.
With regard to the writing system, Qubee (a Latin-based alphabet) was adopted and became
the official script of Afaan Oromo in 1991. Besides being an official working
language of Oromia regional state, Afaan Oromo is the instructional medium for primary and
junior secondary schools throughout the region and its administrative zones. Thus, the
language has a well-established and standardized written and spoken system (Tesfaye, 2011).
In addition, this study applies Conditional Random Fields (CRFs) to word prediction. CRFs
are a class of statistical modelling methods generally applied in machine learning and pattern
recognition, where they are used for structured prediction. They extend both Maximum Entropy
Models (MEMs) and Hidden Markov Models (HMMs) and were first introduced by Lafferty et
al. (2001). Whereas an ordinary classifier predicts a label for a single sample without regard to
adjacent samples, a CRF can take context into account. It is a discriminative, undirected
probabilistic graphical model, used to encode known relationships between observations and
to construct consistent interpretations.
Moreover, the language is offered as a subject from grade one throughout the schools of the
region. A few literary works, a number of newspapers, magazines, educational resources,
official credentials, and religious documents are published and available in the
language (Wakshum Tesgen, 2017).
Motivation
There are various word prediction software packages that assist users with text entry.
Swedish (Gustavii & Pettersson, 2003), English (Agarwal, n.d.), Italian (Aliprandi et al., 2007),
and Persian (Mansouri et al., 2014) are some of the languages for which word prediction studies
have been conducted lately. These studies contribute to reducing the time and effort needed to
write a text for slow typists or for people who are not able to use a conventional keyboard.
Afaan Oromo uses Qubee (a Latin-based script) as its writing system. People who use Qubee
can have difficulty typing, and Qubee uses many characters compared to English, which slows
down the typing process. To the researcher's knowledge, there have been few research attempts
on word sequence prediction for Afaan Oromo. Hence, we are motivated to work on word
sequence prediction. In addition, in Ethiopia the usage of computers and different electronic
devices is growing from day to day. However, most software programs used with these devices
are in English, while a great number of people in Ethiopia communicate using the Afaan Oromo
language. With this in mind, an alternative or assistive Afaan Oromo text entry system is useful
for speeding up text entry and helps those needing alternative communication. Hence, in this
study we focus on word sequence prediction to address this issue.
1.2 Statement of the problem
Word prediction is one of the most widely used techniques to enhance communication rate in
augmentative and alternative communication. A number of word prediction software
packages exist for different languages to assist users with their text entry. Amharic, Swedish,
English, Italian, Persian, and Bangla are some of the languages for which word prediction
studies have been conducted lately (Delbeto, 2018b).
Word sequence prediction is a challenging task for inflected languages, i.e., languages that are
morphologically rich and have enormous numbers of word forms: one word can have many
different forms. As Afaan Oromo is a highly inflected and morphologically rich language, it
shares this problem, which makes word prediction systems much more difficult to build and
results in poor performance.
Much research has been conducted to integrate the Afaan Oromo language into technology and
make its speakers beneficiaries of it. Among those studies, attempts have been made on
automatic sentence parsing, part-of-speech tagging, morphology-based spell checking, and
rule-based Afaan Oromo grammar checking in the area of NLP for Afaan Oromo (Wakshum, 2017).
The overall goal of this study is to enable computers to perform useful tasks involving the Afaan
Oromo language. The word prediction research conducted so far in Ethiopia is specific to certain
languages, such as Amharic and Afaan Oromo. Nevertheless, as far as the present researcher is
aware, studies on word prediction in Ethiopia are few.
Gudisa Tesema (Tesema & Abate, 2013) made the first attempt to design and develop word
prediction on mobile phones. In his study, he treated word prediction as a classification task:
using an SVM, a bag of words was created for the words in each class, and a predictor model
was constructed using an HMM for the prediction itself. The major focus of that work was
classifying words into classes without considering morphological information.
Since Afaan Oromo is rich in inflectional morphological features, these must be taken into
account together with context information; the former work does not consider them. The present
researcher therefore aims to fill this gap by including the morphological features of the Afaan
Oromo language and by using techniques such as CRF, which is a popular model for such tasks.
Wakshum Temesgen (Wakshum, 2017) recommended that the language model be built from
stem forms together with morphological features such as tense, case, person, gender, and
number. However, these morphological features are not sufficient when using n-gram
techniques, and the n-gram techniques did not achieve high accuracy on the basis of Afaan
Oromo POS. Considering this problem, the present researcher aims to fill the gap by using CRF
techniques over Afaan Oromo POS tags and stem/root words. Moreover, an n-gram model
cannot take large context windows into account, whereas the CRF technique used by the
researcher can.
Ashenafi Bekele (Delbeto, 2018b) conducted research to develop Afaan Oromo word sequence
prediction using an n-gram model. That system was developed based on word frequency and
syntax statistics. To fill the remaining gap, the present researcher considers the recurrence of
words in syntactic and semantic contexts, along with the highest frequencies, to make more
precise predictions using the CRF approach. Another gap is that the earlier model was tested
on a corpus that was too small, so it produced wrong word sequence prediction output. The
researcher addresses this gap by using CRF techniques for the Afaan Oromo word sequence
prediction system; these techniques are tractable and flexible for developing a word sequence
prediction system for any language. The researcher also increases the size of the corpus in
order to obtain correct word sequence prediction output.
In general, among machine learning techniques, the researchers have not seen any research
conducted to develop a WSP model using CRF for Afaan Oromo, and this method has not been
explored to its full potential for an AOWSP system.
The purpose of this research is to design and develop an Afaan Oromo word sequence prediction
model and to check the performance of this model in predicting next words from AONTs, taking
into consideration context information, morphological features, and linguistic features, including
multi-words.
1.3 Research Questions
Based on the statement of the problem given above, this study attempted to answer the
following basic research questions:
• What technique and specific method should be applied to give efficient performance for
Afaan Oromo word sequence prediction?
• How can CRF enable the prediction of word sequences from the AONT corpus?
• What is the performance of the word sequence prediction model in keystroke saving?
Thus, the current study focuses on how to implement the prediction of word sequences from
the Afaan Oromo news text corpus, aiming to answer the above questions.
1.4 Objectives
The objectives of the study are explained separately as general objective and specific
objectives.
1.4.1 General Objectives
The general objective of the study is to develop a word sequence prediction model for Afaan
Oromo using the Conditional Random Field approach.
1.4.2 Specific Objectives
The specific objectives of the proposed research are to:
i. Review the literature on word sequence prediction methods and on the structure of the
target language.
ii. Collect a representative corpus for training and testing the model.
iii. Morphologically analyze the training corpus.
iv. Design and develop the Afaan Oromo word sequence prediction model.
v. Test the performance of the developed system with different parameters.
vi. Draw conclusions from the test results.
vii. Recommend areas for further research.
1.5 Scope and limitation of the study
The aim of the study is to develop a word sequence prediction system for Afaan Oromo based
on Conditional Random Fields. We developed an Afaan Oromo word sequence prediction
model using a statistical approach incorporating syntactic information, which in our case
consists of stem/root words and part-of-speech tags. We tried to show the importance of
part-of-speech tags for obtaining better predictions. The corpus used had an impact on our
keystroke savings, since CRF needs a large amount of data for training. The focus of this study
is only on achieving efficiency of the word sequence prediction model for the Afaan Oromo
language, including the prediction procedure and the prototype user interface. We did not
evaluate with other metrics because of time limitations.
1.6 Significance of the Study
The beneficiaries of this study include researchers who are, or want to be, involved in
increasing the capability of computers to process Afaan Oromo. In particular, this study
benefits researchers devoted to Afaan Oromo predictive text entry projects, because the results
of the study provide them with concrete concepts regarding the aspects to be considered to
improve keystroke savings and reduce the cognitive load of writing. The study can also be
used as a review by other researchers.
1.7 Methodology
1.7.1 Introduction
This section describes the overall process of the study, carried out sequentially to answer the
proposed research problem effectively and efficiently, accompanied by appropriate
performance evaluations. Each part of the process is explained separately below.
To achieve the objectives of the research, we used a number of methods.
1.7.2 Literature Review
Research and related works were thoroughly reviewed to build firm knowledge with the
intention of developing an appropriate word sequence prediction model for Afaan Oromo.
In addition, discussions were held with Afaan Oromo linguistics experts regarding the
linguistic nature of the language, such as the grammatical structure and morphology of
Afaan Oromo.
1.7.3 Data Collection
Generally, since the intention of this research is to detect word sequences in Afaan Oromo
news texts and to categorize them into their respective types, the appropriate data were
gathered from Afaan Oromo news texts currently used in communication media. With any
learning technique, one important question is how much training data is required to achieve
acceptable performance. In order to achieve reasonable accuracy in a WSP system, it is
desirable to have a large corpus, and the determination of high-order training parameters is
inevitable.
1.7.4 Data preparation
In this case, after gathering the necessary data, each item having at least one predicted word
sequence, pre-processing has to be done, because no pre-processed corpus data is available for
Afaan Oromo yet. This is the biggest hindrance to the advancement of research on Afaan
Oromo language processing applications. Pre-processing steps are done before the training
process in order to normalize all data sets and reduce merging errors. These pre-processing
tasks include the following.
Annotation – This is the process by which the training corpora are converted into a format
that can be applied in the training module. It includes the following steps (a tokenization
sketch follows the list):
i) Tokenization – The whole document is split into its constituent sentences, and each
sentence is further split into its constituent words to ease the subsequent training process.
ii) POS Tagging – The process of assigning the correct part of speech to each word in a
sentence. In a WSP system, POS tagging helps to automatically identify all words that are
candidates to be taken as word predictions according to their POS tags, and it describes how
likely nearby words are to be taken as a single prediction unit based on their POS tags, in
order to chunk them together.
iii) Chunking – The process of identifying and breaking a large text into logical prediction
units.
iv) WSP Tagging – After all the above processes are done, each word and chunked phrase
is tagged with its corresponding word sequence. Finally, the WS-tagged (annotated) corpus
is in a trainable form and can be applied in subsequent processes.
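As an illustration of the tokenization step, here is a minimal Python sketch; it uses a simple regular expression rather than a trained, language-specific tokenizer, and the sample sentences are hypothetical.

import re

def tokenize(document: str):
    """Split a document into sentences, then each sentence into word tokens."""
    # Sentences end with ., ?, or !; tokens are runs of letters
    # (including the apostrophe used in Qubee spelling) or digits.
    sentences = [s.strip() for s in re.split(r"[.?!]+", document) if s.strip()]
    return [re.findall(r"[A-Za-z']+|\d+", s) for s in sentences]

# Hypothetical Afaan Oromo sample text for demonstration.
doc = "Barataan kitaaba dubbisa. Isheen gara mana barumsaa deemti."
for tokens in tokenize(doc):
    print(tokens)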
1.7.5 Development Techniques
There are many approaches to developing WSP for any natural language (NL). These
approaches are broadly divided into the rule-based approach, the machine learning (ML)
approach, and the hybrid approach. The ML (statistical) approach is subdivided into
supervised, semi-supervised, and unsupervised approaches. Among all of these, the present
research uses the statistical method to develop a WSP system for Afaan Oromo. The reason
is that, in the statistical approach, a corpus is first studied; based on the corpus, a training
module is built in which the system is trained to identify word predictions, and then, from
their occurrences in the corpus within a particular context and class, a probability value is
computed. Every time text is entered, the result is fetched based on these probability values.
For better effectiveness, a large amount of annotated training data is required. Further, among
the ML approaches, the current study uses the supervised approach, and among its variants,
CRF is the technique proposed for this work.
1.7.5.1 Feature set
The set of features applied to the Afaan Oromo WSP (AOWSP) task consists of context word
features (the previous and next words of a particular word), the suffix and prefix of the
current, previous, and/or next token, the part-of-speech (POS) tags of the current and/or
surrounding word(s), and morphological features such as case, aspect, voice, and tense.
A sketch of such a feature extractor is shown below.
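To make the feature set concrete, here is a minimal sketch of a per-token feature extractor in the style commonly used with CRF toolkits; the field names, window handling, and affix length are illustrative assumptions rather than the exact features used in this work.

def token_features(sent, i):
    """Extract CRF features for the token at position i.

    `sent` is a list of (word, pos, morph) triples, where `morph` is a
    dict of morphological attributes (e.g. case, aspect, voice, tense).
    """
    word, pos, morph = sent[i]
    feats = {
        "word": word.lower(),
        "pos": pos,
        "prefix3": word[:3],   # leading characters of the token
        "suffix3": word[-3:],  # trailing characters of the token
    }
    # Morphological attributes such as case, aspect, voice, tense.
    feats.update({"morph_" + k: v for k, v in morph.items()})
    # Context features: previous and next word with their POS tags.
    if i > 0:
        feats["prev_word"] = sent[i - 1][0].lower()
        feats["prev_pos"] = sent[i - 1][1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1][0].lower()
        feats["next_pos"] = sent[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats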
1.7.5.2 Conditional Random Fields (CRF)
As introduced in Section 1.1, conditional random fields (CRFs) are a class of statistical
modelling methods often applied in pattern recognition and machine learning and used for
structured prediction. Whereas a classifier predicts a label for a single sample without
considering "neighbouring" samples, a CRF can take context into account. To do so, the
prediction is modelled as a graphical model, which implements dependencies between the
predictions. The kind of graph used depends on the application; for example, in natural
language processing, linear-chain CRFs are popular, as they implement sequential
dependencies in the predictions (Wikipedia, 2017).
A conditional random field defines a conditional probability distribution P(Y|X) of a label
sequence given an input word sequence, where Y is the class-label sequence and X denotes
the observed word sequence. A commonly used special case of CRFs is the linear chain,
which has the distribution

$$P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right) \qquad (1)$$

where $f_k(y_{t-1}, y_t, x, t)$ is usually an indicator function, $\lambda_k$ is the learned weight
of the feature, and $Z_x$ is the normalization factor that sums the probability over all state
sequences. The feature functions can measure any aspect of a state transition,
$y_{t-1} \to y_t$, and of the entire observation sequence $x$, centred at the current time step
$t$. Yeh et al. (2015), for example, use three conditional random field models to calculate the
conditional probabilities of missing sentences, redundant sentences, disordered sentences, and
erroneous word selections.
Advantages of CRF
• Language independent: it can be used for any language domain.
• Easily understandable, and easy to implement and analyse.
• Usable with any amount of data, so the system is scalable.
• Able to solve sequence labelling problems very efficiently.
• Has no fixed states: it can be used with any number of hidden tag sets, so it is dynamic
in nature (Ammar et al., 2014).
Generally, CRF is used in the WSP system because of the efficiency of the Viterbi algorithm
used in decoding the WP-class state sequence.
1.7.6 Training the CRF
After completion of the annotation tasks, the annotated data is sent to the CRF training
module, which estimates the necessary parameters of the CRF; these are used in subsequent
processes to decide on the optimal tag sequence. In this module, the basic task is the
estimation of the CRF parameters on the training data. The relevant quantities are the
following:
A) States – The various types of tags in the tagged corpus are the CRF states. The word
sequence tag types within the scope of the current study are first stored as a state vector.
B) Start probability – The probability that a particular prediction tag appears at the start of
a sentence in the training corpus. This probability is an important factor: based on the
first appearing tag, it is possible to determine what the succeeding tag will likely be.
That is, $P_{start}(T_i) = count(\text{sentences starting with } T_i) / count(\text{sentences})$.
C) Transition probability – The probability of the occurrence of one word tag ($T_i$)
following the immediately preceding word tag ($T_{i-1}$). That is,
$P(T_i \mid T_{i-1}) = count(T_{i-1}, T_i) / count(T_{i-1})$.
D) Emission probability – The probability of assigning a particular tag to a word in the
corpus or document. That is, $P(w \mid T_i) = count(w, T_i) / count(T_i)$.
The gradient of the log-likelihood with respect to each weight $\lambda_k$ is the difference
between the empirical count of feature $f_k$ in the training data and its expected count under
the model; when the two are the same, the derivative is zero. Therefore, training can be
thought of as finding the $\lambda$'s that match the two counts (Groza et al., 2012).
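In symbols, for a single training pair (x, y) and with the notation of Equation (1), this standard gradient of the log-likelihood (a sketch; any regularization term is omitted) is

$$\frac{\partial \log P_\Lambda(y \mid x)}{\partial \lambda_k} = \sum_{t=1}^{T} f_k(y_{t-1}, y_t, x, t) - \sum_{t=1}^{T} \sum_{y',\, y''} P_\Lambda(y_{t-1}=y',\, y_t=y'' \mid x)\, f_k(y', y'', x, t)$$

where the first term is the empirical count of feature $f_k$ in the training data and the second is its expected count under the current model.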
1.7.7 Testing
As soon as knowledge of the optimal tagging is available, the testing phase can be performed.
After all the parameters have been calculated, they are applied in the Viterbi algorithm, with a
separate testing sentence as the observation, to find the word prediction. First of all, the
testing sentence is tokenized, POS tagged, and chunked, and each token is passed to the
Viterbi algorithm, which decides the tag for the token. The decision is based on the
knowledge acquired from the trained corpus. The idea behind the algorithm is that, of all the
state sequences, only the most probable ones need to be considered. Finally, the output of this
algorithm is the most likely sequence of tags generating the given sequence of observations,
according to the highest probability value it yields (Geleta, 2020). A sketch of the decoding
step follows.
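The following is a minimal Python sketch of Viterbi decoding over the start, transition, and emission parameters described in Section 1.7.6; the dictionary-based data layout and the smoothing floor for unseen observations are illustrative assumptions.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for the observations.

    start_p[s]: probability that a sentence starts in state s.
    trans_p[s1][s2]: probability of moving from state s1 to state s2.
    emit_p[s][o]: probability of state s emitting observation o.
    """
    FLOOR = 1e-12  # floor probability for unseen events
    # best[t][s]: probability of the best path ending in state s at step t.
    best = [{s: start_p.get(s, FLOOR) * emit_p[s].get(observations[0], FLOOR)
             for s in states}]
    back = [{}]  # backpointers for path recovery
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((s0, best[t - 1][s0] * trans_p[s0].get(s, FLOOR))
                 for s0 in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * emit_p[s].get(observations[t], FLOOR)
            back[t][s] = prev
    # Trace back from the most probable final state.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))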
1.7.8 Development tools
A morphological analyser and generator program is used to build the tagged training corpus
and to produce surface words. Similarly, a morphological analyser is used to analyse the
user's input words; a morphological analyser and synthesizer tool is freely available for Afaan
Oromo. Moreover, the Python programming language is used to develop the prototype for
demonstration. Python was selected because it has a natural language toolkit module that
provides predefined functions for the implementation of CRF modelling; a usage sketch is
shown below.
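As an illustration, NLTK exposes a CRF sequence tagger (backed by the external python-crfsuite package) behind a simple interface; the training pairs below are hypothetical stand-ins for the annotated Afaan Oromo corpus, not data from this study.

from nltk.tag import CRFTagger  # requires the python-crfsuite package

# Hypothetical annotated sentences: lists of (token, tag) pairs standing
# in for the tagged Afaan Oromo training corpus.
train_data = [
    [("Barataan", "NN"), ("kitaaba", "NN"), ("dubbisa", "VB")],
    [("Isheen", "PP"), ("mana", "NN"), ("deemti", "VB")],
]

tagger = CRFTagger()
tagger.train(train_data, "aowsp.crf.model")  # writes the model file

# Tag a new, unseen sentence with the trained model.
print(tagger.tag(["Inni", "kitaaba", "dubbisa"]))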
1.7.9 Prototype Development
A prototype is developed in order to check whether our study works in accordance with the
ideas and theories of word sequence prediction.
1.7.10 Evaluation
Prototype development is one of the objectives of this study, in order to demonstrate and
evaluate the developed model. Test data tagged with morphological features is used, and the
prediction activity is evaluated through the calculation of keystroke savings. Keystroke
savings (KSS) estimates the percentage of effort saved and is calculated by comparing the
total number of keystrokes needed to type a text (KT) with the effective number of keystrokes
using word prediction (KE) (Aliprandi et al., 2008). Hence,

$$KSS = \frac{K_T - K_E}{K_T} \times 100\%$$

Therefore, the numbers of keystrokes needed to type texts taken from the test data with and
without the word sequence prediction program are counted to calculate keystroke savings
accordingly. The obtained KSS values are compared across the CRF models, and the model
that shows the maximum keystroke saving is considered the better model. A small
computation sketch follows.
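A minimal Python sketch of this computation, assuming the two keystroke totals have already been counted; the sample numbers are purely illustrative.

def keystroke_savings(kt: int, ke: int) -> float:
    """Keystroke savings: KSS = (KT - KE) / KT * 100.

    kt: keystrokes needed to type the text without prediction.
    ke: effective keystrokes needed with word prediction enabled.
    """
    return (kt - ke) / kt * 100.0

# Illustrative example: 1,000 keystrokes without prediction and
# 620 with prediction give a KSS of 38.0%.
print(keystroke_savings(1000, 620))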
CHAPTER TWO
LITERATURE REVIEW
2.1 Natural Language Processing
Words are the fundamental building block of language. Every human language, spoken,
signed, or written, is composed of words. Every area of speech and language processing,
from speech recognition to machine translation to information retrieval on the web, requires
extensive knowledge about words. Psycholinguistic models of human language processing
and models from generative linguistics are also heavily based on lexical knowledge (Pollock
et al., 2018).
Natural Language Processing (NLP) is the computerized approach to analysing text that
is based on both a set of theories and a set of technologies, and it is a very active area of
research and development.
Natural Language Processing is a theoretically motivated range of computational techniques
for analysing and representing naturally occurring texts at one or more levels of linguistic
analysis for the purpose of achieving human-like language processing for a range of tasks or
applications (Wilks, 1996).
NLP technologies are becoming extremely important in the creation of user-friendly
decision-support systems for everyday non-expert users, particularly in the areas of
knowledge acquisition, information retrieval, and language translation (Bose, 2004).
An NLP system must possess considerable knowledge about the structure of the language
itself, including what the words are, how to combine the words into sentences, what the
words mean, how these word meanings contribute to the sentence meaning, and so on. The
system would need methods of encoding and using this knowledge in ways that will produce
the appropriate behaviour. Furthermore, the knowledge of the current situation (or context)
plays a crucial role in determining how the system interprets a particular sentence (Bose,
2004).
After more than sixty years, natural language systems are still very complicated to design,
and they are still not perfect, because human language is complex and it is difficult to capture
the entire linguistic knowledge needed for one hundred percent accuracy in processing.
However, NLP technologies are becoming extremely vital for easing access to systems,
thereby making those systems more user-friendly.
In the future, NLP is expected to further improve existing systems, and we hope to see
technologically intelligent and more user-friendly systems. NLP provides a good baseline for
both the theory and the implementation of a range of applications; in fact, any application
associated with text is a candidate for NLP.
The researcher explains the basic ideas of this discipline before dealing with the main
concept, which is word sequence prediction. This chapter therefore discusses the fundamental
concepts of word sequence prediction and ideas associated with the Afaan Oromo language.
Prediction methods such as statistical, knowledge-based, and heuristic methods are presented
in order to give a clear overview of the topic. The main target of this study is to design and
develop a word sequence prediction model for the Afaan Oromo language; hence, the
morphological characteristics, grammatical properties, and parts of speech of the language
are discussed in the respective sections of this chapter.
2.2 Word Prediction
Word prediction is about guessing what word the user intends to write for the purpose of
facilitating the text production process. Sometimes a distinction is made between systems that
require the initial letters of an upcoming word to make a prediction and systems that may
predict a word regardless of whether the word has been initialized or not. The former systems
are said to perform word completion while the latter perform proper word prediction
(Gustavii et al., 2003).
We can easily define the term word sequence prediction once we capture the essence of its
essential components which are “sequence” and “prediction”. A sequence is a finite or
infinite list of terms (or numbers or things) arranged in a definite order, that is, there is a rule
by which each term after the first may be found. Prediction is concerned with guessing the
short-term evolution of certain phenomena. Forecasting tomorrow‟s temperature at a given
location or guessing which asset will achieve the best performance over the next month could
be examples of prediction problems. One must predict the next element of an unknown
sequence given some knowledge about the past elements and possibly other available
information. The entities involved in forecasting task are the elements forming the sequence,
the criterion used to measure the quality of a forecast, the protocol specifying how the
predictor receives feedback about the sequence, and any possible side information provided
to the predictor. Therefore, word sequence prediction is forecasting or guessing the next word
the user intends to write or to insert, based on some previous information (n.d., 2011).
The first users of word prediction systems have traditionally been physically disabled
persons. For people with motor impairments that make it difficult to type, a word prediction
system would optimally reduce the spelling errors, time, and effort needed for producing a
text, as the number of keystrokes decreases.
2.3 Historical Background
The inception of the concept of word prediction takes us back to the end of the Second World
War, when the number of people with disabilities increased dramatically. In order to help
them communicate with the outside world, assistive technologies such as Augmentative and
Alternative Communication systems were developed. The field of Augmentative and
Alternative Communication (AAC) is concerned with mitigating communication barriers that
would isolate individuals from society. Basically, one way to improve communication rate is
to decrease the number of keys entered to form a message, and the goal of saving keystrokes
requires estimating the next letter, word, or phrase likely to follow a given segment of text.
As a result, in the early 1980s word prediction techniques became established as a method in
the development of AAC systems. Since the 1980s, many systems with different methods
have been developed for different languages. According to Shannon, human languages are
highly redundant, and these redundancies can be captured in language models. To this end,
the goal of language modelling is to capture and exploit the restrictions imposed on the way
words can be combined to form sentences; it describes how words are arranged in natural
language. Word prediction also applies language modelling to guess the next word given the
previous words (Wakshum, 2017).
2.4 Approaches to word sequence prediction
Word prediction methods can be classified as statistical, knowledge-based, and heuristic
(adaptive) modelling. Most existing methods employ statistical language models using word
n-grams and POS tags.
2.4.1 Statistical word prediction
In statistical modelling, the choice of words is based on probability that a string may appear
in a text. The statistical information and its distribution could be used for predicting letters,
words, and phrases. Statistical word prediction is made based on Markov assumption in
which only the last n-1 words of the history affects succeeding word and it is named n-gram
Markov model. It is based on learning parameters from large corpora. However, one of the
challenges in this method is when a language that is being written with the help of word
prediction system is of a different style than the training data(Tesfaye, 2011).
Traditionally, predicting words has solely been based on statistical modelling of the language
In statistical modelling, the choice of words is based on the probability that a string may
appear in a text. Consequently, a natural language could be considered as a stochastic system.
Such a modelling is also named probabilistic modelling. The statistical information and its
distribution could be used for predicting letters, words, phrases, and sentences (Ghayoomi &
Momtazi, 2009).
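To make the n-gram idea concrete, the following minimal Python sketch ranks candidate next
words by bigram frequency under the Markov assumption. The toy corpus and the function
name are illustrative only and are not taken from any of the cited systems.

from collections import defaultdict

# Toy corpus; the Afaan Oromo words here are illustrative only.
corpus = "inni kaleessa dhufe inni har'a dhufa isheen kaleessa deemte".split()

# Count how often each word follows each preceding word (bigram counts).
bigram_counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(prev_word, k=3):
    """Rank candidate next words by bigram frequency (Markov assumption)."""
    candidates = bigram_counts.get(prev_word, {})
    return sorted(candidates, key=candidates.get, reverse=True)[:k]

print(predict_next("inni"))  # e.g. ["kaleessa", "har'a"]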
19
2.4.2 Knowledge based word prediction
Systems that rely purely on statistical modelling often predict words that are syntactically,
semantically, or pragmatically inappropriate, which imposes a heavy cognitive load on the
user to choose the intended word and thereby decreases writing speed. Knowledge-based
modelling addresses this by omitting inappropriate words from the prediction list, giving
more accurate suggestions to the user (Makkar et al., 2015). To this end, syntactic, semantic
and pragmatic linguistic knowledge can be incorporated into prediction systems (Tesfaye,
2011).
2.4.2.1 Syntactic knowledge for word prediction
Syntactic prediction tries to present words that are syntactically appropriate at that position
in the sentence; that is, it uses the syntactic structure of the language. In syntactic prediction,
Part-of-Speech (POS) tags of all words in a corpus are identified and the system uses this
syntactic knowledge for prediction. Statistical syntax and rule-based grammar are the two
general syntactic prediction methods; the statistical variant includes various probabilistic
and parsing methods such as Markov models and artificial neural networks (Ghayoomi &
Momtazi, 2009).
2.4.2.2 Semantic prediction
Some items in the prediction list can be semantically wrong even though they are
syntactically right, so suggesting words that are both syntactically and semantically correct
increases the accuracy of the predictions. To reach this goal, semantic knowledge is tagged
to the words and phrases in a corpus. In semantic prediction, the appearance of a specific
word in a particular context is mostly a clue that raises the probability of other words that
have semantic relationships to it. PROSE is an example of a system that used semantic
knowledge of the English language (Ghayoomi & Momtazi, 2009).
Two methods are used for semantic prediction. The first relies on a lexical resource, such as
WordNet for English, which measures the semantic probability of words to ensure that the
predicted words are semantically related to the context. The second is lexical chaining,
which assigns the highest priority to words that are semantically related to the context;
words unrelated to the context are removed from the prediction list.
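As an illustration of the lexical-resource approach, the sketch below re-ranks a candidate
list by WordNet relatedness using NLTK. It assumes NLTK with the 'wordnet' data is
installed; the context word and candidate list are hypothetical, and this is only one simple
way to realize the idea described above, not the method of any cited system.

from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus

def semantic_score(candidate, context_word):
    """Highest WordNet path similarity between any senses of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(candidate)
              for s2 in wn.synsets(context_word)]
    return max(scores, default=0.0)

# Re-rank a prediction list so words related to the context come first.
predictions = ["box", "fox", "theorem"]
ranked = sorted(predictions, key=lambda w: semantic_score(w, "chocolate"), reverse=True)
print(ranked)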
2.4.2.3 Pragmatic prediction
Predictions can also be refined by pragmatics. Adding this method to the prediction
procedure filters out words that may be syntactically and semantically correct but wrong
according to the discourse. The pragmatic knowledge is likewise tagged to the words in a
corpus, and suggesting words that are pragmatically correct further increases the accuracy of
predictions (Delbeto, 2018a).
2.4.3 Heuristic word prediction
The heuristic (adaptation) method makes predictions more appropriate for a specific user
and is based on two general learning strategies: short-term and long-term learning. In
short-term learning, the system adapts to a user within the current text being typed; recency
promotion, topic guidance, trigger-and-target, and n-gram caching are methods a system can
use to adapt to a user within a single text. In long-term learning, the previous texts produced
by a user are also considered (Tessema, 2014).
2.5 Word Prediction for Western Languages
Several studies have been conducted on word prediction for western languages such as
Italian, Swedish, English, German, French, and Dutch. Aliprandi et al. (Aliprandi et al.,
2007) designed a letter and word prediction system for Italian called FastType. Italian has a
large dictionary of word forms carrying a number of morphological features, produced from
a root or lemma and a set of inflection rules. Statistical and lexical methods are used with
robust open-domain language resources, which have been refined to improve keystroke
saving. The user interface, predictive engine and linguistic resources are the main
components of the system. The predictive engine is the kernel of the predictive module,
since it manages communication with the user interface while keeping track of the
prediction status and the words already typed.
The core functionalities of the predictive module are morpho-syntactic agreement, lexicon
coverage, and efficient access to linguistic resources such as the language model and very
large lexical resources. In addition, POS n-grams and tagged-word (TW) n-grams are used to
enrich the morphological information available to the prediction engine.
The prediction algorithm for Italian extends a combination of a POS tri-gram model and a
simple word bi-gram model. A large corpus prepared from newspapers, magazines,
documents, commercial letters and emails is used to train the Italian POS n-grams,
approximated to n = 2 (bi-grams) and n = 3 (tri-grams), and the tagged-word n-grams,
approximated to n = 1 (uni-grams) and n = 2 (bi-grams). Keystroke saving (KS), keystrokes
until completion (KUC) and word type saving (WTS) are the three parameters used to
evaluate the system. The researchers indicate that 40 texts disjoint from the training set were
used for testing; however, the size or number of words in the testing data is not clearly
specified. The result shows 51% keystroke saving, which is comparable to what word
prediction methods achieve for non-inflected languages. Moreover, on average 29% WTS (a
saving in time at standard typing speed without any cognitive load) and 2.5 KUC are
observed.
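Keystroke saving, the main metric here and in the studies that follow, has a simple
definition: the percentage of key presses avoided relative to typing the full text. A minimal
sketch follows; the function name and example figures are ours, not from the cited work.

def keystroke_saving(chars_without_prediction, keystrokes_with_prediction):
    """KS (%) = percentage of keystrokes avoided thanks to prediction."""
    return 100.0 * (chars_without_prediction - keystrokes_with_prediction) \
           / chars_without_prediction

# Example: a 20-character sentence finished with 12 key presses
# (typed letters plus list selections) gives 40.0% keystroke saving.
print(f"{keystroke_saving(20, 12):.1f}% KS")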
Moreover, Matiasek et al. (Matiasek et al., 2002) carried out a multilingual text prediction
study and developed a system named FASTY. The aim of this work is to offer a
communication support system that significantly increases typing speed and adapts to users
with different languages and strongly varying needs. It follows a generic approach in order
to be multilingual, so that the concept can be used for most European languages; however,
the study focused on German, French, Dutch and Swedish. Language-independent prediction
software separates the predictor from the language-specific resources, giving the system
potential application to many European languages without sacrificing performance.
Preliminary experiments with German, as well as experience with a Swedish system, have
shown that n-gram based methods still offer quite reasonable predictive power. N-gram
statistics, morphological processing with a backup lexicon, and abbreviation expansion are
the core components of the system. The frequency tables of word n-grams are easily
constructed from text corpora irrespective of the target language, and incorporating
Part-of-Speech (POS) information provides additional precision. The combination of
different n-gram statistics constitutes the base of the FASTY predictor, providing a baseline
performance for all target languages; other modules interact with these results and improve
on them.
Since one of FASTY's goals is to suggest only word forms appropriate for the current
context, morphological analysis and synthesis are performed and the morpho-syntactic
features needed by the components that check syntactic appropriateness are extracted.
Compound prediction also needs morpho-syntactic information about compound parts to
correctly predict linking elements. Finally, if the frequency-based lexica run out of words
with a given prefix, the morphological lexicon serves as a backup lexicon and delivers
additional suggestions. Morphological processing is implemented via finite-state
transducers, which provide very fast, bi-directional processing and allow a very compact
representation of huge lexica.
The grammar-based module enhances the predictive power of FASTY and improves its
precision by using syntactic processing to deliver only predictions that do not conflict with
the grammar.
Carlberger et al. (Carlberger et al., 2015) conducted a study on constructing a database for
the Swedish word prediction system Profet, extending the available system, which uses a
word frequency lexicon, a word-pair lexicon, and subject lexicons. Profet is a statistics-based
word prediction system that has been used for a number of years as a writing aid by persons
with motoric disabilities and linguistic impairments. Depending on the selected settings, it
offers one to nine word alternatives as a user starts spelling a word.
The main task of this work is to enhance the existing prediction capability by extending the
scope, adding grammatical, phrasal and semantic information, and using a probability-based
system that allows information from multiple sources to be weighted appropriately for each
prediction. The predictor scope is extended by considering preceding words, so prediction is
based on previous words even before any letters of the new word are typed. This makes the
word suggestions grammatically more correct than those previously given. Since the
available database lacked grammatical information as well as statistics for sequences longer
than two contiguous words, a new database was built. Besides bi-grams (word and
grammatical-tag pairs with co-occurrence statistics), tri-grams as well as collocations
(non-contiguous sequential word and grammatical-tag bi-grams with 2-5 intervening words)
are included. All information in the new database, including collocations, must be extracted
from one single corpus in order to warrant implementation of a probabilistic prediction
function. This work also extends the previous version of Profet, which presented one word
per line, by displaying more than one word per line. The authors report that choosing words
from the word alternatives can result in up to 26% keystroke savings (KSS), and up to 34%
in letters when only one word is typed.
2.5.1 Word Prediction for English
Antal van den Bosch (Van Den Bosch, 2005) proposed a classification-based word
prediction model built on IGTREE, a decision-tree induction algorithm with favorable
scaling abilities. Token prediction accuracy, token prediction speed, number of nodes and
discrete perplexity are the evaluation metrics used in this work. Through a first series of
experiments, the author demonstrates that the system exhibits log-linear increases in
prediction accuracy and decreases in discrete perplexity (a new evaluation metric) with
increasing numbers of training examples, while the induced trees grow linearly with the
amount of training examples. Trained on 30 million words of newswire text, prediction
accuracy reaches 42.2% on the same type of text. A second series of experiments shows that
this generic approach to word prediction can be specialized to confusable prediction,
yielding high accuracies on nine example confusable sets in all genres of text. The
confusable-specific approach outperforms the generic word-prediction approach, but the
difference decreases with more data.
Agarwal and Arora (Agarwal, n.d.) proposed a context-based word prediction system for
SMS messaging in which context is used to predict the most appropriate word for a given
code. The growth of wireless technology has provided alternative ways of communication
such as the Short Message Service (SMS), and with the tremendous increase in mobile text
messaging there is a need for an efficient text input system. With limited keys on the mobile
phone, multiple letters are mapped to the same number (8 keys, 2 to 9, for 26 letters). This
many-to-one mapping of letters to numbers yields the same numeric code for multiple
words. The T9 system predicts the correct word for a given numeric code based on
frequency, which often does not give the correct result. For example, for code "63", two
possible words are "me" and "of". Based on a frequency list where "of" is more likely than
"me", the T9 system will always predict "of" for code "63", so for a sentence like "Give me
a box of chocolate" the prediction would be "Give of a box of chocolate". Yet the sentence
itself gives information about what the correct word for a given code should be: consider the
sentence with blanks, "Give _ a box _ chocolate". Current word prediction systems for text
messaging predict the word for a code based on its frequency in a huge corpus; however, the
word at a particular position in a sentence depends on its context, and this intuition
motivated the authors to use machine learning algorithms to predict a word based on its
context. The system also maps codes corresponding to words in informal language to the
proper English words. The proposed method uses machine learning algorithms to predict the
current word given its code and the previous word's part of speech (POS). Training was
done on about 19,000 emails and testing on about 1,900 emails, each consisting of 300
words on average. The results show a 31% improvement over traditional frequency-based
word estimation.
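The many-to-one keypad mapping behind this ambiguity is easy to reproduce. The sketch
below assumes the standard phone keypad layout (it is not code from the cited paper) and
shows why "me" and "of" collide on code "63".

# Standard phone keypad: each letter maps to one digit from 2-9.
KEYPAD = {c: d
          for d, letters in {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
                             '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}.items()
          for c in letters}

def t9_code(word):
    """Map a word to its numeric keypad code; many words share one code."""
    return ''.join(KEYPAD[c] for c in word.lower())

print(t9_code("me"), t9_code("of"))  # both print 63 -- the ambiguity above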
Trnka (Trnka, 2010) conducted research on topic-adaptive language modelling for word
prediction. AAC devices are highly specialized keyboards with speech synthesis, typically
providing single-button input for common words or phrases but requiring a user to type
letter-by-letter for other words, called the fringe vocabulary. Word prediction helps speed up
the AAC communication rate. Previous research by different scholars used n-gram models;
at best, modern devices utilize a trigram model and very basic recency promotion. However,
one of the lamented weaknesses of n-gram models is their sensitivity to the training data.
The objective of this work is to develop and integrate style adaptations, drawing on the
experience of topic models, so as to adapt dynamically to both topic and style. The authors
address the problem of balancing training size and similarity by dynamically adapting the
language model to the most topically relevant portions of the training data, and present the
results of experimenting with different topic segmentations and relevance scores in order to
tune existing methods for topic modelling. The inclusion of all the training data as well as
the use of frequencies addresses the problem of sparse data in an adaptive model. They
demonstrate that topic modelling can significantly increase keystroke savings both for
traditional testing and for testing on text from other domains. They also address the problem
of annotated topics through fine-grained modelling and find that it too is a significant
improvement over a baseline n-gram model.
2.6. Word Prediction for Hebrew Language
Netzer et al. (Word Prediction in Hebrew, n.d.) conducted research on word prediction for
Hebrew as part of an effort for Hebrew AAC users. Modern Hebrew is characterized by rich
morphology with a high level of ambiguity; morphological inflections such as gender,
number, person, tense and construct state appear in Hebrew lexemes. In addition, better
predictions are achieved when the language model is trained on a larger corpus. In this work,
the hypothesis that additional morpho-syntactic knowledge is required to obtain high
precision was evaluated. The language model was trained on uni-grams, bi-grams and
tri-grams, and experiments were made with four sizes of selection menus (1, 5, 7 and 9
proposals), each selection counted as one additional keystroke. According to the results, the
researchers state that syntactic knowledge did not improve keystroke savings and even
decreased them, contrary to the original hypothesis. The results show keystroke savings of
up to 29% with nine word proposals, 34% with seven proposals and 54% with a single
proposal. Contrary to other works, KSS improved as the size of the selection menu was
reduced; an increase in the number of proposals presumably affects search time. However,
the effect of the selection menu's size on KSS is not clear and no justification is given by the
researchers.
2.7. Word Prediction for Persian Language
Ghayoomi and Daroodi (Ghayoomi & Daroodi, 2006) studied word prediction for Persian
using three approaches. Persian is a member of the Indo-European language family and has
many features in common with its relatives in terms of morphology, syntax, phonology and
lexicon. The work is based on bi-gram, tri-gram and 4-gram models and utilized around 10
million tokens in the collected corpus. The first approach uses word statistics; the second
adds the main syntactic categories of a Persian POS-tagged corpus; and the third uses the
main syntactic categories along with their morphological, syntactic and semantic
subcategories. According to the researchers, evaluation shows 37%, 38.95%, and 42.45%
KSS for the first, second and third approaches respectively.
2.8. Word Prediction for Russian Language
Hunnicutt et al. (Hunnicutt et al., n.d.) carried out research on Russian word prediction with
morphological support as a co-operative project between two research groups in Tbilisi and
Stockholm. The work extends a word predictor developed by the Swedish partner for other
languages in order to make it suitable for Russian. Inclusion of a morphological component
was found necessary since Russian is much richer in morphological forms. To develop the
Russian language database, an extensive text corpus containing 2.3 million tokens was
collected. It provides inflectional categories and the resulting inflections for verbs, nouns
and adjectives, so the correct word forms can be presented in a consistent manner that allows
a user to easily choose the desired form. The researchers introduced special operations for
constructing word forms from a word's morphological components. Verbs are the most
complex word class, and an algorithm for expanding the root form of verbs to their
inflectional forms was implemented. The system suggests successful completion of verbs
and the remaining inflectable words.
2.9. Word Prediction for Sindhi Language
Mahar and Memon (Mahar & Memon, 2011) studied word prediction for Sindhi based on
bi-gram, tri-gram and 4-gram probabilistic models. Sindhi is morphologically rich and has
great similarity with Arabic, Persian, and Urdu. It is a highly homographic language, and
texts are written without diacritic symbols, which makes the word prediction task very
difficult. Since a corpus is essential for statistical language modelling, word frequencies in
this work are calculated from a corpus containing approximately 3 million tokens, and a
tokenization algorithm was developed to segment words. Add-one smoothing is used to
assign non-zero probabilities to events that would otherwise have zero probability. 15,000
sentences were randomly selected from the prepared corpora to evaluate the developed
models based on entropy and perplexity. According to the evaluation, the 4-gram model is
the most suitable since it has lower perplexity than the bi-gram and tri-gram models.
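Add-one (Laplace) smoothing and perplexity can be illustrated in a few lines of Python. The
toy tokens below are ours, and treating all unseen words as a single extra vocabulary entry is
a simplification made for the sketch.

import math
from collections import Counter

train = "aa bb aa cc aa bb".split()  # toy training tokens
counts = Counter(train)
N = len(train)                       # total tokens
V = len(counts) + 1                  # vocabulary size, +1 bucket for unseen words

def prob(word):
    # Add-one smoothing: unseen words still receive a non-zero probability.
    return (counts[word] + 1) / (N + V)

def perplexity(test_tokens):
    log_sum = sum(math.log2(prob(w)) for w in test_tokens)
    return 2 ** (-log_sum / len(test_tokens))

print(perplexity("aa bb dd".split()))  # lower perplexity = better model fit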
Summary
In this section, we have reviewed linguistic characteristics of Afaan Oromo such as
part-of-speech, morphology and stem/root. We saw that Afaan Oromo nouns are inflected
for number, gender and case; verbs are inflected for number, gender, tense, voice and aspect;
and adjectives are inflected for number and gender. In addition, we discussed different word
prediction approaches (statistical, knowledge-based, and heuristic) with the strengths and
weaknesses of each. Based on this review, Conditional Random Field based modelling was
adopted.
CHAPTER THREE
RELATED WORK
3. Word Prediction for Ethiopian Languages
3.1. For Amharic
Nesredin Suleiman and Solomon Atnafu (Chang, 2008) conducted research on word
prediction for Amharic online handwriting recognition. As the researchers state, the study is
motivated by the fact that the speed of data entry can be enhanced by integrating online
handwriting recognition with word prediction, mainly for handheld devices. The main target
of the work is to propose a word prediction model for Amharic online handwriting
recognition using statistical information such as the frequency of occurrence of words. A
corpus of 131,399 Amharic words and 17,137 names of persons and places was prepared
and used to extract statistical information such as the value of n for the n-gram model, the
average word length of Amharic, and the most frequently used Amharic word length. Based
on this information n was set to 2, so the research uses a bi-gram model in which the
intended word is predicted by looking at the first two characters. Finally, a prototype was
developed to evaluate the performance of the proposed model, and 81.39% prediction
accuracy was obtained in the experiment.
Alemebante Mulu and Goyal (Mulu et al., 2013) carried out research on an Amharic text
prediction system for mobile phones. They designed a text prediction model for Amharic
using a corpus of 1,193,719 Amharic words, 242,383 Amharic lexicons and a list of names
of persons and places with a total size of 20,170 entries. To show the validity of the word
prediction model and the algorithm designed, a prototype was developed. The Amharic text
prediction system describes the data entry techniques used to enter data into mobile devices
such as a smartphone; data entry can be either predictive or non-predictive. In the predictive
mode, once the first two characters are written, all predicted words are listed based on word
frequency, falling back to alphabetical order when frequencies are equal. An experiment
against the database (lexicon) was conducted to measure the accuracy of the Amharic text
prediction engine, and a prediction accuracy of 91.79% was achieved.
Tigist Tensou (Tessema, 2014) conducted research on word sequence prediction for
Amharic. In this work, an Amharic word sequence prediction model is developed using
statistical methods and linguistic rules. Statistical models are constructed for roots/stems,
and morphological properties of words such as aspect, voice, tense, and affixes are modelled
using the training corpus. Morphological features like gender, number, and person are
captured from a user's input to ensure grammatical agreement among words. Initially, root
or stem words are suggested using root/stem statistical models; then morphological features
for the suggested roots/stems are predicted using voice, tense, aspect and affix statistics
together with the grammatical agreement rules of the language. Predicting morphological
features is essential in Amharic because of its high morphological complexity; this step is
not required in less inflected languages, where all word forms can feasibly be stored in a
dictionary. Finally, surface words are generated from the proposed root or stem words and
morphological features. Word sequence prediction using a hybrid of bi-gram and tri-gram
models offers better keystroke savings in all scenarios of their experiment: for instance,
when using test data disjoint from the training corpus, 20.5%, 17.4% and 13.1% keystroke
savings are obtained with the hybrid, tri-gram and bi-gram models respectively. Evaluation
of the model is performed using the developed prototype with keystroke savings (KSS) as
the metric. According to the experiment, the hybrid of bi-gram and tri-gram models achieves
higher KSS than either model alone. Therefore, statistical and linguistic rules have quite
good potential for word sequence prediction in Amharic.
3.2. For Afaan Oromo
Gudisa Tesema (Tesema & Abate, 2013) made the first attempt to design and develop word
prediction for Afaan Oromo on mobile phones. The study treated word prediction as a
classification task: using SVM, a bag of words was created for words of the same class, and
a predictor model was constructed using HMM for the prediction itself. The major focus of
the work was classifying words into classes without considering morphological information.
Wakshum Temesgen (Wakshum, 2017) built a language model from stem forms together
with morphological features such as tense, case, person, gender and number. The model
suggests the next word to be typed by a user in three phases: first, the most probable stem
forms are predicted using the language model; second, morphological features are predicted
for the proposed stem forms; and lastly, the proposed root or stem words and morphological
features are passed to a morphological synthesizer to generate appropriate surface words.
Ashenafi Bekele (Delbeto, 2018b) conducted research to develop Afaan Oromo word
sequence prediction using an n-gram model. The designed model was evaluated with a
developed prototype, using keystroke saving (KSS) to measure system performance.
According to the evaluation, the primary word-based statistical system achieved 20.5%
KSS, and the second system, which combined syntactic categories with word statistics,
achieved 22.5% KSS. Therefore, statistical and linguistic rules have good potential for word
sequence prediction in Afaan Oromo.
3.3. Summary
In this chapter, we have discussed work related to word sequence prediction for different
languages. A word completion study specifically targeting online handwriting recognition of
Amharic, done using a pure frequency-based method, was also presented; this approach is
very challenging for inflected languages because of the large number of possible word
forms. Therefore, this research aims to fill the gap in existing work so that words can be
proposed in the correct morphological form by considering context information,
morphological features, and linguistic rules. The user interface, prediction module, and
linguistic resources are the main components of word prediction systems, where the
linguistic resource embraces statistical or other information depending on the target
language. From the reviewed works we also learnt that considering only the frequency of
words is not enough for inflected languages; that root or stem words, POS, and
morphological features can be treated separately; that incorporating context information
increases the effectiveness of the prediction output; and that CRF models have good capacity
to capture context information.
CHAPTER FOUR
WORD SEQUENCE PREDICTION MODEL FOR
AFAAN OROMO LANGUAGE
This chapter presents details of the Afaan Oromo Word Sequence Prediction Model. The
architecture of the proposed model and its components, with their respective algorithms, are
described. A CRF statistical language model is applied to offer the most expected root or
stem words, and morphological features such as aspect, case, tense, and voice are used to
inflect the proposed root or stem words into the appropriate word form. The Afaan Oromo
Word Sequence Predictor accepts a user's input, extracts the root or stem word and required
features by analysing that input, proposes the most likely root or stem words with their most
probable features, and finally generates surface words using the proposed roots or stems and
features.
4.1. Architecture of Afaan Oromo WSP Model
The model shown in Figure 4.1 is designed to predict the words a user intends to type by
considering the previous history of words. It has two major parts: constructing the language
model, and generating predicted words. First the training corpus is morphologically
analysed using a morphological analyser; using the morphologically analysed corpus, we
then build a tagged training corpus.
Next, language models such as root word sequences and root words with features are built
from the tagged training corpus. Morphological analysis of user input, word sequence
prediction, and morphological generation are the key components of the generation part: a
user's input is accepted and analysed using the morphological analyser, the roots and
morphological features of the words are extracted so that the word prediction component
can propose words by interacting with the language model, and finally the morphological
generator produces surface words from the proposed roots and features.
Figure 4.1: Architecture of AOWSP Model
4.2. Morphological Analysis of Corpus
This module analyses the words in the training data to identify their root or stem forms and
component morphemes, so that the required features and root or stem words can be extracted
to build a tagged corpus; this tagged corpus is then used to construct the CRF models. A
corpus is a large collection of written or spoken material in machine-readable form which
can be employed in linguistic analysis, and it is the main knowledge base. Language models
built from large corpora tend to perform better, particularly for infrequent words, and the
word prediction task requires a large corpus in order to have sufficient statistical information
for training the system. In this study, a text collection containing 226,252 words was
gathered from newspapers and social media. Morphological analysis is the process of
assigning each word found in a corpus to its morphemes, which can be morphological
features, root/stem, or POS; it is useful for annotating words with their root forms and other
required morphological information.
A morphological analyser is a program used to analyse a word, or the words in a file, into
their component forms. Afaan Oromo is a morphologically rich language, as described in
previous chapters; a verb lexeme can appear in more than 100,000 word forms (Bickel et al.,
2005), and it is impractical to store all forms of words in probabilistic models. For this
reason, the training corpus is pre-processed to hold only the root or stem and selected
morphological features of words. Features are selected by studying the structure of Afaan
Oromo words and the way varieties of words are produced from a base word.
From the morphologically analysed training corpus, a tagged corpus consisting only of the
root or stem form, affixes, aspect, voice and tense is constructed. Words that cannot be
analysed by the morphological analyser are kept as they are, to preserve the consistency of
the root or stem word sequences.
Algorithm 4.1: Building a Tagged Corpus
BEGIN
INPUT training-corpus
ANALYZE training-corpus using Morphological Analyzer and WRITE the result to analyzed-corpus
INITIALIZE keywords for prefix, rootWord, suffix, aspect, case, tense, voice and newWord
INITIALIZE prefix, root, suffix, aspect, case, tense and voice values to 0; newWord and newWord2 to FALSE
READ morphologically-analyzed-corpus
FOR each line in morphologically-analyzed-corpus:
    ADD each word in the line to a list
    FOR each word in the list
        IF word is a newWord keyword and newWord2 is FALSE
            SET newWord to TRUE
        ELSE IF newWord is TRUE
            newWord = FALSE
            newWord2 = TRUE
            rootWord = word
        ELSE IF newWord2 is TRUE and word is a prefix keyword:
            prefix = word
        ELSE IF newWord2 is TRUE and word is a suffix keyword:
            suffix = word
        ELSE IF newWord2 is TRUE and word is an aspect keyword:
            aspect = word
        ELSE IF newWord2 is TRUE and word is a voice keyword:
            voice = word
        ELSE IF newWord2 is TRUE and word is a case keyword:
            case = word
        ELSE IF newWord2 is TRUE and word is a tense keyword:
            tense = word
        ELSE IF word is a newWord keyword and newWord2 is TRUE
            WRITE (rootWord+'^'+prefix+'^'+suffix+'^'+case+'^'+tense+'^'+aspect+'^'+voice) to tagged-training-corpus
            SET newWord2 to FALSE and newWord to TRUE
OUTPUT tagged-training-corpus
END
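A hedged Python rendering of Algorithm 4.1 is sketched below. The input format (one
analysed word per line, given as "feature=value" pairs) and the example values are
assumptions made for illustration; the actual analyser output used in this thesis may differ.

FIELDS = ["root", "prefix", "suffix", "case", "tense", "aspect", "voice"]

def build_tagged_corpus(analysed_lines, out_path="tagged-training-corpus.txt"):
    """Write one '^'-delimited record per analysed word; '0' marks an absent feature."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in analysed_lines:          # one analysed word per line (assumed)
            record = dict.fromkeys(FIELDS, "0")
            for token in line.split():       # assumed "feature=value" pairs
                key, _, value = token.partition("=")
                if key in record:
                    record[key] = value
            out.write("^".join(record[f] for f in FIELDS) + "\n")

build_tagged_corpus(["root=deem suffix=e tense=perfective",
                     "root=nama case=nominative"])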
4.3. Building CRF Model
A language model is a store of statistical information which serves as a knowledge base
when predicting suitable words. The word sequence prediction task is accomplished in two
phases: in the first, root or stem forms of words are suggested using the root/stem CRF
models; in the second, the morphological features of the proposed roots or stems are
predicted using statistical methods. The proposed root or stem words and features are used
later when generating the appropriate surface words. Building the CRF model is therefore
one of the main components of our word sequence prediction model. Statistical models of
root or stem word sequences and of morphological features are constructed using the tagged
corpus.
4.3.1. Root or Stem Words with Aspect
A Conditional Random Field (CRF) model of root or stem words with their respective
aspect is constructed by extracting and counting occurrences of each unique root or stem
word with its aspect. This model stores the frequency of each root word with its aspect; the
aspect of a verb can be simplex, reciprocal, or iterative. The most frequent aspect for a
particular root or stem word is used when producing surface words. Algorithm 4.2 describes
the procedure for constructing the root and aspect CRF model.
Algorithm 4.2: Constructing the Stem and Aspect CRF Model
BEGIN
INPUT tagged-training-corpus
FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and aspect using the items at index '0' and '4' of the list
    WRITE root-aspect-sequence to a file
READ root-aspect-sequence file
FOR each root-aspect-sequence in the file
    ASSIGN frequency = 0
    IF root-aspect-sequence is new
        COUNT root-aspect-sequence and ASSIGN the count to frequency
        WRITE root-aspect-sequence and frequency to a file
OUTPUT root-with-aspect CRF model
END
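The counting step of Algorithm 4.2 amounts to building a frequency table of (root, aspect)
pairs; a minimal Python sketch under that reading follows. The indices match those stated in
the algorithm, and the same pattern, with different indices, yields the voice, prefix and tense
models of the following subsections.

from collections import Counter

def build_root_aspect_model(tagged_path):
    """Count (root, aspect) pairs from the '^'-delimited tagged corpus."""
    pairs = Counter()
    with open(tagged_path, encoding="utf-8") as f:
        for word in f.read().split():
            items = word.split("^")
            pairs[(items[0], items[4])] += 1  # indices '0' and '4', as in Algorithm 4.2
    return pairs

# model = build_root_aspect_model("tagged-training-corpus.txt")
# The most frequent aspect per root later drives surface-word generation.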
4.3.2 Root or Stem Words with Voice
Unique occurrences of root or stem words with their respective voice are counted from the
training corpus to build the root/stem and voice CRF model. This model stores the frequency
of each root or stem word with its respective voice; the voice can be simplex, transitive, or
passive. The most frequent voice for a particular root or stem word is used when suggesting
the most probable features for that root or stem.
Algorithm 4.3: Constructing the Stem and Voice CRF Model
BEGIN
INPUT tagged-training-corpus
FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and voice using the items at index '0' and '5' of the list
    WRITE root-voice-sequence to a file
READ root-voice-sequence file
FOR each root-voice-sequence in the file
    ASSIGN frequency = 0
    IF root-voice-sequence is new
        COUNT root-voice-sequence and ASSIGN the count to frequency
        WRITE root-voice-sequence and frequency to a file
OUTPUT root-with-voice CRF model
END
4.3.3 Root or Stem Words with Prefix
CRF statistical information is built for three consecutive root or stem word sequences, where
the last root or stem word is taken together with its prefix. This model stores the frequency
of successive root or stem words with a prefix, and the information is used to predict the
most probable prefix for suggested root or stem words so as to produce suitable surface
words. Algorithm 4.4 shows the procedure for constructing the root/stem and prefix CRF
model.
Algorithm 4.4: Constructing the Stem with Prefix CRF Model
BEGIN
INPUT tagged-training-corpus
FOR each sentence in tagged-training-corpus
    ADD each word in the sentence to a list, words
    FOR i in RANGE 0 to length of words - 3
        WRITE (words[i][0], words[i+1][0], words[i+2][0], words[i+2][1]) to root-prefix-sequence // index '0' is the root word and index '1' is the prefix
READ root-prefix-sequence file
FOR each root-prefix-sequence in the file
    ASSIGN frequency = 0
    IF root-prefix-sequence is new
        COUNT root-prefix-sequence and ASSIGN the count to frequency
        WRITE root-prefix-sequence and frequency to a file
OUTPUT root-with-prefix CRF model
END
4.3.4 Root or Stem Words with Prefix and Suffix
The frequency of each root or stem word with its respective prefix and suffix is identified
and kept in its own repository. Based on this information, the most likely suffix for a given
root or stem and prefix is predicted; the proposed suffix is used by the morphological
generator when producing surface words. Algorithm 4.5 describes the procedure for
constructing this model.
Algorithm 4.5: Constructing the Stem with Prefix and Suffix CRF Model
BEGIN
INPUT tagged-training-corpus
FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root, prefix and suffix using the items at index '0', '1' and '2' of the list
    WRITE root-prefix-suffix-sequence to a file
READ root-prefix-suffix-sequence file
FOR each root-prefix-suffix-sequence in the file
    ASSIGN frequency = 0
    IF root-prefix-suffix-sequence is new
        COUNT root-prefix-suffix-sequence and ASSIGN the count to frequency
        WRITE root-prefix-suffix-sequence and frequency to a file
OUTPUT root-with-prefix-and-suffix CRF model
END
4.3.5 Stem Words with Tense
In Afaan Oromo, verbs are morphologically the most complex word class, with many
inflectional forms, and numerous words of other POS are derived primarily from verbs.
Generating syntactically and semantically correct sentences requires the appropriate choice
among the different verb forms. Moreover, a verb has long-distance dependencies with its
subject. For instance, in the simple sentence "inni kaleessa dhufee", "inni" is the subject and
conveys grammatical information such as third person and singular; "kaleessa" is an adverb
with limited derivational forms; and "dhufee" is a verb with a number of inflectional forms
and contextual information representing tense, voice, case, number and gender that must
agree with the subject. Furthermore, Afaan Oromo sentences are usually written in SOV
order, so to predict the next word, especially a verb, we need to consider the subject of the
sentence. Thus, to capture the relationship between a stem form and the suffix indicating
tense, we constructed a CRF model of stem forms with tense, recording the frequency of
each stem word with its respective tense. Perfective, imperfective, gerundive, and
imperative/jussive are the possible tense categories. Based on this information, the most
likely tense-indicating suffix for a given stem will be predicted.
Algorithm 4.6: Constructing the Stem and Tense CRF Model
BEGIN
INPUT tagged-training-corpus
FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and tense using the items at index '0' and '3' of the list
    WRITE root-tense-sequence to a file
READ root-tense-sequence file
FOR each root-tense-sequence in the file
    ASSIGN frequency = 0
    IF root-tense-sequence is new
        COUNT root-tense-sequence and ASSIGN the count to frequency
        WRITE root-tense-sequence and frequency to a file
OUTPUT root-with-tense CRF model
END
4.4. Morphological Analysis of User Input
This module analyses the words accepted from a user and extracts the required
morphological features. Context information such as tense and aspect is captured from the
user's input to predict appropriate morphological features for the coming stem word.
Morphological analysis is applied to each word inserted by the user, and the resulting
morphological features, POS, affixes, and stem form are stored in a file.
4.5. Word Sequence Prediction
The prediction module predicts the most probable stem words and their morphological
features using the language models. It has two components: the stem form predictor and the
morphological feature predictor. The stem form predictor estimates the most probable stem
forms by computing the probability of the user's input word or words in the stem word CRF
model; rather than conditioning only on the single previous word, the CRF model predicts
the next stem word by considering the preceding words more generally.
Finally, the stem form predictor produces a list of the most probable proposed stem forms.
The second component, the morphological feature predictor, predicts the probable
morphological features for the list of stem forms produced by the stem form predictor: each
stem form in the prediction list is checked for its most frequent morphological features in the
language model. Finally, the most frequent morphological features are represented in a form
that the morphological generator can understand.
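The two-phase flow just described can be sketched as follows. Here stem_model and
feature_model stand for the frequency tables of Section 4.3, and the two-stem context
window is an assumption made for illustration, not a detail fixed by the thesis.

def predict(prev_stems, stem_model, feature_model, k=5):
    """Phase 1: rank candidate stems given preceding stems.
    Phase 2: attach each candidate's most frequent morphological features."""
    context = tuple(prev_stems[-2:])          # last two stems as context (assumed)
    candidates = stem_model.get(context, {})  # {next_stem: frequency}
    top_stems = sorted(candidates, key=candidates.get, reverse=True)[:k]
    proposals = []
    for stem in top_stems:
        feats = feature_model.get(stem, {})   # {feature_tuple: frequency}
        best = max(feats, key=feats.get) if feats else None
        proposals.append((stem, best))        # stem plus its likeliest features
    return proposals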
4.6. Surface Words
A surface word is a morphologically suitable word that the user intends to type. Surface
words are offered to the user using the proposed root or stem words, POS, affixes, and the
morphological features obtained as described earlier. Algorithm 4.7 presents the procedure
for producing appropriate surface words.
Algorithm 4.7: Producing Appropriate Surface Words
BEGIN
INPUT proposed-root-words list
READ proposed-affix-features
READ proposed-morphological-features list
FOR each word in proposed-root-words list
    IF word is a proposed root or stem word
        CALCULATE features using the proposed affixes, morphological features and POS
        GENERATE surface-word given root-word and features
        ADD generated-word to proposed-surface-words list
OUTPUT proposed-surface-words list of likely next words
END
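As a final illustration, the sketch below generates surface forms by plain concatenation of
the proposed root with its predicted affixes. A real generator would apply Afaan Oromo
morphophonemic rules; simple concatenation and the example inputs are assumptions made
for illustration only.

def generate_surface(root, prefix="", suffix=""):
    """Naive surface form: prefix + root + suffix (no morphophonemic rules)."""
    return f"{prefix}{root}{suffix}"

def propose_surface_words(proposals):
    # proposals: list of (root, (prefix, suffix)) pairs from the predictor
    return [generate_surface(r, p, s) for r, (p, s) in proposals]

print(propose_surface_words([("deem", ("", "e")), ("dhuf", ("", "an"))]))
# e.g. ['deeme', 'dhufan']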
RASHMI M G
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 

Recently uploaded (20)

Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Mudde & Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...
Mudde &  Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...Mudde &  Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...
Mudde & Rovira Kaltwasser. - Populism in Europe and the Americas - Threat Or...
 
Anemia_ types_clinical significance.pptx
Anemia_ types_clinical significance.pptxAnemia_ types_clinical significance.pptx
Anemia_ types_clinical significance.pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 

Word Sequence Prediction for Afaan Oromo Using CRF

  • 4. Acknowledgements Above all, special thanks to God Almighty for the gift of life and His unending provision and protection. First, I would like to express my deepest appreciation to my advisor Dr. Getachew Mamo and co-advisor Mr. Misganu Tuse for their immediate response in helping me and giving me direction, and for their willingness to provide continuous support and professional guidance throughout my thesis work. I would also like to thank my brother Mr. Alebachew Gemechu, P/Tefera Gonfa, EV/Fayera Adisu, EV/Fekede Bedasa, Mr. Itana Fiqadu and all other friends for their support and companionship. Next to God, my deepest gratitude goes to my lovely wife Kume Dinsa for her memorable support and inspiration that kept me going all the way, and to my family, who were always motivating and supporting me to work hard. I would like to recognize the people who assisted me throughout the work, because without them I would not have come this far. MAY THE LORD REWARD YOU ABUNDANTLY
  • 5. Abstract Word prediction is a popular machine learning task that consists of predicting the next word in a sequence of words. The literature shows that word sequence prediction can play a great role in real-life applications, including electronic data entry. Word prediction deals with guessing which word comes next, based on the current information, and it is the main focus of this study. Even though Afaan Oromo is used by a large population, little work has been done on word sequence prediction for it. Previous work on word prediction shows that statistical methods alone are not sufficient for a highly inflected language and that syntactic information is needed. In this study, we developed an Afaan Oromo word sequence prediction model, following the design science research methodology, with a statistical method using Conditional Random Fields. We used 225,352 words and 150,000 phrases to train the model, incorporating detailed part-of-speech tags, stem/root forms and morphological features. The experiments ran the CRF model with window sizes of three, five and seven. We examined the efficacy of stem/root words, morphological features and part-of-speech tags in Afaan Oromo word sequence prediction. Evaluation was performed on the developed models using keystroke savings (KSS) as the metric. According to our tests, prediction with a CRF model that uses detailed part-of-speech tags achieves higher KSS and performs slightly better than models without part-of-speech tags. Therefore, a statistical approach with detailed POS tags and a window size of seven has good potential for word sequence prediction for the Afaan Oromo language. Keywords: Word sequence prediction, Stem/Root, Parts of Speech, CRF
  • 6. Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Algorithms
List of Tables
List of Figures
List of Appendices
Acronyms and Abbreviations
CHAPTER ONE
1. INTRODUCTION
1.1 Background of the Study
1.2 Statement of the Problem
1.3 Research Questions
1.4 Objectives
1.4.1 General Objectives
1.4.2 Specific Objectives
1.5 Scope and Limitation of the Study
1.6 Significance of the Study
1.7 Methodology
1.7.1 Introduction
1.7.2 Literature Review
1.7.3 Data Collection
1.7.4 Data Preparation
1.7.5 Development Techniques
1.7.5.1 Feature Set
1.7.5.2 Conditional Random Fields (CRF)
1.7.6 Training the CRF
1.7.7 Testing
1.7.8 Development Tools
1.7.9 Prototype Development
1.7.10 Evaluation
CHAPTER TWO: LITERATURE REVIEW
2.1 Natural Language Processing
2.2 Word Prediction
2.3 Historical Background
2.4 Approaches to Word Sequence Prediction
2.4.1 Statistical Word Prediction
2.4.2 Knowledge-Based Word Prediction
2.4.2.1 Syntactic Knowledge for Word Prediction
2.4.2.2 Semantic Prediction
2.4.2.3 Pragmatics Prediction
2.4.3 Heuristic Word Prediction
2.5 Word Prediction for Western Languages
2.5.1 Word Prediction for English
2.6 Word Prediction for Hebrew Language
2.7 Word Prediction for Russian Language
CHAPTER THREE: RELATED WORK
3. Word Prediction for Ethiopian Languages
3.1 For Amharic
3.2 For Afaan Oromo
3.3 Summary
CHAPTER FOUR: WORD SEQUENCE PREDICTION MODEL FOR AFAAN OROMO LANGUAGE
4.1 Architecture of Afaan Oromo WSP Model
4.2 Morphological Analysis of Corpus
4.3 Building CRF Model
4.3.1 Root or Stem Words with Aspect
4.3.2 Root or Stem Words with Voice
4.3.3 Root or Stem Words with Prefix
4.3.4 Root or Stem Words with Prefix and Suffix
4.3.5 Stem Words with Tense
4.4 Morphological Analysis of User Input
4.6 Surface Words
CHAPTER FIVE: EVALUATION & DISCUSSION
5.1 Introduction
5.2 Corpus Collection and Preparation
5.3 Implementation
5.4 Testing
5.4.1 Test Environments
5.4.2 Testing Procedure
5.5 Evaluation and Testing
5.5.1 Keystroke Savings
5.5.2 Test Results
5.6 Discussion
CHAPTER SIX: CONCLUSIONS & RECOMMENDATIONS
6.1 Conclusions
6.2 Recommendations
References
Appendix
ANNEXES
  • 9. List of Algorithms
Algorithm 4.2: Build a tagged corpus
Algorithm 4.3: Construct a stem-and-aspect CRF model
Algorithm 4.4: Construct a CRF model for stem and voice
Algorithm 4.5: Construct a stem-with-prefix CRF model
Algorithm 4.6: Construct a CRF model of stem forms with tense
Algorithm 4.7: Produce appropriate surface words
  • 10. List of Tables
Table 5.1: Test result 1
Table 5.2: Test result 2
Table 5.3: Test result 3
Table 5.4: Test result 4
Table 5.5: Test result 5
  • 11. List of Figures
Figure 4.1: Architecture of AOWSP Model
Figure 5.1: User Interface of Word Sequence Prediction
  • 12. List of Appendices
Appendix A: Sample corpus used for training
Appendix B: POST used in the AOWSP model
Appendix D: Demonstration of the word-based model with Z=3
Appendix E: Demonstration of the word-based model with Z=6
Appendix F: Sample of morphological analysis
Annexes
Annex 1: List of suffixes with their probability
  • 13. Acronyms and Abbreviations
AO: Afaan Oromo
AONTs: Afaan Oromo News Texts
AOWSPS: Afaan Oromo Word Sequence Prediction System
AAC: Augmentative and Alternative Communication
WSP: Word Sequence Prediction
WP: Word Prediction
CRF_WSPAO: Conditional Random Field Word Sequence Prediction for Afaan Oromo
CRF: Conditional Random Field
GB: Gigabyte
HMM: Hidden Markov Model
ML: Machine Learning
NL: Natural Language
NLP: Natural Language Processing
POS: Parts of Speech
DVD-RW: Rewritable Digital Versatile Disc
USB: Universal Serial Bus
KSS: Keystroke Savings
KS: Keystroke
POSBPM: POS-Based Prediction Model
POST: Part-of-Speech Tag
  • 14. CHAPTER ONE 1. INTRODUCTION 1.1 Background of the Study Natural language processing (NLP) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications (Wilks, 1996). In NLP, the contribution of the computer science discipline is to develop algorithms that specify the internal representation of data and define the way the data can be efficiently processed in a computer (Bose, 2004). The idea of giving computers the ability to process human language is as old as the idea of computers themselves. Research in natural language processing has been going on for several decades, dating back to the late 1940s (Wilks, 1996). In those times it was a difficult, though not impossible, task, for it relied on human-made rules. Preparing rules requires the extensive involvement of talented linguistics experts and is time-consuming work. In recent years, research in the area of NLP has increased rapidly. These attributes shifted NLP from rule-based approaches to machine-learning approaches. The existence of machine-learning approaches resulted in an open and comfortable environment that encourages the development of different NLP-based systems. So NLP has become an applied, rather than a theoretical, science. A word prediction system facilitates the typing of text for users with physical or cognitive disabilities. As the user enters each letter of a word, the system displays a list of the most likely completions of the partially typed word. As the user continues typing more letters, the system updates the suggestion list accordingly. If the required word is in the list, the user can select it with a single keystroke. Then the system tries to predict the next word: it displays a list of suggestions to the user, who can select the next intended word if it appears in the list. Otherwise, the user can enter the first letter of the next word to restrict the suggestions. The process continues until the completion of the text (Ghayoomi & Daroodi, 2006). For someone with physical disabilities, each keystroke is an effort; as a result, the prediction system saves the user's energy by reducing his or her physical effort. Additionally, the system
  • 15. assists the user in the composition of well-formed text, both qualitatively and quantitatively. Moreover, the system helps to increase the user's concentration. As human needs increase, technology develops alongside them, and people have adopted data entry techniques. One of the most prevalent and necessary techniques used as an interface between human and machine is the data entry technique. It is used to enter different kinds of data, such as text, voice, images and movies, into the machine for processing. Since word prediction facilitates data entry, the most commonly used data entry techniques are presented in this section. These include speech, chorded keyboards, handwriting recognition, various glove techniques, scanners, microphones and digital cameras. Keyboards and pointing devices are the most commonly used devices in human-computer interaction. Because of their ease of implementation, higher speed and lower error rate, keyboards have dominated text entry systems. Prediction can be either character or word prediction, and prediction means forecasting what comes next based on some current information. Word prediction aims at easing word insertion in textual software by guessing the next word, or by giving the user a list of possible options for the next word. The term natural language processing encompasses a broad set of techniques for the automated generation, manipulation and analysis of natural or human languages. Although most NLP techniques inherit largely from linguistics and artificial intelligence, they are also influenced by relatively newer areas such as machine learning, computational statistics and cognitive science. According to Shannon, human languages are highly redundant, and these redundancies can be captured in language models. The goal of language modelling is to capture and exploit the restrictions imposed on the way in which words can be combined to form sentences; it describes how words are arranged in natural language. Language modelling has many applications in natural language processing, such as automatic speech recognition, statistical machine translation, text summarization, and character and handwriting recognition. Word prediction, a natural language processing problem that attempts to predict the correct and most appropriate word in a given context, utilizes language modelling
  • 16. applications to guess the next word given the previous words. We learn the characteristics of the language from these studies, which provide hints on how to design the system. In general, word prediction deals with predicting the correct word in a sentence; it saves time and keystrokes and also reduces misspelling. Afaan Oromo is one of the major languages widely spoken and used in Ethiopia. Currently, it is an official language of the Oromia regional state. It is used by the Oromo people, who are the largest ethnic group in Ethiopia, amounting to 34.5% of the total population according to the 2007 census. In addition, the language is also spoken in Kenya and Somalia. With regard to the writing system, Qubee (a Latin-based alphabet) has been adopted and has been the official script of Afaan Oromo since 1991. Besides being an official working language of the Oromia regional state, Afaan Oromo is the instructional medium for primary and junior secondary schools throughout the region and its administrative zones. Thus, the language has a well-established and standardized written and spoken system (Tesfaye, 2011). In addition, this study carries out word prediction using Conditional Random Fields (CRFs). CRFs are a class of statistical modelling methods generally applied in machine learning and pattern recognition for structured prediction. They extend both Maximum Entropy Models (MEMs) and Hidden Markov Models (HMMs) and were first introduced by Lafferty et al. (2001). Whereas an ordinary classifier predicts a label for a single sample without regard to adjacent samples, a CRF can take context into account. It is a discriminative, undirected probabilistic graphical model, used to encode known relationships between observations and to construct consistent interpretations. Moreover, the language is offered as a subject from grade one throughout the schools of the region. A few literary works, a number of newspapers, magazines, educational resources, official credentials and religious documents are published and available in the language (Wakshum Tesgen, 2017).
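To make the next-word-prediction idea above concrete, the following minimal sketch estimates bigram probabilities from a toy corpus and ranks candidate next words; the two invented Afaan Oromo sentences and the whitespace tokenization are illustrative assumptions only, not the approach or corpus of this thesis.

from collections import Counter, defaultdict

# Toy corpus standing in for Afaan Oromo news text (invented sentences).
corpus = ["barattoonni kitaaba dubbisu", "barattoonni barnoota jaallatu"]

bigrams = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1

def next_word_candidates(prev, k=3):
    # Rank candidates by P(w | prev) = count(prev, w) / count(prev, *).
    total = sum(bigrams[prev].values())
    if total == 0:
        return []
    return [(w, c / total) for w, c in bigrams[prev].most_common(k)]

print(next_word_candidates("barattoonni"))  # [('kitaaba', 0.5), ('barnoota', 0.5)]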
  • 17. Motivation There are various word prediction software packages that assist users with text entry. Swedish (Gustavii & Pettersson, 2003), English (Agarwal, n.d.), Italian (Aliprandi et al., 2007) and Persian (Mansouri et al., 2014) are some of the word prediction studies conducted lately. These studies contribute to reducing the time and effort needed to write a text for slow typists, or for people who are not able to use a conventional keyboard. Afaan Oromo uses Qubee (a Latin-based script) as its writing system; people who write in Qubee face typing difficulties, and Qubee spellings use many characters compared to English, which slows down the typing process. To the researcher's knowledge, there have been few research attempts on word sequence prediction for Afaan Oromo. Hence, we are motivated to work on word sequence prediction. In addition, in Ethiopia the use of computers and other electronic devices is growing from day to day. However, most software programs used with these devices are in English, while a great number of people in Ethiopia communicate in the Afaan Oromo language. With this in mind, an alternative or assistive Afaan Oromo text entry system is useful to speed up text entry and to help those who need alternative communication. Hence, this study focuses on word sequence prediction to address this issue. 1.2 Statement of the Problem Word prediction is one of the most widely used techniques to enhance communication rate in augmentative and alternative communication. A number of word prediction software packages exist for different languages to assist users with text entry. Amharic, Swedish, English, Italian, Persian and Bangla are among the languages for which word prediction studies have been conducted lately (Delbeto, 2018b). Word sequence prediction is a challenging task for inflected languages. Inflected languages are morphologically rich and have an enormous number of word forms, i.e., one word can have many different forms. As Afaan Oromo is a highly inflected and morphologically rich language, it shares this problem, which makes word prediction much more difficult and results in poor performance. Many studies have been conducted to integrate the Afaan Oromo language with, and make it a beneficiary of, this technology. Among them, attempts were made on automatic sentence parsing, part-
  • 18. of-speech tagging, morphology-based spell checking and rule-based Afaan Oromo grammar checking in the area of NLP (Wakshum, 2017). The overall goal of this study is to enable computers to perform useful tasks involving the Afaan Oromo language. Word prediction research conducted so far in Ethiopia is specific to certain languages, such as Amharic and Afaan Oromo. Nevertheless, as far as the present researcher is aware, studies on word prediction in Ethiopia are few. Gudisa Tesema (Tesema & Abate, 2013) made the first attempt to design and develop word prediction on mobile phones. In his study, he framed word prediction as a classification task: using an SVM, a bag of words was created for words of the same class, and a predictor model was constructed using an HMM for the prediction task. The major focus of that work was to classify words into classes without considering morphological information. Since Afaan Oromo is rich in inflectional morphological features, this issue must be considered; the earlier work took context information into account but not morphology, so the present researcher fills this gap by including morphological features of the Afaan Oromo language, using methods such as CRF, which is a popular model for such tasks. Wakshum Temesgen (Wakshum, 2017) recommended that the language model be built from stem forms and morphological features such as tense, case, person, gender and number. However, these morphological features are not sufficient under N-gram techniques, and the N-gram techniques did not achieve high accuracy with respect to Afaan Oromo POS. Considering this problem, the present researcher fills this gap by applying CRF techniques to Afaan Oromo POS tags and stem/root words. An N-gram model cannot take large context information into account, whereas the CRF technique used here can. Ashenafi Bekele (Delbeto, 2018b) conducted research to develop Afaan Oromo word sequence prediction using an n-gram model; the system was developed based on word frequency and syntax statistics. To fill this gap, the present researcher considers the recurrence of words with syntactic and semantic methods, along with the highest frequencies, to make more precise predictions using the CRF approach. Another gap is that the model was tested on a
  • 19. corpus that was too small in size, which led to wrong word sequence prediction output. To address this gap, the researcher applies CRF techniques to the Afaan Oromo word sequence prediction system; these techniques are tractable and flexible for developing a word sequence prediction system for any language. The researcher also increases the size of the corpus to obtain correct word sequence prediction output. In general, among machine learning techniques, the researchers have not seen any study that develops a WSP model using CRF for Afaan Oromo, and this method has not been explored to its full potential for an AOWSP system. The purpose of this research is to design and develop an Afaan Oromo word sequence prediction model and to check its performance in predicting next words from AONTs, taking context information, morphological features and linguistic features, including multi-words, into consideration. 1.3 Research Questions Based on the problem statement given above, this study attempts to answer the following basic research questions:  What technique and specific method should be applied to give efficient performance for Afaan Oromo word sequence prediction?  How can CRF enable predicting word sequences from an AONT corpus?  What is the performance of the word sequence prediction model in keystroke saving? Thus, the current study focuses on how to implement prediction of word sequences from an Afaan Oromo news text corpus, aiming to answer the above questions. 1.4 Objectives The objectives of the study are explained separately as the general objective and specific objectives. 1.4.1 General Objective The general objective of the study is to develop word sequence prediction for Afaan Oromo using a Conditional Random Field approach.
  • 20. 1.4.2 Specific Objectives The specific objectives of the proposed research are to: i. review the literature on word sequence prediction methods and the structure of the target language; ii. collect a representative corpus for training and testing the model; iii. morphologically analyze the training corpus; iv. design and develop an Afaan Oromo word sequence prediction model; v. test the performance of the developed system with different parameters; vi. draw conclusions from the test results; and vii. recommend areas for further research. 1.5 Scope and Limitation of the Study The aim of the study is to develop a word sequence prediction system for Afaan Oromo based on Conditional Random Fields. We developed the Afaan Oromo word sequence prediction model using a statistical approach incorporating syntactic information, which in our case consists of stem/root words and part-of-speech tags. We tried to show the importance of part-of-speech tags for obtaining better predictions. The corpus used had an impact on our keystroke savings, since CRF needs a large amount of data for training. The focus of this study is only on achieving an efficient word sequence prediction model for the Afaan Oromo language, including the prediction method and the prototype user interface. We did not evaluate with other metrics because of time limitations. 1.6 Significance of the Study The beneficiaries of this study include researchers who are, or want to be, involved in increasing the capability of computer processing for Afaan Oromo. In particular, this study benefits researchers devoted to Afaan Oromo predictive text entry projects, because the results provide them with concrete concepts on the aspects to be considered to improve keystroke savings and reduce the cognitive load of writing. The study can also serve as review material for other researchers.
  • 21. 1.7 Methodology 1.7.1 Introduction This section describes the overall process of the study, carried out sequentially to answer the proposed research problem effectively and efficiently, accompanied by appropriate performance evaluations. Each step of the study is explained separately as follows. To achieve the objectives of the research, we used a number of methods. 1.7.2 Literature Review Research and related works were thoroughly reviewed to gain the firm knowledge needed to develop an appropriate word sequence prediction model for Afaan Oromo. In addition, discussions were held with Afaan Oromo linguistics experts regarding the linguistic nature of the language, such as its grammatical structure and morphology. 1.7.3 Data Collection Since the intention of this research is to detect word sequences in Afaan Oromo news texts and categorize them into their respective types, the appropriate data were gathered from Afaan Oromo news texts currently used in the communication media. With any learning technique, one important question is how much training data is required to achieve acceptable performance. In order to achieve reasonable accuracy in a WSP system, it is desirable to have a large corpus, and the determination of high-order training parameters is inevitable. 1.7.4 Data Preparation After gathering the necessary data, containing at least one predicted word sequence, pre-processing has to be done, because no pre-processed corpus is yet available for Afaan Oromo. This is the biggest hindrance to the advancement of research on Afaan Oromo language processing applications. Pre-processing steps are carried out before the training process in order to normalize all data sets and reduce merging errors. These pre-processing tasks include:
  • 22. Annotation This is the process in which the training corpora are converted into a format that can be applied in the training module. It includes: i) Tokenization – the whole document is split into constituent sentences, and each sentence is further split into its constituent words to ease the subsequent training process. ii) POS Tagging – the process of assigning the correct part of speech to each word in a sentence. In a WSP system, POS tagging helps to automatically identify all words that are prediction candidates according to their POS tags, and it describes how likely nearby words are to be taken as a single prediction unit based on their POS tags, so that they can be chunked together. iii) Chunking – the process of identifying and breaking a large text into logical prediction units. iv) WSP Tagging – after the above processes are done, each word and chunked phrase is tagged with its corresponding word sequence. Such a WS-tagged (annotated) corpus is then in trainable form and can be applied in the subsequent processes (a minimal sketch of this pipeline follows at the end of this subsection). 1.7.5 Development Techniques There are many approaches used to develop WSP for any natural language (NL). They are broadly divided into rule-based, machine learning (ML) and hybrid approaches; the ML (statistical) approach is subdivided into supervised, semi-supervised and unsupervised approaches. Among all these, the present research uses the statistical method to develop a WSP system for Afaan Oromo. The reason is that, in the statistical approach, a corpus is first studied and a training module is built in which the system is trained to identify word predictions; from their occurrences in the corpus, in a particular context and class, a probability value is computed, and whenever text is entered, results are fetched based on these probability values. For better effectiveness, a large amount of annotated training data is required. Further, among ML approaches, the current study uses a supervised approach, and of the methods of this type, CRF is the one proposed for the work.
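The following minimal sketch illustrates the tokenization and tagging steps described above, writing out a tagged corpus with one token-tag pair per line; the regular-expression tokenizer, the placeholder lookup_pos function and the tag names are illustrative assumptions rather than the morphological tools actually used in this work.

import re

def tokenize(text):
    # Split into sentences on ., ? and !, then into word tokens
    # (keeping the apostrophe, which Qubee spelling uses).
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    return [re.findall(r"[\w']+", s) for s in sentences if s]

def lookup_pos(token):
    return "NN"  # placeholder: a trained Afaan Oromo POS tagger goes here

def to_tagged_corpus(text):
    lines = []
    for sentence in tokenize(text):
        for token in sentence:
            lines.append(f"{token}\t{lookup_pos(token)}")
        lines.append("")  # blank line marks a sentence boundary
    return "\n".join(lines)

print(to_tagged_corpus("Barataan kitaaba dubbisa. Isheen gara mana deemti."))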
  • 23. 1.7.5.1 Feature Set The features applied to the Afaan Oromo WSP (AOWSP) task are: context word features (the previous and next words of a particular word); the suffix and prefix of the current, previous and/or next token; the part-of-speech (POS) tags of the current and/or surrounding word(s); and morphological features such as case, aspect, voice and tense. 1.7.5.2 Conditional Random Fields (CRF) Conditional Random Fields (CRFs) are a class of statistical modelling methods generally applied in machine learning and pattern recognition for structured prediction. They extend both Maximum Entropy Models (MEMs) and Hidden Markov Models (HMMs) and were first introduced by Lafferty et al. (2001). Whereas an ordinary classifier predicts a label for a single sample without considering neighbouring samples, a CRF can take context into account. It is a discriminative, undirected probabilistic graphical model, used to encode known relationships between observations and to construct consistent interpretations. To take context into account, the prediction is modelled as a graphical model that implements dependencies between the predictions; the kind of graph used depends on the application. In natural language processing, for example, linear-chain CRFs are popular, as they implement sequential dependencies in the predictions (Wikipedia, 2017). A conditional random field defines the conditional probability distribution P(Y|X) of a label sequence given an input word sequence, where Y is the class-label sequence and X denotes the observed word sequence. A commonly used special case of CRFs is the linear chain, which has the distribution

P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)    (1)

where f_k(y_{t-1}, y_t, x, t) is usually an indicator function, \lambda_k is the learned weight of feature k, and Z_x is the normalization factor that sums the probability over all state sequences. The feature functions can measure any aspect of a state transition from y_{t-1} to y_t and of the entire observation sequence x, centred at the current time step t.
  • 24. As an example of this flexibility, Yeh et al. (2015) use three conditional random field models to calculate the conditional probability of missing sentences, redundant sentences, disordered sentences and erroneous sentence selection.
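As a concrete illustration of Sections 1.7.5.1 and 1.7.5.2, the sketch below trains a linear-chain CRF over per-token feature dictionaries of the kind listed above. It assumes the third-party sklearn-crfsuite package (the thesis itself names only Python and its natural language toolkit); the two-sentence corpus and the tag set are invented placeholders.

import sklearn_crfsuite

def token_features(sentence, i):
    # Per-token feature dictionary mirroring Section 1.7.5.1: the word
    # itself, its prefix and suffix, and the surrounding context words.
    word = sentence[i]
    return {
        "word": word.lower(),
        "prefix2": word[:2],
        "suffix3": word[-3:],
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sentence[i + 1].lower() if i + 1 < len(sentence) else "<EOS>",
    }

# Invented two-sentence corpus with per-token tags (placeholders only).
train_sents = [["Barataan", "kitaaba", "dubbisa"], ["Isheen", "hojii", "hojjetti"]]
train_tags = [["NN", "NN", "VV"], ["PN", "NN", "VV"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, train_tags)
print(crf.predict(X_train))  # decodes the most probable tag sequences

The per-token dictionaries play the role of the feature functions f_k in Equation (1), and prediction with the fitted model corresponds to decoding the most probable tag sequence.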
  • 25. Advantages of CRF CRF is language independent, i.e., it can be used for any language domain. It is easily understandable and easy to implement and analyse. It can be used with any amount of data, so the system is scalable. It is able to solve sequence labelling problems very efficiently. It does not have fixed states, i.e., it can be used with any number of hidden tag sets, and so it is dynamic in nature (Ammar et al., 2014). Generally, CRF is used in the WSP system because of the efficiency of the Viterbi algorithm used in decoding the WP-class state sequence. 1.7.6 Training the CRF After completion of the annotation tasks, the annotated data is sent to the CRF training module, which estimates the necessary parameters of the CRF; these are used in later processes to decide on the optimal tag sequence. In this module, the basic task is the estimation of the CRF parameters from the training data. The concepts involved are explained as follows, with a counting sketch after the list: A) States – the various types of tags in the tagged corpus are the CRF states. The word sequence tag types within the scope of the current study are first stored as a state vector. B) Start Probability – the probability that a particular predicting tag comes at the start of a sentence in the training corpus, that is, the fraction of training sentences that begin with tag T_i. This probability is an important factor because, based on the first tag, it is possible to determine what the succeeding tag will likely be. C) Transition Probability – the probability of the occurrence of one word tag (T_i) immediately following the preceding word tag (T_{i-1}); that is, P(T_i | T_{i-1}). D) Emission Probability – the probability of assigning a particular tag to a word in the corpus or document, estimated from tag-word co-occurrence counts.
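The start and transition quantities in items B and C can be estimated by simple counting over the tagged corpus, as in the sketch below. Strictly speaking, a CRF learns feature weights rather than these normalized probabilities, and the toy tag sequences here are invented for illustration.

from collections import Counter, defaultdict

# Toy tag sequences standing in for the annotated corpus (invented).
tag_sequences = [["PN", "NN", "VV"], ["NN", "NN", "VV"], ["PN", "VV"]]

start_counts = Counter(seq[0] for seq in tag_sequences)
trans_counts = defaultdict(Counter)
for seq in tag_sequences:
    for prev, cur in zip(seq, seq[1:]):
        trans_counts[prev][cur] += 1

# Start probability: fraction of sentences beginning with each tag.
start_prob = {t: c / len(tag_sequences) for t, c in start_counts.items()}

# Transition probability P(T_i | T_{i-1}) from adjacent tag-pair counts.
trans_prob = {prev: {cur: c / sum(cnt.values()) for cur, c in cnt.items()}
              for prev, cnt in trans_counts.items()}

print(start_prob)        # {'PN': 0.667, 'NN': 0.333} (approximately)
print(trans_prob["NN"])  # {'VV': 0.667, 'NN': 0.333} (approximately)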
  • 26. During training, the derivative of the conditional log-likelihood with respect to each weight \lambda_k is the difference between the empirical count of feature f_k in the training data and its expected count under the model:

\frac{\partial \log P_\Lambda(y \mid x)}{\partial \lambda_k} = \sum_{t=1}^{T} f_k(y_{t-1}, y_t, x, t) - \sum_{t=1}^{T} \sum_{y', y''} P_\Lambda(y_{t-1} = y', y_t = y'' \mid x) \, f_k(y', y'', x, t)

When the two counts are the same, the derivative is zero; therefore, training can be thought of as finding the \lambda's that match the two counts (Groza et al., 2012). 1.7.7 Testing As soon as knowledge of the optimal tagging is available, the testing phase can be performed. After all the parameters above are calculated, they are applied in the Viterbi algorithm, with a separate test sentence as the observation, to find the word prediction. First, the test sentence is tokenized, POS tagged and chunked, and each token is passed to the Viterbi algorithm, which decides the tag for the token. The tag decision is based on the knowledge acquired from the training corpus. The idea behind the algorithm is that, of all the state sequences, only the most probable one needs to be considered. Finally, the output of the algorithm is the most likely sequence of tags generating the given sequence of observations, according to the highest probability value it yields (Geleta, 2020). 1.7.8 Development Tools A morphological analyser and generator program is used to build the tagged training corpus and to produce surface words; similarly, the morphological analyser is used to analyse user input words. A morphological analyser and synthesizer tool is freely available for Afaan Oromo. Moreover, the Python programming language is used to develop the prototype for demonstration. Python was selected because it has a natural language toolkit module that provides predefined functions for the implementation of CRF modelling. 1.7.9 Prototype Development A prototype will be developed in order to check whether our study works in accordance with the ideas and theories of word sequence prediction. 1.7.10 Evaluation Prototype development is one of the objectives of this study, in order to demonstrate and evaluate the developed model. Test data tagged with morphological features will be used, and the
  • 27. prediction activity is evaluated through the calculation of keystroke savings. Keystroke savings (KSS) estimates the percentage of effort saved and is calculated by comparing the total number of keystrokes needed to type a text (KT) with the effective number of keystrokes used with word prediction (KE) (Aliprandi et al., 2008); that is, KSS = ((KT - KE) / KT) x 100%. Therefore, the number of keystrokes needed to type texts taken from the test data, with and without the word sequence prediction program, will be counted in order to calculate the keystroke savings accordingly. The obtained KSS values will be compared across the CRF models, and the model that shows the maximum keystroke saving is considered the better model.
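A small sketch of the KSS computation described above, assuming the keystroke counts KT and KE have already been measured for a test text; the example numbers are invented.

def keystroke_savings(kt, ke):
    # KSS = (KT - KE) / KT * 100, the percentage of keystrokes saved.
    if kt <= 0:
        raise ValueError("KT must be positive")
    return (kt - ke) / kt * 100.0

# E.g. a text needing 1,000 keystrokes is typed with only 640 using prediction:
print(f"KSS = {keystroke_savings(1000, 640):.1f}%")  # KSS = 36.0%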
CHAPTER TWO
LITERATURE REVIEW

2.1 Natural Language Processing

Words are the fundamental building block of language. Every human language, spoken, signed, or written, is composed of words. Every area of speech and language processing, from speech recognition to machine translation to information retrieval on the web, requires extensive knowledge about words. Psycholinguistic models of human language processing and models from generative linguistics are also heavily based on lexical knowledge (Pollock et al., 2018).

Natural Language Processing (NLP) is the computerized approach to analysing text that is based on both a set of theories and a set of technologies, and it is a very active area of research and development. NLP is a theoretically motivated range of computational techniques for analysing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications (Wilks, 1996).

NLP technologies are becoming extremely important in the creation of user-friendly decision-support systems for everyday non-expert users, particularly in the areas of knowledge acquisition, information retrieval and language translation (Bose, 2004). An NLP system must possess considerable knowledge about the structure of the language itself, including what the words are, how to combine the words into sentences, what the words mean, and how these word meanings contribute to the sentence meaning. The system needs methods of encoding and using this knowledge in ways that will produce the appropriate behaviour. Furthermore, knowledge of the current situation (or context) plays a crucial role in determining how the system interprets a particular sentence (Bose, 2004).
In NLP, the contribution of the Computer Science discipline is to develop algorithms that specify the internal representation of data and define the ways they can be efficiently processed in a computer (Bose, 2004).

The idea of giving computers the ability to process human language is as old as the idea of computers themselves. Research in natural language processing has been going on for several decades, dating back to the late 1940s (Wilks, 1996). In those times it was a difficult, though not impossible, task, for it was based on human-made rules. Preparing rules requires extensive involvement of talented linguistic experts and is time-consuming work. In recent years, research in the area of NLP has increased rapidly. These factors shifted NLP from rule-based approaches to machine-learning approaches. The existence of machine-learning approaches has resulted in an open and comfortable environment that encourages the development of different NLP-based systems. Thus, NLP has become an applied, rather than a purely theoretical, science.

After more than sixty years, natural language systems are still very complicated to design, and they are still not perfect, because human language is complex and it is difficult to capture the entire linguistic knowledge needed for one hundred percent accuracy in processing. However, NLP technologies are becoming extremely vital for easing access to systems, thereby making those systems more user-friendly. In the future, NLP is expected to further improve existing systems, and we hope to see technologically intelligent and more user-friendly systems.

NLP provides a good baseline for both the theory and the implementation of a range of applications; in fact, any application associated with text is a candidate for NLP. The researcher has explained the basic idea of this discipline before dealing with our main concept, which is word sequence prediction. This chapter therefore discusses fundamental concepts of word sequence prediction and ideas associated with the Afaan Oromo language. Prediction methods such as statistical, knowledge-based and heuristic approaches are presented in order to give a clear overview of the topic. The main target of this study is to design and develop a word sequence prediction model for the Afaan Oromo language. Hence, the morphological characteristics, grammatical properties, and parts of speech of the language are discussed in the respective sections of this chapter.
2.2 Word Prediction

Word prediction is about guessing what word the user intends to write, for the purpose of facilitating the text production process. Sometimes a distinction is made between systems that require the initial letters of an upcoming word to make a prediction and systems that may predict a word regardless of whether the word has been initialized or not. The former systems are said to perform word completion, while the latter perform proper word prediction (Gustavii et al., 2003).

We can easily define the term word sequence prediction once we capture the essence of its two components, "sequence" and "prediction". A sequence is a finite or infinite list of terms (or numbers or things) arranged in a definite order; that is, there is a rule by which each term after the first may be found. Prediction is concerned with guessing the short-term evolution of certain phenomena. Forecasting tomorrow's temperature at a given location, or guessing which asset will achieve the best performance over the next month, are examples of prediction problems. One must predict the next element of an unknown sequence given some knowledge about the past elements and possibly other available information. The entities involved in the forecasting task are the elements forming the sequence, the criterion used to measure the quality of a forecast, the protocol specifying how the predictor receives feedback about the sequence, and any possible side information provided to the predictor. Therefore, word sequence prediction is forecasting or guessing the next word the user intends to write or insert, based on some previous information (n.d., 2011).

The first users of word prediction systems have traditionally been physically disabled persons. For people with motor impairments that make it difficult to type, a word prediction system optimally reduces spelling errors and the time and effort needed for producing a text, as the number of keystrokes decreases.

2.3 Historical Background

The inception of the concept of word prediction takes us back to the end of the Second World War, when the number of people with disabilities increased dramatically. In order to help them communicate with the outside world, assistive technologies such as Augmentative and Alternative Communication systems were developed. The field of Augmentative and Alternative Communication (AAC) is concerned with mitigating communication barriers that
would isolate individuals from society. Basically, one way to improve communication rate is to decrease the number of keys entered to form a message, and the goal of saving keystrokes requires estimating the next letter, word or phrase likely to follow a given segment of text. As a result, by the early 1980s word prediction techniques had been established as a method in the development of AAC systems. Since the 1980s, many systems using different methods have been developed for different languages.

According to Shannon, human languages are highly redundant, and these redundancies can be captured in language models. To this end, the goal of language modelling is to capture and exploit the restrictions imposed on the way in which words can be combined to form sentences; it describes how words are arranged in natural language. Word prediction is also applied in language modelling applications to guess the next word given the previous words (Wakshum, 2017).

2.4 Approaches to Word Sequence Prediction

Word prediction methods can be classified as statistical, knowledge-based and heuristic (adaptive) modelling. Most existing methods employ statistical language models using word n-grams and POS tags.

2.4.1 Statistical word prediction

In statistical modelling, the choice of words is based on the probability that a string may appear in a text; consequently, a natural language can be considered a stochastic system, and such modelling is also called probabilistic modelling. The statistical information and its distribution can be used for predicting letters, words, phrases, and sentences (Ghayoomi & Momtazi, 2009). Statistical word prediction is based on the Markov assumption, in which only the last n−1 words of the history affect the succeeding word; this is named the n-gram Markov model. It is based on learning parameters from large corpora. However, one of the challenges of this method arises when the text being written with the help of the word prediction system is of a different style than the training data (Tesfaye, 2011). Traditionally, predicting words has been based solely on this kind of statistical modelling of the language.
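To make the n-gram idea concrete, the following is a minimal bigram (n = 2) next-word predictor in Python; the toy sentences and function names are illustrative assumptions, not part of the thesis implementation.

    from collections import Counter, defaultdict

    def train_bigram(sentences):
        """Count bigram frequencies: previous word -> Counter of next words."""
        model = defaultdict(Counter)
        for sentence in sentences:
            for prev, nxt in zip(sentence, sentence[1:]):
                model[prev][nxt] += 1
        return model

    def predict_next(model, prev_word, k=3):
        """Suggest the k most frequent next words after prev_word."""
        return [w for w, _ in model[prev_word].most_common(k)]

    # Toy Afaan Oromo corpus (illustrative only).
    sentences = [
        "inni gara mana deeme".split(),
        "isheen gara mana dhufte".split(),
        "inni kaleessa dhufe".split(),
    ]
    model = train_bigram(sentences)
    print(predict_next(model, "gara"))  # ['mana']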
2.4.2 Knowledge-based word prediction

Systems that use only statistical modelling for prediction often predict words that are grammatically inappropriate. They therefore impose a heavy cognitive load on the user, who must choose the intended word, and as a result the writing rate decreases. Knowledge-based modelling omits inappropriate words from the prediction list and gives more accurate results to the user (Makkar et al., 2015). Word prediction systems that merely use statistical modelling often present words that are syntactically, semantically, or pragmatically inappropriate, imposing a heavy cognitive load on users in addition to decreasing writing speed. To solve this problem, syntactic, semantic and pragmatic linguistic knowledge can be used in prediction systems (Tesfaye, 2011).

2.4.2.1 Syntactic knowledge for word prediction

Syntactic prediction is a method that tries to present words that are syntactically appropriate at that position in the sentence; that is, the syntactic structure of the language is used. In syntactic prediction, the Part-of-Speech (POS) tags of all words in a corpus are identified and the system uses this syntactic knowledge for prediction. Statistical syntax and rule-based grammar are the two general syntactic prediction methods. This approach includes various probabilistic and parsing methods such as Markov models and artificial neural networks (Ghayoomi & Momtazi, 2009).

2.4.2.2 Semantic prediction

Some of the items in the prediction list may be semantically wrong even though they are syntactically right, so suggesting words that are both syntactically and semantically correct increases the accuracy of the predictions. To reach this goal, extensive semantic knowledge is tagged to the words and phrases in a corpus. In most semantic prediction, the appearance of a specific word with particular content is a clue that increases the probability of other words that have semantic relationships with that word. PROSE is an example of a system that used semantic knowledge of the English language (Ghayoomi & Momtazi, 2009). Two methods are used for semantic prediction. One is a lexical source, such as WordNet for English, which measures the semantic probability of words to ensure that the
predicted words are semantically related in that context. The other method is the lexical chain, which assigns the highest priority to words that are semantically related to the context; words unrelated to the context are removed from the prediction list.

2.4.2.3 Pragmatic prediction

Predictions can also be improved by pragmatics. Adding this method to the prediction procedure attempts to filter out words that may be correct syntactically and semantically but are wrong according to the discourse. The pragmatic knowledge is likewise tagged to the words in a corpus. Suggesting words that are pragmatically correct increases the accuracy of predictions as well (Delbeto, 2018a).

2.4.3 Heuristic word prediction

The heuristic (adaptive) method is used to make more appropriate predictions for a specific user, and it is based on two general forms of learning: short-term and long-term. In short-term learning, the system adapts to a user within the current text being typed by that individual user; recency promotion, topic guidance, trigger and target, and n-gram caching are methods a system can use to adapt to a user within a single text. In long-term learning, the previous texts produced by a user are considered (Tessema, 2014).

2.5 Word Prediction for Western Languages

Several studies have been conducted on word prediction for western languages such as Italian, Swedish, English, German, French, and Dutch. Aliprandi et al. (Aliprandi et al., 2007) designed a letter and word prediction system called FastType for the Italian language. Italian has a large dictionary of word forms, which carry a number of morphological features and are produced from a root or lemma and a set of inflection rules. Statistical and lexical methods with robust open-domain language resources, refined to improve keystroke saving, are used. The user interface, the predictive engine and the linguistic resources are the main components of the system. The predictive engine is the kernel of the predictive module, since it manages communication with the user interface, keeping track of the prediction status and the words already typed. Morpho-syntactic agreement and lexicon coverage, together with efficient access to linguistic
resources such as the language model and very large lexical resources, are the core functionalities of the predictive module. In addition, to enrich the morphological information available to the prediction engine, POS n-grams and tagged-word (TW) n-grams are used. The prediction algorithm for Italian extends a combination of POS tri-gram and simple word bi-gram models. A large corpus prepared from newspapers, magazines, documents, commercial letters and emails is used to train the Italian POS n-grams, approximated to n = 2 (bi-grams) and n = 3 (tri-grams), and the tagged-word n-grams, approximated to n = 1 (uni-grams) and n = 2 (bi-grams). Keystroke saving (KS), keystrokes until completion (KUC) and word type saving (WTS) are the three parameters used to evaluate the system. The researchers indicate that 40 texts disjoint from the training set are used for testing; however, the size or number of words in the testing data is not clearly specified. The results show 51% keystroke saving, which is comparable to what has been achieved by word prediction methods for non-inflected languages. Moreover, on average 29% WTS, meaning a saving in time at standard speed without any cognitive load, and 2.5 KUC are observed.

Matiasek et al. (Matiasek et al., 2002) carried out a multilingual text prediction study and developed a system named FASTY. The aim of this work is to offer a communication support system that significantly increases typing speed and adapts to users with different languages and strongly varying needs. It follows a generic approach in order to be multilingual, so that the concept can be used for most European languages; this study, however, focused on the German, French, Dutch and Swedish languages. The predictor and the language-specific resources are separated by language-independent prediction software, which gives the system potential application to many European languages without sacrificing performance. Preliminary experiments with German, as well as experience with a Swedish system, have shown that n-gram based methods still offer quite reasonable predictive power. N-gram statistics, morphological processing with a backup lexicon, and abbreviation expansion are the core components of this system. The frequency tables of word n-grams are easily constructed from text corpora irrespective of the target language, and incorporating Part-of-Speech (POS) information provides additional precision. The combination of different n-gram statistics constitutes the base of the FASTY predictor, providing a baseline performance for all target languages; other modules interact with these results and improve on them. Morphological analysis and synthesis are performed, and the morpho-syntactic features needed by
the components that check syntactic appropriateness are extracted, since one of FASTY's goals is to suggest only word forms appropriate for the current context. Compound prediction also needs morpho-syntactic information about compound parts in order to correctly predict linking elements. Last but not least, if the frequency-based lexica run out of words with a given prefix, the morphological lexicon serves as a backup lexicon and delivers additional suggestions. Morphological processing is implemented via finite-state transducers, which provide very fast, bi-directional processing and allow a very compact representation of huge lexica. A grammar-based module is used to enhance the predictive power of FASTY and improve its precision through syntactic processing, so that only predictions that do not conflict with the grammar are delivered.

Carlberger et al. (Carlberger et al., 2015) conducted a study constructing a database for the Swedish word prediction system Profet, extending an existing system that uses a word frequency lexicon, a word pair lexicon, and subject lexica. Profet is a statistics-based word prediction system that has been used for a number of years as a writing aid by persons with motoric disabilities and linguistic impairments. It offers one to nine word alternatives as a user starts spelling a word, depending on the selected settings. The main task of this work is to enhance the existing prediction capability by extending the scope, adding grammatical, phrasal and semantic information, and using a probability-based system, which allows information from multiple sources to be weighted appropriately for each prediction. The predictor scope is extended by considering preceding words in the prediction; prediction is therefore also based on previous words even before any letters of the new word have been typed. This makes the word suggestions grammatically more correct than those previously given. Since the existing database lacked grammatical information as well as statistics for sequences longer than two contiguous words, a new database was built. Besides bi-grams (word and grammatical-tag pairs with co-occurrence statistics), tri-grams as well as collocations (non-contiguous sequential word and grammatical-tag bi-grams with 2-5 intervening words) are included. All information in the new database, including collocations, must be extracted from one single corpus in order to warrant the implementation of a probabilistic prediction function. This work also extends the previous version of Profet, which presented one word per line, by displaying more than one word per line. It is reported that choosing words from the word alternatives can result in up to 26% keystroke savings (KSS) and
up to 34% in letters when only one word is typed.

2.5.1 Word Prediction for English

Antal van den Bosch (Van Den Bosch, 2005) proposed a classification-based word prediction model based on IGTREE, a decision-tree induction algorithm with favourable scaling abilities. Token prediction accuracy, token prediction speed, number of nodes and discrete perplexity are the evaluation metrics used in this work. Through a first series of experiments, it is demonstrated that the system exhibits log-linear increases in prediction accuracy and decreases in discrete perplexity (a new evaluation metric) with increasing numbers of training examples, while the induced trees grow linearly with the amount of training examples. Trained on 30 million words of newswire text, prediction accuracy reaches 42.2% on the same type of text. A second series of experiments shows that this generic approach to word prediction can be specialized to confusable prediction, yielding high accuracies on nine example confusable sets in all genres of text. The confusable-specific approach outperforms the generic word-prediction approach, but with more data the difference decreases.

Agarwal and Arora (Agarwal, n.d.) proposed a context-based word prediction system for SMS messaging in which context is used to predict the most appropriate word for a given code. The growth of wireless technology has provided alternative ways of communication such as the Short Message Service (SMS), and with the tremendous increase in mobile text messaging, there is a need for an efficient text input system. With limited keys on the mobile phone, multiple letters are mapped to the same number (8 keys, 2 to 9, for 26 letters). This many-to-one mapping of letters to numbers gives the same numeric code for multiple words. The T-9 system predicts the correct word for a given numeric code based on frequency, which may not give the correct result much of the time. For example, for the code '63', two possible words are 'me' and 'of'. Based on a frequency list where 'of' is more likely than 'me', the T-9 system will always predict 'of' for code '63'. So, for a sentence like 'Give me a box of chocolate', the prediction would be 'Give of a box of chocolate'. The sentence itself gives us information about what the correct word for a given code should be; consider the above sentence with blanks, 'Give _ a box _ chocolate'. Current systems for word prediction in text messaging predict the word for a code based on its frequency obtained from a huge corpus. However, the word at a particular position in a sentence depends on its context, and this intuition motivated the authors to use machine learning algorithms to predict a word based on
its context. The system also takes into consideration the proper English words for the codes corresponding to words in informal language. The proposed method uses machine learning algorithms to predict the current word given its code and the previous word's part of speech (POS). Training was done on about 19,000 emails and testing on about 1,900 emails, each email consisting of 300 words on average. The results show a good improvement of 31% over traditional frequency-based word estimation.

Trnka (Trnka, 2010) conducted research on adaptive language modelling for word prediction in AAC. AAC devices are highly specialized keyboards with speech synthesis, typically providing single-button input for common words or phrases but requiring a user to type letter by letter for other words, called the fringe vocabulary. Word prediction helps speed up the AAC communication rate. Previous research by different scholars used n-gram models; at best, modern devices utilize a trigram model and very basic recency promotion. However, one of the lamented weaknesses of n-gram models is their sensitivity to the training data. The objective of this work is to develop and integrate style adaptations, building on experience with topic models, so that the model dynamically adapts both topically and stylistically. The problem of balancing training size and similarity is addressed by dynamically adapting the language model to the most topically relevant portions of the training data. The author presents the results of experimenting with different topic segmentations and relevance scores in order to tune existing methods for topic modelling. The inclusion of all the training data, as well as the usage of frequencies, addresses the problem of sparse data in an adaptive model. The work demonstrates that topic modelling can significantly increase keystroke savings for traditional testing as well as for testing on text from other domains. It also addresses the problem of annotated topics through fine-grained modelling, which is found to be a significant improvement over a baseline n-gram model.

2.6 Word Prediction for Hebrew Language

Netzer et al. (n.d.) conducted research on word prediction for Hebrew as part of an effort for Hebrew AAC users. Modern Hebrew is characterized by rich morphology with a high level of ambiguity; morphological inflections such as gender, number, person, tense and construct state appear in Hebrew lexemes. In addition, better predictions are achieved when the language model is trained on a larger corpus. In this work, the hypothesis that additional morpho-syntactic knowledge is required to obtain high precision is evaluated. The language model is trained on uni-grams, bi-grams and
tri-grams, and experiments are made with four sizes of selection menu: 1, 5, 7 and 9 words, each selection counted as one additional keystroke. According to the results, the researchers state that syntactic knowledge does not improve keystroke savings and even decreases them, as opposed to what was originally hypothesized. The results show keystroke savings of up to 29% with nine word proposals, 34% with seven word proposals and 54% with a single proposal. Contrary to other works, KSS improved as the size of the selection menu was reduced. We believe that an increase in the number of proposals affects search time; however, the effect of the selection menu's size on KSS is not clear, and no justification is given by the researchers.

2.7 Word Prediction for Persian Language

Ghayoomi and Daroodi (Ghayoomi & Daroodi, 2006) studied word prediction for the Persian language using three approaches. Persian is a member of the Indo-European language family and has many features in common with its relatives in terms of morphology, syntax, phonology and lexicon. This work is based on bi-gram, tri-gram and 4-gram models and utilized around 10 million tokens in the collected corpus. The first approach uses word statistics; the second adds the main syntactic categories of a Persian POS-tagged corpus; and the third uses the main syntactic categories along with their morphological, syntactic and semantic subcategories. According to the researchers, evaluation shows 37%, 38.95%, and 42.45% KSS for the first, second and third approaches respectively.

2.8 Word Prediction for Russian Language

Hunnicutt et al. (Hunnicutt et al., n.d.) performed research on Russian word prediction with morphological support, as a co-operative project between two research groups in Tbilisi and Stockholm. This work extends a word predictor developed by the Swedish partner for other languages in order to make it suitable for Russian. Inclusion of a morphological component was found necessary, since Russian is much richer in morphological forms. In order to develop the Russian language database, an extensive text corpus containing 2.3 million tokens was collected. It provides inflectional categories and the resulting inflections for verbs, nouns and adjectives. With this, the correct word forms can be presented in a consistent manner, which allows a user to easily choose the desired word form. The researchers introduced special operations for constructing word forms from a word's morphological components. Verbs are the most complex word class, and an algorithm for
expanding the root forms of verbs to their inflectional forms is presented. The system suggests successful completion of verbs along with the remaining inflectable words.

2.9 Word Prediction for Sindhi Language

Mahar and Memon (Mahar & Memon, 2011) studied word prediction for the Sindhi language based on bi-gram, tri-gram and 4-gram probabilistic models. Sindhi is morphologically rich and has great similarity with the Arabic, Persian and Urdu languages. It is a highly homographic language, and texts are written without diacritic symbols, which makes the word prediction task very difficult. A corpus is very important for the statistical language modelling of any language; hence, in this work, word frequencies are calculated using a corpus of approximately 3 million tokens, and a tokenization algorithm is developed to segment words. The add-one smoothing technique is used to assign non-zero probabilities to events with zero counts. 15,000 sentences randomly selected from the prepared corpus are used to evaluate the developed models based on entropy and perplexity. According to the evaluation, the 4-gram model is the most suitable, since it has lower perplexity than the bi-gram and tri-gram models.
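For reference, add-one (Laplace) smoothing inflates every count by one, so unseen word pairs receive a small non-zero probability. A minimal sketch, with illustrative variable names:

    def addone_bigram_prob(bigram_count, prev_count, vocab_size):
        """Add-one (Laplace) smoothed bigram probability:
        P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)."""
        return (bigram_count + 1) / (prev_count + vocab_size)

    # An unseen bigram (count 0) still gets a small probability.
    print(addone_bigram_prob(0, 50, 10_000))   # ~9.95e-05
    print(addone_bigram_prob(12, 50, 10_000))  # ~1.29e-03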
Summary

In this chapter, we have reviewed linguistic characteristics of Afaan Oromo such as parts of speech, morphology and stem/root forms. We saw that Afaan Oromo nouns are inflected for number, gender and case; verbs are inflected for number, gender, tense, voice and aspect; and adjectives are inflected for number and gender. In addition, we have discussed different word prediction approaches, namely statistical, knowledge-based and heuristic methods, together with the strengths and weaknesses of each. Based on the above review, Conditional Random Field based modelling is adopted.
CHAPTER THREE
RELATED WORK

3. Word Prediction for Ethiopian Languages

3.1 For Amharic

Nesredin Suleiman and Solomon Atnafu (Chang, 2008) conducted research on word prediction for Amharic online handwriting recognition. As the researchers state, the study is motivated by the fact that the speed of data entry can be enhanced by integrating online handwriting recognition with word prediction, mainly for handheld devices. The main target of the work is to propose a word prediction model for Amharic online handwriting recognition using statistical information such as the frequency of occurrence of words. A corpus of 131,399 Amharic words and 17,137 names of persons and places was prepared. The prepared corpus is used to extract statistical information such as the value of n for the n-gram model, the average word length of Amharic, and the most frequently used Amharic word length. Based on this statistical information, n is set to 2, and accordingly the research uses a bi-gram model in which the intended word is predicted by looking at the first two characters. Finally, a prototype was developed to evaluate the performance of the proposed model, and 81.39% prediction accuracy was obtained in the experiment.

Alemebante Mulu and Goyal (Mulu et al., 2013) performed research on an Amharic text prediction system for mobile phones. In this work, they designed a text prediction model for Amharic using a corpus of 1,193,719 Amharic words, 242,383 Amharic lexicon entries, and a list of names of persons and places with a total size of 20,170. To show the validity of the word prediction model and the designed algorithm, a prototype was developed. The Amharic text prediction system describes the data entry techniques used to enter data into mobile devices such as smartphones. Data entry can be either predictive or non-predictive; in the predictive case, the first two characters are written and all predicted words are listed, ordered by word frequency and alphabetically when frequencies are equal. The experiment was conducted on the authors' database (lexicon) to measure the accuracy of the Amharic text prediction engine, and a prediction accuracy of 91.79% was achieved.

Tigist Tensou (Tessema, 2014) performed research on word sequence prediction for
Amharic. In this work, an Amharic word sequence prediction model is developed using statistical methods and linguistic rules. Statistical models are constructed for the root/stem, and morphological properties of words such as aspect, voice, tense and affixes are modelled using the training corpus. Morphological features like gender, number and person are captured from a user's input to ensure grammatical agreement among words. Initially, root or stem words are suggested using root or stem statistical models; then, morphological features for the suggested root/stem words are predicted using voice, tense, aspect and affix statistical information together with the grammatical agreement rules of the language. Predicting morphological features is essential in Amharic because of its high morphological complexity; this step is not required in less inflected languages, where it is possible to store all word forms in a dictionary. Finally, surface words are generated from the proposed root or stem words and morphological features. In their experiments, word sequence prediction using a hybrid of bi-gram and tri-gram models offers better keystroke savings in all scenarios; for instance, when using test data disjoint from the training corpus, 20.5%, 17.4% and 13.1% keystroke savings are obtained with the hybrid, tri-gram and bi-gram models respectively. Evaluation of the model is performed using a developed prototype with keystroke savings (KSS) as the metric. According to their experiment, the prediction result using the hybrid of bi-gram and tri-gram models has the highest KSS. Therefore, statistical and linguistic rules have quite good potential for word sequence prediction for Amharic.

3.2 For Afaan Oromo

Gudisa Tesema (Tesema & Abate, 2013) made the first attempt to design and develop word prediction on the mobile phone for Afaan Oromo. In his study, word prediction is treated as a classification task: using SVM, bags of words are created for words of the same class, and a predictor model is constructed using HMM for the prediction purpose. The major focus of this work is to classify words into classes without considering morphological information.

Wakshum Temesgen (Wakshum, 2017) built a language model from stem forms, together with morphological features such as tense, case, person, gender and number. The model suggests the next word to be typed by a user in three phases. First, the most probable stem forms are predicted using the language model. Second, morphological features are predicted for the proposed stem forms. Lastly, the proposed root
or stem words and morphological features are used by a morphological synthesizer to generate appropriate surface words.

Ashenafi Bekele (Delbeto, 2018b) conducted research to develop Afaan Oromo word sequence prediction using an n-gram model. The designed model is evaluated through a developed prototype, with Keystroke Savings (KSS) used to evaluate system performance. According to the evaluation, the primary word-based statistical system achieved 20.5% KSS, and the second system, which used syntactic categories with word statistics, achieved 22.5% KSS. Therefore, statistical and linguistic rules have good potential for word sequence prediction for Afaan Oromo.

3.3 Summary

In this chapter, we have discussed work related to word sequence prediction for different languages. A word completion study specifically targeted at online handwriting recognition for Amharic, done using a purely frequency-based method, was also presented; this approach is very challenging for inflected languages due to the large number of possible word forms. Therefore, this research aims to fill the remaining gap in the existing work, so that words can be proposed in the correct morphological form by considering context information, morphological features, and linguistic rules. The user interface, the prediction module, and the linguistic resources are the main components of word prediction systems, where the linguistic resource embraces statistical or other information depending on the target language. From the reviewed works, we also learnt that considering only the frequency of words is not enough for inflected languages; that root or stem words, POS, and morphological features can be treated separately; that incorporating context information increases the effectiveness of the prediction output; and that CRF models have a good capacity to capture context information.
CHAPTER FOUR
WORD SEQUENCE PREDICTION MODEL FOR AFAAN OROMO LANGUAGE

This chapter presents the details of the Afaan Oromo word sequence prediction model. The architecture of the proposed model and its components, with their respective algorithms, are described. A CRF statistical language model is applied to offer the most expected root or stem words, and morphological features such as aspect, case, tense, and voice are used to inflect the proposed root or stem words into the appropriate word form. The Afaan Oromo word sequence predictor accepts a user's input, extracts the root or stem word and the required features by analysing the input, proposes the most likely root or stem words with their most probable features, and finally generates surface words using the proposed root or stem words and features.

4.1 Architecture of the Afaan Oromo WSP Model

The model shown in Figure 4.1 is designed to predict the words a user intends to type by considering the previous history of words. Constructing the Language Model and Generating the Predicted Words are the two major parts. First, the training corpus is morphologically analysed using the morphological analyser. Subsequently, a tagged training corpus is built from the morphologically analysed corpus. Then, language models such as root word sequences and root words with features are built from the tagged training corpus. Morphological Analysis of User Input, Word Sequence Prediction, and Morphological Generation are the key components of the Generation of Predicted Words part. Here, a user's input is accepted and analysed using the morphological analyser; subsequently, the roots and morphological features of the words are extracted so that the word prediction component can use this information to propose words by interacting with the language model. Finally, the morphological generator produces surface words for the user from the proposed roots and features.
Figure 4.1: Architecture of the AOWSP Model

4.2 Morphological Analysis of the Corpus

This component analyses the words in the training data to identify their root or stem forms and component morphemes, so that the required features and root or stem words can be extracted to build a tagged corpus. This tagged corpus is then used to construct the CRF models. A corpus is a large collection of written or spoken material in machine-readable form which can be employed in linguistic analysis, and it is the main knowledge base. Language models built from large corpora tend to perform better, particularly for infrequent words; the word prediction task therefore requires a large corpus in order to have sufficient statistical information for training the system. In this study, a text collection containing 226,252 words gathered from newspapers and social media is used.

Morphological analysis is the process of assigning each word found in a corpus to its morphemes, which can be morphological features, root/stem, or POS. It is used to annotate words with their root form and other required morphological information. A morphological analyser is a program used to analyse a separate word, or the words in a file, into
their component forms. Afaan Oromo is a morphologically rich language, as described in previous chapters. A verb lexeme can appear in more than 100,000 word forms (Bickel et al., 2005), and it is impractical to store all forms of words in probabilistic models. For this reason, the training corpus is pre-processed to hold only the root or stem and selected morphological features of words. Features are selected by studying the structure of Afaan Oromo words and the ways in which a variety of words are produced from a base word. From the morphologically analysed training corpus, a tagged corpus consisting only of the root or stem form, affixes, aspect, voice and tense is constructed. However, words that cannot be analysed by the morphological analyser are taken as they are, to keep the root or stem word sequences consistent.

Algorithm 4.1: Build a Tagged Corpus
BEGIN
  INPUT training-corpus
  ANALYSE training-corpus using the Morphological Analyser and WRITE the result to analysed-corpus
  INITIALIZE keyword lists for prefix, root word, suffix, aspect, case, tense, voice, and new word
  INITIALIZE prefix, root, suffix, aspect, case, tense, voice to 0; newWord and newWord2 to FALSE
  READ analysed-corpus
  FOR each line in analysed-corpus:
    ADD each word in the line to a list
    FOR each word in the list:
      IF word is in the newWord keywords and newWord2 is FALSE:
        SET newWord to TRUE
      ELSE IF newWord is TRUE:
        SET newWord to FALSE, newWord2 to TRUE, rootWord to word
      ELSE IF newWord2 is TRUE and word is in the prefix keywords:
        prefix = word
      ELSE IF newWord2 is TRUE and word is in the suffix keywords:
        suffix = word
      ELSE IF newWord2 is TRUE and word is in the aspect keywords:
        aspect = word
      ELSE IF newWord2 is TRUE and word is in the voice keywords:
        voice = word
      ELSE IF newWord2 is TRUE and word is in the case keywords:
        case = word
      ELSE IF newWord2 is TRUE and word is in the tense keywords:
        tense = word
      ELSE IF word is in the newWord keywords and newWord2 is TRUE:
        WRITE (rootWord+'^'+prefix+'^'+suffix+'^'+case+'^'+tense+'^'+aspect+'^'+voice) to tagged-training-corpus
        SET newWord2 to FALSE and newWord to TRUE
  OUTPUT tagged-training-corpus
END
4.3 Building the CRF Models

A language model is a store of statistical information which serves as a knowledge base when predicting suitable words. The word sequence prediction task is accomplished in two phases. In phase one, the root or stem forms of words are suggested using the root or stem CRF models. In the next phase, the morphological features of the proposed root or stem words are predicted using statistical methods. The proposed root or stem words and features are used later when generating the appropriate surface words. Therefore, building the CRF models is one of the main components of our word sequence prediction model. Statistical models of root or stem word sequences and morphological features are constructed from the tagged corpus.

4.3.1 Root or Stem Words with Aspect

The Conditional Random Field (CRF) model of root or stem words with their respective aspect is constructed by extracting and counting the occurrences of each unique root or stem word with its aspect. This model stores the frequency of each root word with its aspect. The aspect of a verb can be simplex, reciprocal, or iterative. The most frequent aspect for a particular root or stem word is used when producing surface words. Algorithm 4.2 describes the construction of the root and aspect CRF model.

Algorithm 4.2: Construct the Stem and Aspect CRF Model
BEGIN
  INPUT tagged-training-corpus
  FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and aspect using the items at index '0' and '4' of the list
    WRITE root-aspect-sequence to a file
  READ the root-aspect-sequence file
  FOR each root-aspect-sequence in the file:
    ASSIGN frequency = 0
    IF root-aspect-sequence is new:
      COUNT root-aspect-sequence and ASSIGN it to frequency
      WRITE root-aspect-sequence and frequency to a file
  OUTPUT root-with-aspect CRF model
END

4.3.2 Root or Stem Words with Voice

The unique occurrences of root or stem words with their respective voice are counted from the training corpus to build the root or stem word and voice CRF model. This model stores the frequency of each root or stem word with its respective voice. The voice can be simplex, transitive, or passive. The most frequent voice for a particular root or stem word is used when suggesting the most probable features for a given root or stem word.

Algorithm 4.3: Construct the Stem and Voice CRF Model
BEGIN
  INPUT tagged-training-corpus
  FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and voice using the items at index '0' and '5' of the list
    WRITE root-voice-sequence to a file
  READ the root-voice-sequence file
  FOR each root-voice-sequence in the file:
    ASSIGN frequency = 0
    IF root-voice-sequence is new:
      COUNT root-voice-sequence and ASSIGN it to frequency
      WRITE root-voice-sequence and frequency to a file
  OUTPUT root-with-voice CRF model
END
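Algorithms 4.2 and 4.3, like the feature-model algorithms that follow, share a single counting pattern: split each '^'-delimited record, extract the root plus one feature field, and count unique pairs. The following is a minimal Python sketch of that pattern; the function and file names are illustrative assumptions, and the feature index simply selects whichever field the record layout of Algorithm 4.1 places the desired feature in.

    from collections import Counter

    def build_feature_model(tagged_path, feature_index, out_path):
        """Count (root, feature) pair frequencies from the tagged corpus.
        Records are '^'-delimited, with the root at index 0."""
        counts = Counter()
        with open(tagged_path, encoding="utf-8") as f:
            for line in f:
                for record in line.split():
                    fields = record.split("^")
                    if len(fields) > feature_index:
                        counts[(fields[0], fields[feature_index])] += 1
        with open(out_path, "w", encoding="utf-8") as out:
            for (root, feature), freq in counts.items():
                out.write(root + "^" + feature + "\t" + str(freq) + "\n")
        return counts

    # Example call for the aspect model; the index follows Algorithm 4.2.
    # build_feature_model("tagged-training-corpus.txt", 4, "root-aspect-model.txt")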
4.3.3 Root or Stem Words with Prefix

CRF statistical information is built for three consecutive root or stem word sequences, where the last root or stem word is taken together with its prefix. This model stores the frequency of successive root or stem words with a prefix. This information is used to predict the most probable prefix for the suggested root or stem words, so as to produce suitable surface words. Algorithm 4.4 shows the construction of the root or stem and prefix CRF model.

Algorithm 4.4: Construct the Stem with Prefix CRF Model
BEGIN
  INPUT tagged-training-corpus
  FOR each sentence in tagged-training-corpus:
    ADD each word in the sentence to a list, words
    FOR i in RANGE 0 to length of the list (words):
      WRITE (words[i][0], words[i+1][0], words[i+2][0], words[i+2][1]) to root-prefix-sequence
      // index '0' is the root word and index '1' is the prefix
  READ the root-prefix-sequence file
  FOR each root-prefix-sequence in the file:
    ASSIGN frequency = 0
    IF root-prefix-sequence is new:
      COUNT root-prefix-sequence and ASSIGN it to frequency
      WRITE root-prefix-sequence and frequency to a file
  OUTPUT root-with-prefix CRF model
END

4.3.4 Root or Stem Words with Prefix and Suffix

The frequencies of each root or stem word with its respective prefix and suffixes are identified and kept in their own repository. Based on this information, the most likely suffix for a given root or stem and prefix is predicted. The proposed suffix is used by the morphological generator while producing surface words. Algorithm 4.5 describes the construction of this model.

Algorithm 4.5: Construct the Stem with Prefix and Suffix CRF Model
BEGIN
  INPUT tagged-training-corpus
  FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root, prefix and suffix using the items at index '0', '1' and '2' of the list
    WRITE root-prefix-suffix-sequence to a file
  READ the root-prefix-suffix-sequence file
  FOR each root-prefix-suffix-sequence in the file:
    ASSIGN frequency = 0
    IF root-prefix-suffix-sequence is new:
      COUNT root-prefix-suffix-sequence and ASSIGN it to frequency
      WRITE root-prefix-suffix-sequence and frequency to a file
  OUTPUT root-with-prefix-and-suffix CRF model
END

4.3.5 Stem Words with Tense

In Afaan Oromo, verbs are morphologically the most complex words, with many inflectional forms; numerous words of other POS categories are derived primarily from verbs. Generating syntactically and semantically correct sentences requires an appropriate choice among the different verb forms. Moreover, a verb has a long-distance dependency on the subject. For instance, in the simple sentence "inni kaleessa dhufe", "inni" is the subject, which conveys grammatical information such as person and number; "kaleessa" is an adverb, with only limited derivational forms; and "dhufe" is a verb with a number of inflectional forms and context information representing tense, voice, case, number and gender, which must agree with the subject. Furthermore, Afaan Oromo sentences are mostly written in SOV order, so to predict the next word, especially a verb, we need to consider the subject of the sentence. Thus, to extract the relationship between the stem form and the suffix that indicates tense, we constructed a CRF model of stem forms with tense. Here, the frequency of each stem word with its respective tense is recorded. Perfective, imperfective, gerundive, and imperative/jussive are the possible tense categories. Based on this information, the most likely tense-indicating suffix for a given stem will be predicted.

Algorithm 4.6: Construct the Stem with Tense CRF Model
BEGIN
  INPUT tagged-training-corpus
  FOR each word in tagged-training-corpus:
    SPLIT each word by '^' and ADD each item to a list
    EXTRACT root and tense using the items at index '0' and '3' of the list
    WRITE root-tense-sequence to a file
  READ the root-tense-sequence file
  FOR each root-tense-sequence in the file:
    ASSIGN frequency = 0
    IF root-tense-sequence is new:
      COUNT root-tense-sequence and ASSIGN it to frequency
      WRITE root-tense-sequence and frequency to a file
  OUTPUT root-with-tense CRF model
END

4.4 Morphological Analysis of User Input

This module analyses the words accepted from a user and extracts the required morphological features. Context information, such as tense and aspect, is captured from the user's input to predict the appropriate morphological features for the coming stem word. Morphological analysis is applied to each word inserted by the user; the resulting morphological features, POS, affixes, and stem form are stored in a file.

4.5 Word Sequence Prediction

The prediction module predicts the most probable stem words and their morphological features using the language models. It has two components: the stem form predictor and the morphological feature predictor. The stem form predictor estimates the most probable stem forms by computing the probability of the user's input word or words against the stem word CRF models: one model predicts the stem word from the single immediately preceding word, while another predicts it by considering a longer sequence of preceding words.
Finally, the stem form predictor produces a list of the most probable stem forms. The second component, the morphological feature predictor, predicts the probable morphological features for the list of stem forms produced by the stem form predictor. That is, each stem form in the prediction list is checked for its most frequent morphological features in the language model. Finally, the most frequent morphological features are represented in a form that the morphological generator can understand.

4.6 Surface Words

A surface word is a morphologically suitable word that the user intends to type. Surface words are offered to the user using the proposed root or stem words, POS, affixes, and the morphological features obtained as described earlier. Algorithm 4.7 presents an algorithm to produce appropriate surface words.

Algorithm 4.7: Produce Appropriate Surface Words
BEGIN
  INPUT proposed-root-words list
  READ proposed-affix features
  READ proposed-morphological-features list
  FOR each word in the proposed-root-words list:
    IF a root or stem word is proposed:
      CALCULATE features using the proposed affix, morphological features and POS
      GENERATE surface-word given root-word and features
      ADD generated-word to the proposed-surface-words list
  OUTPUT proposed-surface-words, the list of likely next words
END
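To summarize the two-phase flow described above, here is a minimal end-to-end sketch in Python. The toy model tables, function names, and the crude stem-plus-suffix generation step are all illustrative assumptions; the real prototype uses the frequency models built by Algorithms 4.2-4.6 and the Afaan Oromo morphological generator.

    from collections import Counter

    # Toy stand-ins for the trained models (illustrative only).
    stem_bigram = {"gara": Counter({"deem": 3, "dhuf": 1})}      # previous stem -> next-stem counts
    tense_model = {"deem": "perfective", "dhuf": "perfective"}   # stem -> most frequent tense
    suffix_model = {("deem", "perfective"): "e",                 # (stem, tense) -> suffix
                    ("dhuf", "perfective"): "e"}

    def predict_surface_words(prev_stem, k=5):
        """Phase 1: propose stems from context; Phase 2: attach the most
        frequent features; finally generate the surface forms."""
        suggestions = []
        for stem, _ in stem_bigram.get(prev_stem, Counter()).most_common(k):
            tense = tense_model.get(stem, "perfective")
            suffix = suffix_model.get((stem, tense), "")
            # A real system would call the morphological generator here.
            suggestions.append(stem + suffix)
        return suggestions

    print(predict_surface_words("gara"))  # ['deeme', 'dhufe']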