DSpace Institution
DSpace Repository http://dspace.org
Computer Science thesis
2020-05-21
AUTOMATIC SPELLING CHECKER
FOR AMHARIC LANGUAGE
TILAHUN, MELAKU
http://hdl.handle.net/123456789/10846
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE
STUDIES
FACULTY OF COMPUTING
AUTOMATIC SPELLING CHECKER FOR AMHARIC LANGUAGE
MELAKU TILAHUN ASRESS
BAHIR DAR, ETHIOPIA
OCTOBER 16, 2017
AUTOMATIC SPELLING CHECKER FOR AMHARIC LANGUAGE
MELAKU TILAHUN ASRESS
A thesis submitted to the School of Research and Graduate Studies of Bahir Dar Institute
of Technology, BDU, in partial fulfillment of the requirements for the degree
of
Master of Science in Computer Science in the Faculty of Computing
Advisor Name: Dr. Tesfa Tegegne
Bahir Dar, Ethiopia
October 2017
DECLARATION
I, the undersigned, declare that the thesis comprises my own work. In compliance with
internationally accepted practices, I have acknowledged and referenced all materials
used in this work. I understand that non-adherence to the principles of academic
honesty and integrity, or misrepresentation/fabrication of any idea/data/fact/source, will
constitute sufficient ground for disciplinary action by the University and can also
evoke penal action from the sources which have not been properly cited or
acknowledged.
Name of the student______________________________ Signature _____________
Date of submission: ________________
Place: Bahir Dar
This thesis has been submitted for examination with my approval as a university
advisor.
Advisor Name: __________________________________
Advisor’s Signature: ______________________________
© 2017
MELAKU TILAHUN ASRESS
ALL RIGHTS RESERVED
Bahir Dar University
Bahir Dar Institute of Technology
School of Research and Graduate Studies
Faculty of Computing
THESIS APPROVAL SHEET
Student:
Melaku Tilahun Asress __________________________ ____________________
Name Signature Date
The following graduate faculty members certify that this student has successfully presented the
necessary written final thesis and oral presentation for partial fulfillment of the thesis
requirements for the Degree of Master of Science in computer science.
Approved By:
Advisor:
Dr.Tesfa Tegegne _____________________ ____________________
Name Signature Date
External Examiner:
Dr.Adane Letta_ _______________________ ____________________
Name Signature Date
Internal Examiner:
________________ _____________________ ____________________
Name Signature Date
Chair Holder:
___________________ _______________________ ____________________
Name Signature Date
Faculty Dean:
___________________ _______________________ ____________________
Name Signature Date
To my mother, father and my wife
ACKNOWLEDGEMENTS
I would like to express my gratitude to my advisor, Dr. Tesfa Tegegne, for his excellent
advice and support throughout the completion of this thesis. I am also grateful to
Mr. Mekonnen Fentaw for his willingness to point out open research areas and to advise
me on how to proceed with the research work, and to Mr. Belisty Yalew, Mr. Fentahun
Mekuriaw, Mr. Bawoke Wondem and Mr. Elias Wondemagegn for their professional guidance
and assistance. I thank all my colleagues at the Department of Computer Science for
their cooperation. Finally, I would like to thank my wife, Yirgalem Tadesse, for her
support and for giving me the time to accomplish the research work.
ABSTRACT
In government and non-government organizations, document preparation is one of
the day-to-day tasks. Spelling errors can occur when people use text processing
applications to produce electronic documents. Some work has been done on Amharic spell
checking, but it does not cover internal inflection of words or repeated words, which is
unsatisfactory for a language with complex morphology. For this reason, it is common to
find Amharic books and newsletters published with misspelled words. In this study, an
attempt has been made to design and implement a spell checker for the Amharic language
that works on inflected Amharic words, including internal inflection and inflection by
reduplication. The design has five components: an input component, a normalization
component, an error detection component, a morphological analyzer component, and a
spelling error correction and suggestion component.
The system has been evaluated with four sets of data. The first and second sets were
taken from the 2009 annual report of the Amhara National Regional State Science,
Technology and Information Communication Commission. The third set was taken from the
Afar Region ICT 2009 annual report, and the fourth from the Harari Region ICT 2009
annual report. The performance of the system was evaluated using precision and recall.
The system was evaluated in five experiments and achieved an overall performance of
97.27%. As recommendations for future work, we suggest detecting and correcting
real-word errors, comparing the edit distance algorithm used for spelling error detection
and correction with other spelling error correction techniques, and integrating this work
with other Amharic NLP work such as automatic spelling error correction and suggestion.
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
CHAPTER ONE
1. INTRODUCTION
1.1. Background
1.2. Motivation
1.3. Statement of the Problem
1.4. Objective of the Study
1.4.1. General Objective
1.4.2. Specific Objectives
1.5. Scope and limitation of the Study
1.5.1. Scope of the Study
1.5.2. Limitation of the Study
1.6. Significance of the Study
1.7. Research Methodology
1.7.1. Literature Review
1.7.2. Data Collection and Preparation
1.7.3. Implementation tools
1.7.4. Performance Evaluation techniques
1.7.5. Organization of the thesis
CHAPTER TWO
LITERATURE REVIEW AND RELATED WORKS
2.1. Literature Review
2.1.1. Amharic text spell checker
2.1.2. Types of Spelling Errors
2.1.3. Core functionalities of spell checkers
2.1.4. Spelling checker tools
2.2. Related works
2.2.1. Nepali Spell Checker
2.2.2. Spelling Checker for Afaan Oromo Language
2.2.3. Spell Checker for Bangla
2.2.4. Spell Checker for Arabic language
2.2.5. Spelling Checker for Amharic Language
CHAPTER THREE
DESIGN AND DEVELOPMENT OF AMHARIC LANGUAGE SPELLING CHECKER
3.1. Amharic Language Spelling Checking
3.1.1. Amharic Language Inflection
3.1.2. Amharic spelling error patterns
3.1.3. Affix Rules Development
3.1.4. Dictionary development
3.1.5. Lexicon lookup
3.2. Design of Amharic Spelling Checker (AMSPCH)
3.2.1. Design Requirements
3.2.2. Architecture of the Amharic Spell Checker
CHAPTER FOUR
EXPERIMENT, RESULT AND DISCUSSION OF AMHARIC SPELL CHECKER
4.1. Introduction
4.2. Prototype
4.2.1. Input word processing
4.2.2. Implementation of spell checker in Open Office using Hunspell
4.3. Experiment result and Discussion
4.3.1. Evaluation Criteria
4.3.2. Experiment
4.3.3. Discussion
CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS
5.1. Conclusions
5.2. Recommendations
REFERENCE
APPENDIX
Appendix A: Sample of Amharic words taken for experiment
Appendix B: Amharic alphabets with their seven orders
Appendix C: Prefix, Infix and suffix lists used in this thesis work
Appendix D: Sample screenshots
Appendix E: Prefix and suffix lists used in this thesis work
Appendix F: Min Edit Distance Algorithm
Appendix G: Steps we followed for configuration, compilation, and execution of Hunspell
Appendix H: Word counter Python code
LIST OF ABBREVIATIONS
BDU – Bahir Dar University
NLP – Natural Language Processing
OCR – Optical Character Recognition
OOo – OpenOffice.org
OS – Operating System
POS – Part of Speech
ANRS – Amhara National Regional State
STICC – Science, Technology and Information Communication Commission
AMSPCH – Amharic Spelling Checker
LIST OF FIGURES
FIGURE 2.1 ARCHITECTURE FOR NEPALI SPELL CHECKER [6]
FIGURE 3.1 SAMPLE LIST OF DICTIONARY
FIGURE 3.3 ARCHITECTURE OF AMHARIC SPELL CHECKER ADOPTED FROM [3]
FIGURE 3.4 ALGORITHM FOR INPUT COMPONENT ADOPTED FROM [3]
FIGURE 3.5 ALGORITHM FOR MORPHOLOGICAL ANALYSIS ADOPTED FROM [8]
FIGURE 4.1 SAMPLE AFFIX RULE
FIGURE 4.3 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 1
FIGURE 4.4 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 3
FIGURE 4.5 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 4
FIGURE 4.6 SAMPLE TEXT SCREEN SHOT OF EXPERIMENT 6
LIST OF TABLES
TABLE 3.1 INFLECTION OF NOUNS BY ADDING SUFFIX “ኦች”
TABLE 3.2 INFLECTION OF NOUNS BY ADDING SUFFIX “ዎች”
TABLE 3.3 INFLECTION OF NOUNS BY REDUPLICATION
TABLE 3.4 INFLECTION OF VERBS
TABLE 3.5 INFLECTION OF TRANSITIVE
TABLE 3.6 NEGATIVE INFLECTION OF VERBS
TABLE 3.7 INTERNAL INFLECTION
TABLE 3.8 AMHARIC PUNCTUATION MARKS
TABLE 4.1 EVALUATION RESULT FOR EXPERIMENT 1
TABLE 4.2 EVALUATION RESULT FOR EXPERIMENT 2
TABLE 4.3 EVALUATION RESULT FOR EXPERIMENT 3
TABLE 4.4 EVALUATION RESULT FOR EXPERIMENT 4
TABLE 4.5 EVALUATION RESULT FOR EXPERIMENT 5
TABLE 4.6 EXPERIMENT RESULT SUMMARY
TABLE 4.7 AVERAGE PERFORMANCE CALCULATED FROM OVERALL PERFORMANCE OF EACH EXPERIMENT
TABLE 4.8 EVALUATION RESULT FOR EXPERIMENT 6
CHAPTER ONE
1. INTRODUCTION
1.1. Background
Amharic is the official language of the Federal Democratic Republic of Ethiopia, which
has a population of over 92.21 million. There are 21.6 million native Amharic speakers,
4 million second-language speakers, and 3 million emigrants outside Ethiopia who speak
Amharic, for a total of 28.6 million speakers [2].
Amharic language users prepare documents in Amharic script on a daily basis, and
spelling errors are one of the problems they face. To minimize spelling errors, spelling
checker tools are used in text processing applications, and they have become an
essential part of any text processing software. Different types of spelling checkers
have been implemented in text processing tools for different languages. Most
commercially available word processors include a spell checker, a grammar checker and
even a word list lookup facility as essential parts for several languages such as English,
French, Portuguese, Spanish, and Arabic [3].
For most African languages no spell checkers exist, and for those languages that do have
spell checkers, their adequacy in actual use is questionable [4]. This is also true for
Amharic, the official language of the Federal Democratic Republic of Ethiopia [4].
When people use a spelling checker during document preparation, they save money and
time and can produce better-quality, more acceptable documents. For example, the number
of spelling mistakes in English newspapers has dropped considerably through the use of
text processing tools with spelling correctors [5].
Amharic language users are not benefiting from spelling checking and correcting
tools, because text processing tools are not integrated with Amharic spelling checkers
and correctors. As a result, excessive effort and manpower are required to minimize
misspelled words in written documents (newspapers, books, reports, plans, and other
publications). A spell checker is therefore important for saving time and money and for
producing quality documents. It also reduces the dangerous consequences of mistyped
electronic texts in areas such as courts, health, and the military.
1.2. Motivation
Different spell checking and correcting techniques have been developed and implemented
for languages such as English, Arabic, and Bangla. However, a spell checker and corrector
developed for one language cannot be applied directly to another, because spell
checkers depend on the characteristics of the language. Hence, specific spell checkers
are available for English and some other Latin-script-based languages [6]. Existing word
processing tools support language-specific utilities such as grammar checking,
vocabulary, lexicon, and translation for many languages.
However, the absence of a spell checker and corrector tool for Amharic has made
document preparation difficult: editing and correcting documents takes excessive effort,
document quality is reduced, and time is wasted. As a result, it is common to see
spelling errors in Amharic newspapers and published documents. For example, Figure 1.1
is taken from a published document [7] that has up to 9 misspelled words on a single page.
Figure 1.1. Sample Amharic text taken from a published document
To reduce typing errors, some research work has been done using Hunspell for
uninflected and inflected Amharic words in a Linux OS environment [1, 6]. However, the
previous works did not consider internally inflected and repeated words, for example
ገጣጠመ, ሰባበረ, ቆራረጠ, and ጌጣጌጥ.
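To make the notion of a repeated (reduplicated) verb form concrete, the following is a
minimal Python sketch under stated assumptions, not the rule set developed in this
thesis. It assumes that the first three example words derive from the three-character
base forms ሰበረ, ገጠመ and ቆረጠ (these base forms are supplied here for illustration and
are not listed in the text above), and that the repeated form is obtained by inserting
the fourth-order (-a) form of the second character before it. It relies on the layout of
the Unicode Ethiopic block, where each consonant row starts at a multiple of 8 and the
fourth order sits at offset 3. The function name reduplicate is hypothetical.

```python
def reduplicate(verb: str) -> str:
    """Build a repeated form from a three-character base (illustrative only).

    Assumption: the repeated form inserts the fourth-order (-a) variant of the
    second character in front of it, e.g. the 'ba' form in front of 'be'.
    """
    if len(verb) != 3:
        raise ValueError("this sketch only handles three-character bases")
    second = verb[1]
    if not 0x1200 <= ord(second) <= 0x135A:
        raise ValueError("expected an Ethiopic syllable")
    # Each consonant row in the Ethiopic block starts at a multiple of 8;
    # the fourth order is the character at offset 3 within that row.
    row_start = ord(second) & ~0x7
    fourth_order = chr(row_start + 3)
    return verb[0] + fourth_order + verb[1:]


# Assumed base forms (not given in the text above).
for base in ["ሰበረ", "ገጠመ", "ቆረጠ"]:
    print(base, "->", reduplicate(base))
```

The noun ጌጣጌጥ follows a different pattern and is not covered by this sketch; a real
spell checker would express such patterns as affix rules rather than as code.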
1.3. Statement of the Problem
Most government and private sector organizations in the Federal Democratic Republic of
Ethiopia use Amharic script for document preparation, and many documents are prepared
in day-to-day activities. One of the resulting problems is verifying and editing
documents, whether written by the worker or by someone else.
For English, computers have considerably reduced this problem, since they
automatically detect and correct spelling as well as grammatical mistakes. Because of this,
writers not only save a considerable amount of time and money but have also started
producing relatively better documents. The number of spelling mistakes in English
newspapers has decreased considerably because of the use of automatic spelling correctors
[5].
At present there is no usable Amharic spell checker integrated with any text
processing tool. As described in [4], the reasons Amharic spell checking has not been
developed include the lack of standardization, the complex morphology of the language,
the absence of clearly defined spelling rules, and the existence of multiple characters
that represent the same sound.
There is no Amharic spell checker tool or software with the following features:
1. An Amharic spelling checker that handles internal inflection,
2. An Amharic spelling checker that can check spelling on the fly,
3. A usable Amharic spell checker application,
4. A spell checker program with exhaustive rules that also cover repeated words.
Therefore, it is essential to develop a spelling checker that satisfies the above
criteria and to integrate it into the OpenOffice word processor. Additionally,
language-specific problems such as the lack of standardization can be addressed
temporarily by using available resources such as dictionaries and Amharic books in the
development of the project.
Due to this problem, Amharic language users prefer to write in English rather than
Amharic. Shewangizaw [4] and Mekonnen [5] developed Amharic spelling checkers, and
Gaddisa [8] developed a morphology-based Afaan Oromo spelling checker. All of these works
[1,5,8] used the Hunspell tool, but internal inflection was not considered and the rules
were not exhaustive; for example, compound words such as ጋሻ-አጃግሬ and እራስ-አገዝ were not
covered. Both Mekonnen [5] and Shewangizaw [4] left internal inflection, repeated words,
and exhaustive rules (compound words) as future work. Building on the works mentioned
above, this research therefore tries to cover compound words, incorporate internal
inflection and repeated words, and finally design and implement an Amharic language
spelling checker (AMSPCH).
Thus, this study tries to address the following questions:
• What are the suitable tool and algorithm for designing the system?
• What is the performance of the system?
1.4. Objective of the Study
The general and specific objectives of the study are as follows.
1.4.1. General Objective
The general objective of the study is to design and develop an automatic Amharic
spelling checker for the OpenOffice word processor.
1.4.2. Specific Objectives
To achieve the general objective, the following specific objectives are accomplished:
• Review related documents to understand the concept, identify the gap, and
study the structure of Amharic words and their derivations,
• Develop an Amharic root word dictionary,
• Explore Amharic word formation rules from a root word,
• Design and develop the prototype,
• Evaluate the performance of the system.
1.5. Scope and limitation of the Study
1.5.1. Scope of the Study
The study collects and analyzes related documents, and designs and
implements the Amharic spelling checker by considering internal inflection of
words, repeated words, and compound words. Finally, the spelling checker is integrated
into the OpenOffice word processor and its performance is studied. However, real-word
error checking and correction (i.e., error checking and correction using contextual
information) is outside the scope of the study.
1.5.2. Limitation of the Study
In this study, spelling error correction techniques in other languages were investigated.
However, due to time constraints, we did not develop our own technique for automatic
suggestion and correction of words; instead, the Levenshtein edit distance technique
implemented in Hunspell is adopted. A further limitation, inherited from Hunspell, is
that affix rules work only with the first 65535 Unicode characters.
1.6. Significance of the Study
Government organizations, students, journalists, teachers and basically
anyone who uses Amharic to prepare documents will benefit from this study. The work
is significant in different areas, some of which are listed below:
• For press companies, in preparing Amharic books, journals, newspapers, etc.
• For teaching and learning, in preparing lecture notes, handouts, reports, and
assignments.
• For business organizations, in preparing their routine and regular reports, plans,
etc.
• For governmental organizations, in preparing rules and regulations, etc.
• For anyone who writes a sensitive document (courts, governments, agreements), in
avoiding or reducing dangerous consequences.
1.7. Research Methodology
This study is experiment-based and uses the following methodology and tools.
1.7.1. Literature Review
For proper understanding of the problem and successful completion of this study, relevant
global and local literature such as journal articles, conference papers, reports,
books, manuals and resources from the internet was reviewed. The study builds on
previous research works and literature related to Amharic spelling checkers. We also
reviewed different types of spelling checker tools to identify their pros and cons.
1.7.2. Data Collection and Preparation
Since there is no ready-made Amharic root word dictionary for the Amharic language
spelling checker (AMSPCH), different Amharic electronic documents were collected and
studied to analyze the errors encountered in them, which helps to characterize
different types of spelling errors. The Amharic spell checker's word list was built by
combining an Amharic dictionary, lists of some common names in Amharic, a list of
Ethiopian person names, a list of common places in Ethiopia, a list of abbreviations and
lists of some countries in the world, collected from different books and research works.
1.7.3. Implementation tools
For this study we use the following off-the-shelf components:
• Amharic Unicode fonts, used to type Amharic text.
• OpenOffice word processor: although an Amharic spell checker could be integrated into
a closed proprietary word processor such as Microsoft Word, we chose the OpenOffice
word processor because of the accessibility of its tools and code.
• Hunspell: a spell checker and morphological analyzer library and program
designed for languages with rich morphology and complex word compounding or
character encoding.
• Cygwin terminal, used for interfacing.
• A word counter written in Python, used to count the number of words in the dataset;
it is shown in Appendix H.
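As an illustration of the last item, a word counter for Amharic text might look like
the following minimal sketch. This is not the Appendix H code; the function name
count_words and the choice of separators are assumptions made here. Tokens are split on
whitespace, on the Ethiopic punctuation marks (U+1360 to U+1368) and on a few common
ASCII punctuation marks.

```python
import re
from collections import Counter

# Separators: whitespace, Ethiopic punctuation (U+1360-U+1368), and some
# ASCII punctuation. This set is an assumption, not taken from the thesis.
SEPARATORS = re.compile(r"[\s\u1360-\u1368.,;:!?()]+")


def count_words(text: str) -> Counter:
    """Return a frequency table of the tokens found in text."""
    tokens = [token for token in SEPARATORS.split(text) if token]
    return Counter(tokens)


sample = "ኢትዮጵያ\u1363 ኢትዮጵያ\u1362 ሰባበረ"
counts = count_words(sample)
print("total words:", sum(counts.values()))
print("distinct words:", len(counts))
```

Counting distinct words separately from total words is useful when sizing a test set,
because a spell checker is usually evaluated on each word occurrence.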
1.7.4. Performance Evaluation techniques
To measure the performance of the new system, recall and precision were taken as the major
criteria. The test data was collected from the annual reports of different regions.
Valid uninflected and inflected Amharic words were used in addition to
misspelled Amharic words.
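The two criteria can be sketched as follows, assuming the standard information
retrieval definitions applied to error detection: precision is the share of flagged
words that are truly misspelled, and recall is the share of truly misspelled words that
were flagged. The exact counting scheme used in the experiments is defined in Chapter 4
and may differ; the function name and the sample words below are hypothetical.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], misspelled: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of error detection.

    flagged    -- words the spell checker reported as errors
    misspelled -- words that are actually misspelled (ground truth)
    """
    true_positives = len(flagged & misspelled)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(misspelled) if misspelled else 0.0
    return precision, recall


# Hypothetical data: placeholders stand in for Amharic test words.
flagged = {"w1", "w2", "w3", "w4"}
misspelled = {"w1", "w2", "w3", "w5", "w6"}
p, r = precision_recall(flagged, misspelled)
print(f"precision={p:.2f} recall={r:.2f}")
```

Including valid inflected words in the test data matters for precision: a checker with
incomplete affix rules flags correct inflected forms, which lowers precision without
affecting recall.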
1.7.5. Organization of the thesis
The remainder of the thesis is organized as follows.
This introductory chapter gave an overview of the background and statement of the
problem, the objectives, scope and limitations of the study, its significance, and
the methodology used to conduct it.
Chapter 2 reviews the literature and related works. It provides
background information on how spell checkers work and on the types of spelling errors,
and describes spell checker work done for languages such as Nepali,
Bangla, Arabic, and Amharic.
Chapter 3 provides information about the Amharic language and its writing system, the
design requirements for the Amharic spell checker, and the architecture of the designed
spell checker. Chapter 4 deals with the experiments conducted to evaluate the performance
of the spell checker and discusses the results obtained. The last chapter, Chapter 5,
presents the overall conclusions drawn from the studies reported in this thesis,
gives recommendations, and identifies areas open to future research.
CHAPTER TWO
LITERATURE REVIEW AND RELATED WORKS
This chapter discusses the theoretical concepts and types of spelling checkers, the core
functionalities of a spelling checking system, related works on spelling checking, and
the approaches used in developing spelling checker systems.
2.1. Literature Review
2.1.1. Amharic text spell checker
A spell checker is a tool that checks the spelling of the words in a text file,
validates whether they are correctly or incorrectly spelled and, when it has doubts
about the spelling of a word, suggests possible alternatives
[8].
A spell checker operates on a single word at a time. It is either dictionary based or rule
based. A dictionary-based spell checker can be designed in two ways. In the first case, the
dictionary contains all root words and their inflected forms. This is not suitable for
languages with rich morphology such as Amharic and Arabic, where a spell checker must be
able to handle heavy inflection of words. It is easy to develop an Amharic spell checker
from a plain word collection of this kind; however, as stated in [4], it performs worse
and consumes more memory. In the second case, the dictionary contains only root words,
which gives better performance and lower memory consumption. For a spelling checker,
stemming is important for developing a root word dictionary from an existing electronic
dictionary. Stemming is the process of reducing morphological variants of a word to a
common form, particularly by removing prefixes and suffixes. The affix and dictionary
files for Amharic words can be in Ethiopic script or Unicode data.
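The second design can be illustrated with a minimal Python sketch, assuming a toy root
word list and a single suffix. The names ROOTS, SUFFIXES and is_valid are hypothetical,
the root ገበሬ is an example chosen here, and the suffix ዎች is the plural suffix named in
the title of Table 3.2. Hunspell expresses the same idea with affix rules rather than
with code like this.

```python
# Toy root-word lexicon and suffix list (illustrative assumptions only).
ROOTS = {"ገበሬ", "ኢትዮጵያ"}
SUFFIXES = ["ዎች"]


def is_valid(word: str) -> bool:
    """Accept a word if it is a root, or a root followed by a known suffix."""
    if word in ROOTS:
        return True
    for suffix in SUFFIXES:
        # Strip the suffix and look the remaining stem up in the root lexicon.
        if word.endswith(suffix) and word[: -len(suffix)] in ROOTS:
            return True
    return False


for candidate in ["ገበሬ", "ገበሬዎች", "ገበዎች"]:
    print(candidate, is_valid(candidate))
```

Plain suffix stripping only works when the suffix is appended without changing the stem.
With a suffix such as ኦች the vowel fuses with the final consonant of the stem (for
example ቤት becomes ቤቶች), so a real affix rule must also state which characters to strip
and replace.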
2.1.2. Types of Spelling Errors
There are two types of spelling errors: real-word spelling errors and non-word
spelling errors [9].
• Real-word spelling errors
In a real-word spelling error, a word is correctly spelled but not contextually correct
[10]. That is, it is impossible to decide whether the word is wrong
without some contextual information. Spelling errors that result in a token which is
a correctly spelled word, though not the one the user intended [11], are real-word
spelling errors.
• Non-word spelling errors
Non-word spelling errors occur when the user misspells or mistypes a word so that the
result is not a valid word [12]. In our research work, we focus on non-word spelling
errors. As stated in [13], non-word errors are mainly classified into typographic and
cognitive errors.
Typographic Errors
Typographic errors occur when the writer knows the correct spelling of a word but mistypes
it by mistake (for example, ሞራረደ vs. ሞረራደ). These errors are mostly related to the
keyboard shift key.
As stated in [1, 5], typographic errors are classified into four major types:
substitution, deletion, insertion, and transposition errors.
These error types can occur as multi-error or single-error misspellings. Multi-error
misspellings contain more than one instance of an error, whereas single-error
misspellings contain a single instance of an error in the given word. As stated in [14],
the majority (80%) of wrong spellings fall into one of the following four categories:
12
 Single letter insertion, e.g. typing ነኢትዮጵያ for ኢትዮጵያ
 Single letter deletion, e.g. typing ኢዮጵያ for ኢትዮጵያ
 Single letter substitution, e.g. typing ኢተዮጵያ for ኢትዮጵያ
 Transposition of two adjacent letters, e.g. typing ኢዮትጵያ for ኢትዮጵያ
In most cases typographic errors are related to keyboard adjacencies, and the most
common typographic errors are substitution errors. This error type is mainly caused
by replacing a letter with some other letter whose key on the keyboard is adjacent to
the correct letter’s key. As shown in Kukich's study [15], 58% of the errors involved adjacent
typewriter keys for the English language.
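The four single-error categories can be illustrated with a short Python sketch. It is an
illustration only and not part of the cited studies: the function name single_edit_variants
and the tiny alphabet passed to it are assumptions made for the example, and each Ethiopic
character is treated as one letter.

```python
def single_edit_variants(word, alphabet):
    """Return all strings that differ from `word` by exactly one typographic edit."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    insertions = {left + ch + right for left, right in splits for ch in alphabet}
    # The word itself can reappear (e.g. substituting a letter by itself), so drop it.
    return (deletions | transpositions | substitutions | insertions) - {word}


# Hypothetical two-letter alphabet, just large enough for the examples in the text.
variants = single_edit_variants("ኢትዮጵያ", "ነተ")
print(sorted(variants))
```

In a real checker the alphabet would be the full set of Ethiopic characters, which makes the
candidate set much larger than for the Latin alphabet.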
Cognitive errors
Cognitive errors, also called orthographic errors [13], occur when the writer does
not know or has forgotten the correct spelling of a word. It is assumed that in the case of
cognitive errors, the misspelling results from mistaking the pronunciation of the correct
word, especially for foreign words used in Amharic (e.g., ኢንፎርሸሚን -> ኢንፎርሜሽን,
ኮርኮሬሽን -> ኮርፖሬሽን).
2.1.3. Core functionalities of spell checkers
Spelling error detection and spelling error correction are the two core functionalities of a
spell checker. Error detection verifies the validity of a word in the language, while
error correction suggests corrections for the misspelled word [16].
According to [14], spelling error correction is either interactive or automatic.
An interactive spellchecker can suggest more than one alternative correction for each error,
and the user selects one of the suggestions as the replacement. In automatic correction, the
spellchecker decides on the single best correction, and the misspelled word is automatically
replaced with it.
Automatic error correction is required for speech processing and Natural
Language Processing (NLP) related systems where human intervention is not possible [14].
The spell checking process can generally be divided into three steps: detecting errors,
finding corrections and ranking corrections. Detection and correction are discussed above;
ranking is the listing of suggested corrections in decreasing order of their likelihood of
being the intended word.
2.1.3.1. Spelling Error Detection
Spelling Error Detection is verifying the validity of a word in the language and includes
identification of misspelled words and flagging of misspelled words using different
detection algorithms.
The two main approaches for non-word error detection are dictionary lookup and n-gram
analysis method [17].
Dictionary Lookup Technique
The dictionary lookup technique checks whether every word of the input text is present in
the dictionary. If the word is present in the dictionary, then it is a correct word; otherwise
it is an error word or misspelled word. The most common technique for gaining fast access to
a dictionary is the use of a hash table. To look up an input string, one simply computes
its hash address and retrieves the word stored at that address in the pre-constructed hash
table. If the word stored at the hash address is different from the input string or is null, a
misspelling is indicated [18].
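A minimal sketch of hash-based dictionary lookup is given below. Python's built-in set is
implemented as a hash table, so membership tests behave as described above. The function name
find_nonwords, the two-word lexicon and the tokenization rule (runs of Ethiopic syllables in
the range U+1200 to U+135A) are assumptions made for this illustration.

```python
import re

# Runs of Ethiopic syllable characters; Ethiopic punctuation such as "።" is excluded.
ETHIOPIC_WORD = re.compile(r"[\u1200-\u135A]+")


def find_nonwords(text, lexicon):
    """Return the tokens of `text` that are not present in the hash-based lexicon."""
    return [token for token in ETHIOPIC_WORD.findall(text) if token not in lexicon]


# Hypothetical lexicon; a real one would hold thousands of entries.
lexicon = {"ኢትዮጵያ", "ጤና"}
print(find_nonwords("ኢትዮጵያ ኢዮጵያ ጤና።", lexicon))
```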
The challenges of this approach are as follows.
A lexicon containing all correct words could be extremely large, resulting in a need for more
space and in inefficient search times, and for morphologically complex languages it is
practically impossible to list all correct words. So, instead of storing each word as it is in
the lexicon, rules can be applied to reduce a given word to its root word. This
can be done by storing only root words in the lexicon together with prefix, infix and suffix
information. Rules can then be applied to the root words of the lexicon, using the prefix,
infix and suffix information, to generate derived words.
N-gram Analysis Technique
The N-gram analysis or independent spelling error detection method does not use a wordlist
or lexicon; instead it uses statistical means to detect misspelled words [4].
This method works by taking a large corpus of text in the desired language and compiling
the character n-grams that occur in it. A character n-gram is a sequence of characters,
where n is the number of letters in the sequence.
One, two and three letter n-grams are often referred to as unigrams, bigrams and trigrams,
respectively. An example of a trigram analysis of the word ኢትዮጵያ would give the 3-gram
set {ኢትዮ, ትዮጵ, ዮጵያ}. By using this technique, strings that contain unusual n-grams can
be identified as possible spelling errors. N-gram techniques usually require a large corpus
or lexicon training data so that an n-gram table of possible combinations of letters can be
compiled.
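The n-gram idea can be sketched in a few lines of Python. The function names and the one-word
training list are assumptions made for this illustration; a real n-gram table would be
compiled from a large corpus.

```python
def char_ngrams(word, n):
    """Return the list of character n-grams of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_table(words, n):
    """Collect every n-gram that occurs in the training words."""
    return {gram for word in words for gram in char_ngrams(word, n)}


def has_unusual_ngram(word, table, n):
    """Flag a word as a possible misspelling if it contains an unseen n-gram."""
    return any(gram not in table for gram in char_ngrams(word, n))


print(char_ngrams("ኢትዮጵያ", 3))
table = build_ngram_table(["ኢትዮጵያ"], 2)
print(has_unusual_ngram("ኢዮትጵያ", table, 2))
```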
According to [15], the n-gram analysis technique is very useful for detecting errors that occur
in machine-generated texts such as texts generated by OCR. Its main advantage is that it
works without a lexicon. However, for human-generated errors, most spell checkers rely
on dictionary lookup for error detection, and some applications use a hybrid of these two
methods. To use the dictionary lookup technique, care is needed regarding the lexicon size
and the use of an efficient lookup algorithm.
2.1.3.2. Spelling error correction
Non-word spelling error correction is the process of detecting incorrectly spelled words in
a text and providing suggestions for them. A spellchecker can suggest one or more corrections
for each error, and the user selects the best word from the list to replace the misspelled word.
Non-word spelling error correction can be done without considering contextual information,
which is called isolated word error correction [14].
The isolated word error correction approach is very helpful for handling non-word spelling
errors. In this approach, knowledge about error patterns is very useful. Most misspellings
are within one or two characters in length of the correct word. While searching for the
correct spelling, we do not usually need to look at words with a greater difference in
length, especially more than two characters. Kukich [15] also mentioned that the number of
errors occurring at the beginning of a word is minimal. As the probability of an error in
the first letter of a word is low, the process of error correction can be sped up by
concentrating on the remaining letters of the word.
Generally Isolated Word Error Correction techniques can be divided into following
subcategories [14]:
1. Edit distance techniques
2. Similarity Key techniques
3. Probabilistic Techniques
4. N-Grams Based Techniques
5. Phonetics based techniques
Minimum Edit Distance Technique
Edit distance is one of the most effective techniques for generating alternatives for wrongly
spelled words. In this approach, the word containing the spelling mistake is compared to every
word in the dictionary, and the operations (insertions, deletions, substitutions and
transpositions) needed to turn it into each dictionary word are counted.
The total number of such operations is referred to as the distance. The minimum edit
distance is the minimum number of operations (insertions, deletions and substitutions)
required to transform one text string into another [19]. In its original form, a minimum edit
distance algorithm requires n comparisons between the misspelled string and a
dictionary of n words [15]. After comparison, the words with the minimum edit
distance are chosen as correct alternatives. Several algorithms compute such distances,
including the Levenshtein, Hamming and Longest Common Subsequence algorithms.
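As an illustration, the Levenshtein variant of minimum edit distance (insertions, deletions
and substitutions, each with cost 1) can be sketched in Python as below, together with a
simple ranking of dictionary words by distance. The function names, the max_distance
threshold of 2 and the sample lexicon are assumptions made for this sketch.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions between two strings."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix of source
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(
                min(
                    previous[j] + 1,  # delete source_char
                    current[j - 1] + 1,  # insert target_char
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def rank_suggestions(misspelled, lexicon, max_distance=2):
    """Compare the misspelling with every lexicon word; return close words, nearest first."""
    scored = sorted((levenshtein(misspelled, word), word) for word in lexicon)
    return [word for distance, word in scored if distance <= max_distance]


print(levenshtein("ኢዮጵያ", "ኢትዮጵያ"))
print(rank_suggestions("ኢዮጵያ", ["ጤና", "ኢትዮጵያ"]))
```

Note that plain Levenshtein distance counts a transposition of two adjacent letters as two
operations; variants that count it as one exist but are not shown here.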
Similarity key technique
The similarity key technique maps every string into a key such that similarly spelled strings
have similar keys. Thus, when the key is computed for a misspelled string, it provides
a pointer to all similarly spelled words in the lexicon [4].
Rule Based Technique
Rule based techniques are algorithms that attempt to represent knowledge of common
spelling error patterns in the form of rules for transforming misspellings into valid words.
The candidate generation process consists of applying all applicable rules to a misspelled
string and retaining every valid dictionary word that results [20].
Probabilistic Techniques
Two types of probabilistic technique have been exploited: transition probabilities and
confusion probabilities. Transition probabilities represent the probability that a given
letter will be followed by another given letter. Confusion probabilities are estimates of how
often a given letter is mistaken for, or substituted by, another given letter. Confusion
probabilities are source dependent: because different optical character recognition (OCR)
devices use different techniques and features to recognize characters, each device will have
a unique confusion probability distribution.
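Transition probabilities can be estimated from a word list by counting which letter follows
which. The sketch below is a simple maximum-likelihood estimate; the function name and the
two-word sample are assumptions made for this illustration, and confusion probabilities are
not covered because they require device-specific error data.

```python
from collections import Counter, defaultdict


def transition_probabilities(words):
    """Estimate P(next letter | current letter) from adjacent letter pairs in `words`."""
    follower_counts = defaultdict(Counter)
    for word in words:
        for current, following in zip(word, word[1:]):
            follower_counts[current][following] += 1
    return {
        current: {
            following: count / sum(followers.values())
            for following, count in followers.items()
        }
        for current, followers in follower_counts.items()
    }


probabilities = transition_probabilities(["ቆረጠ", "ቆመጠ"])
print(probabilities["ቆ"])
```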
N-gram Based Techniques
Letter n-grams, including tri-grams, bi-grams and unigrams have been used in a variety of
ways in text recognition and spelling correction techniques. They have been used by OCR
correctors to capture the lexical syntax of a dictionary and to suggest legal corrections.
Phonetics based techniques
These techniques work on the phonetics of the misspelled string. The target is to find such
a word in dictionary that is phonetically closest to the misspelling [14].
2.1.4. Spelling checker tools
Ispell, Aspell, MySpell and Hunspell are some of the open source spell checker tools integrated
with different open source word processors such as LibreOffice and OpenOffice [5].
Ispell is a spelling checker for UNIX that supports most Western languages
(English (United Kingdom), English (United States), French, German, and Spanish). It
offers several interfaces, including a programmatic interface for use by editors such as
Emacs (a popular text editor used mainly on Unix-based systems by programmers,
scientists, engineers, students, and system administrators). Ispell only suggests corrections
that are based on Levenshtein distance. It will not attempt to guess more distant
corrections based on English pronunciation rules. The generalized affix description system
introduced by Ispell has been imitated by other spelling checkers such as MySpell [2].
Like most computerized spelling checkers, Ispell works by reading an input file word by
word, stopping when a word is not found in its dictionary. Ispell attempts to generate a list
of possible corrections and presents the incorrect word and any suggestions to the user,
who can then choose a correction, replace the word with a new one, leave it unchanged, or
add it to the dictionary [5].
Another open source spelling checker tool is Aspell, a spell checker program designed
to replace Ispell. Its primary advantage over Ispell and other existing spell checkers is that
it does a better job of suggesting possible replacements for a misspelled word. Aspell can
also spell check UTF-8 encoded documents without the use of an additional dictionary. Aspell
includes support for multiple dictionaries at once, which Ispell does not. MySpell is a
spellchecker based on Ispell. MySpell is used by OpenOffice.org and Firefox/Mozilla and
works on both Windows and Linux [5].
Hunspell is the next generation of MySpell. It has been improved in order to
support additional features for different languages, especially Hungarian, as
well as other languages such as German and Turkish [21].
In general, Hunspell is a spell checker and morphological analyzer library and program
designed for languages with rich morphology and complex word compounding or character
encoding. Hunspell is an attractive spell checker for many languages, such as Amharic and
Arabic, because of the following features [5]:
• Unicode support
• Morphological analysis and stemming
• Support for complex compounding
• Support for language-specific features
• Handling of conditional affixes, circumfixes, forbidden words, pseudo roots and homonyms
• Free and open source software
Ispell, Aspell, MySpell and Hunspell all use a dictionary file (.dic) and an affix file (.aff).
The dictionary file (.dic) is a list of words with their corresponding affix rules.
The affix file describes each of the prefix, infix and suffix based rules. An affix is a linguistic
element added to a word to produce an inflected or derived form. An affix can be placed at
the beginning (prefix), middle (infix), or end (suffix) of the root or stem of a word [21].
However, the affix rules used in the spell checkers mentioned above are either prefix, infix or
suffix rules. A spell checker for Amharic electronic text should include features such as
analysis of the rich morphological structure of Amharic and support for Unicode
encoding. The Hungarian spell checker, which is based on Hunspell, is capable of analyzing
the complex morphology of the language and supports Unicode encoding. So in our
study the Hunspell spelling checker tool is used.
2.2. Related works
Different language-specific spell checkers have been developed to address spelling error
problems that arise from the specific nature of a language when documents are prepared
using word processors. In this section we present some of those language-specific spell
checkers. In addition, we look at the sources of spelling error variation for Amharic.
2.2.1. Nepali Spell Checker
Keyboard adjacencies, shift key characters, phonetic similarity, and visual similarity are
indicated as the main causes of spelling mistakes in the Nepali writing system [22].
As shown in Figure 2.1, the Nepali spell checker has three components, namely the
Morphological Preprocessing Module, the Lexicon Lookup/Error Detection Module and the
Suggestion Module.
Each module can be easily incorporated to develop a new spell checker for other languages
and can also be used to devise new techniques and procedures for the Nepali language.
Figure 2.1 Architecture for Nepali spell checker [6]
Lexicon
Because the size of the lexicon is an important factor for the efficiency of a spell checker,
only Nepali root words are stored.
Error Detection Module
The error detection module deals with lookup of the input word in the lexicon. A token (Nepali
word) is the input to the error detection module. This module searches for the input word in
the lexicon; if it is not found, the word is sent to the morphological preprocessing module.
In addition, the module accepts the word as broken down by the morphological preprocessing
module and then searches for it in the lexicon. If the root word is not found in the lexicon,
a spelling error is detected. The module then sends the word to the suggestion module for
correction.
Morphological Preprocessing
Morphologically complex words are broken down into root words in this module, which
are then searched for in the lexicon. To do this, the researchers used morphological rules,
which also reduce the size of the lexicon. The morphological preprocessing module uses a
Nepali Porter stemmer to break down morphologically complex words into roots and affixes.
Spelling Error Correction and Suggestion Module
The suggestion module receives a token when a spelling error is detected. For the purpose of
spelling error correction and suggestion, it uses the edit distance algorithm (more
specifically, the Levenshtein edit distance algorithm).
Evaluation
The researchers used lexical recall (the percentage of valid words correctly
accepted), error recall (the percentage of invalid words correctly flagged),
precision (the percentage of flagged words that were correctly flagged), and suggestion
adequacy (how adequate the correct suggestion is) as evaluation metrics for
their spell checker.
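The first three metrics can be computed from simple counts, as in the Python sketch below.
The function name, the argument names and the sample numbers are assumptions made for this
illustration and are not results from the cited work; suggestion adequacy is omitted because
it depends on how suggestions are scored.

```python
def detection_metrics(valid_accepted, valid_total, invalid_flagged, invalid_total,
                      flagged_total):
    """Compute lexical recall, error recall and precision as fractions between 0 and 1."""
    return {
        # Share of valid words that the checker accepted.
        "lexical_recall": valid_accepted / valid_total,
        # Share of invalid words that the checker flagged.
        "error_recall": invalid_flagged / invalid_total,
        # Share of all flagged words that were really invalid.
        "precision": invalid_flagged / flagged_total,
    }


# Hypothetical counts: 100 valid words (90 accepted, 10 wrongly flagged)
# and 50 invalid words (40 flagged), so 50 words are flagged in total.
print(detection_metrics(90, 100, 40, 50, 50))
```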
2.2.2. Spelling Checker for Afaan Oromo Language
As stated in Gaddisa [8], Afaan Oromo belongs to the Cushitic language family. It is an official
language of Oromiya regional state and it has a very rich morphology. The system is
designed as a dictionary look-up with morphological rules.
The morphological rules for Afaan Oromo address word categories and their possible
inflections, derivation and compounding.
The architecture has eight components: Tokenizer, Knowledge base, Error detection,
Morphological analyzer, Error correction, Morphological generator, Suggestion ranker and
Word assembler. The system uses Latin characters, and its word inflection differs
from Amharic word inflection. In the system, the Levenshtein edit distance algorithm is used
to rank the suggestions. Finally, the suggested word with the shortest distance to the
misspelled word is considered the best suggestion.
The research used accuracy, performance, precision, and recall to evaluate the spell
checker. Accuracy measures how well the prototype produces suggestions for the generated
errors. Performance measures the efficiency of the prototype in terms of the time it takes to
generate the correct suggested word. Precision and recall, on the other hand, measure the
number of correct suggestions in the total number of spelling suggestions [8].
2.2.3. Spell Checker for Bangla
According to [23], the phonetic similarity of Bangla characters and the difference between
the grapheme representations and the phonetic utterance are the most common reasons for
spelling errors in written Bangla. To produce good suggestions for these spelling
errors, methods based on edit distance and fuzzy string matching algorithms have been
developed for the language.
The research work [24], done by Naushad UzZaman and Mumit Khan from BRAC
University, presents a double metaphone encoding algorithm for Bangla that can be used
by spell checkers to improve the quality of suggestions for misspelled words in the
language. The researchers presented how this encoding system effectively encapsulates the
complex rules for Bangla and dialectic pronunciation differences that are not possible to
handle by using the traditional edit-distance methods. They compared the proposed Double
Metaphone algorithm with the edit distance based methods in producing suggestions for
misspelled words.
2.2.4. Spell Checker for Arabic language
In 2003, Khaled Shaalan and Amin Allam [25], University of Cairo, developed an Arabic
morphological analyzer. In addition, they devised techniques for spelling error detection
and corrections for Arabic language by investigating common spelling errors in Arabic text
writing.
The researchers analyzed and classified common spelling errors in writing Arabic words as:
• Reading errors: such spelling errors can occur when the writer copies a word from a
written document and confuses characters of the language that are visually similar.
• Hearing errors: such errors can occur when the writer is taking dictation
and recognizes a character as another one. This might result from
pronunciation differences.
• Morphological errors: errors in this category might be made by nonnative
speakers of Arabic or by writers who are not well educated.
• Editing errors: these are the errors most common in other languages [4], such as
insertions, deletions, substitutions, and transpositions.
Their study mainly focused on spelling error correction for isolated words. They
proposed spelling correction methods categorized as ‘add missing character’, ‘replace
incorrect character’, ‘remove excessive character’, and ‘adding a space’ to split a
misspelled word into two or more words.
When adding a missing character, the spell checker adds a character in every possible
position. If the modified word matches a word in the lexicon, the new word is added to
the list of candidates. Similarly, when replacing an incorrect character, their tool replaces
every character with one of its neighbors according to some rule, and if the new word is
found in the lexicon it is added to the list. When adding a space to split words, the tool
adds a space in every possible position, and the newly formed words are added to the
suggestion list when they are found in the lexicon.
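The ‘adding a space’ step can be sketched in Python as follows. This is an illustration of the
idea only, not the cited tool: the function name split_candidates is an assumption, the sketch
splits a word into exactly two parts, and the sample words are Amharic rather than Arabic so
that they match the rest of this document.

```python
def split_candidates(word, lexicon):
    """Try a space at every inner position; keep splits whose two parts are both valid."""
    candidates = []
    for position in range(1, len(word)):
        left, right = word[:position], word[position:]
        if left in lexicon and right in lexicon:
            candidates.append(left + " " + right)
    return candidates


# Hypothetical lexicon with two root words.
print(split_candidates("አጥርግቢ", {"አጥር", "ግቢ"}))
```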
2.2.5. Spelling Checker for Amharic Language
Several researchers have carried out studies on AMSPCH. Among these, a Hunspell-based
typographic Amharic spell checker was developed by Getaneh Woldeyesus at the Graduate School
of Telecommunications and Information Technology, Ethiopia. This work mainly focuses
on how Hunspell, an open source spell checker and morphological analyzer library
originally developed for Hungarian, can be used to provide typographic spell checking for
uninflected Amharic words. The research used the model of Hunspell as a solution for
Amharic spelling error detection and correction [4].
First, the study generates typographic errors for Amharic. The error generation technique
first selects words that have three or more characters from the lexicon, and then randomly
selects a position at which to start the error generation.
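A random error generator of this kind might look like the Python sketch below. It is not the
generator used in [4]: the function name, the choice of the four typographic operations and
the small alphabet are assumptions made for this illustration. Passing a seeded
random.Random instance makes the generated errors repeatable.

```python
import random


def generate_error(word, alphabet, rng):
    """Apply one random typographic error to a word of three or more characters."""
    if len(word) < 3:
        return None  # shorter words are skipped, as in the description above
    position = rng.randrange(len(word))
    operation = rng.choice(["insert", "delete", "substitute", "transpose"])
    if operation == "insert":
        return word[:position] + rng.choice(alphabet) + word[position:]
    if operation == "delete":
        return word[:position] + word[position + 1:]
    if operation == "substitute":
        # Pick a replacement that differs from the original character.
        options = [ch for ch in alphabet if ch != word[position]]
        return word[:position] + rng.choice(options) + word[position + 1:]
    # Transpose: swap the chosen character with its right-hand neighbour.
    position = min(position, len(word) - 2)
    return (word[:position] + word[position + 1] + word[position]
            + word[position + 2:])


rng = random.Random(0)  # fixed seed, an assumption made for repeatability
print([generate_error("ኢትዮጵያ", "ነተለ", rng) for _ in range(3)])
```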
In the implementation part of this work, Hunspell was modified by removing Hungarian-specific
function calls, removing capitalization checking, and truncating the
affixation rules. In addition, a new word list was generated from the original word list to
exclude inflected words.
The research used accuracy, performance, precision, and recall to evaluate the
proposed spell checker for Amharic. Accuracy measures how well the
prototype produces suggestions for the generated errors. Performance measures the efficiency
of the prototype in terms of the time it takes to generate the correct suggested word.
Precision and error recall, on the other hand, measure the number of correct suggestions in
the total number of spelling suggestions [4].
The work was done on uninflected Amharic words, and the errors were generated randomly. First,
the generated errors do not reflect the actual trends of spelling errors in Amharic text
writing. Second, Amharic is a Semitic language with complex morphology. This implies
that we need to consider morphological analysis in developing Amharic spell checkers.
For these reasons, much more effort is needed to study the spelling errors that occur
in Amharic text writing and to work on spell checking for inflected Amharic words,
especially internally inflected Amharic words.
On the other hand, Mekonnen [5] described Amharic words and their inflections, such as the
inflection of nouns, verbs and adjectives, and Shewangizaw [4] identified and studied
Amharic error patterns and listed the affixes of Amharic words for each
Amharic part of speech. The works of Mekonnen [5], Gaddisa [8] and Shewangizaw [4] were
implemented based on the framework of Hunspell, which is the default OpenOffice spell
checker. The spell checker of Shewangizaw [4] checks Amharic text written with another word
processor by manually copying and pasting it into OpenOffice. Its dictionary file
and affix file use Latin script, and it uses transliteration components to convert Amharic
text to Latin and Latin back to Amharic; internal inflection is not considered.
In the study of Mekonnen [5], root words are stored in the dictionary and prefixes and
suffixes in the affix rules, without transliteration. In both [4] and [5] the following
activities were not addressed:
• Management of internal inflection. For example, ቆራረጠ is derived from the root word ቆረጠ:
ራ is added inside the word ቆረጠ before the character ረ.
• Real-word error checking and correction based on contextual information.
• Auto correction, removal of extra spaces and removal of repeated words.
• A font-independent Amharic spell checker: the previous works depend on Unicode and do
not support other Amharic fonts.
• Prefix and suffix rules were not exhaustive.
So in this study, internal inflection of words, compound words, repeated words, prefixes
and suffixes are covered exhaustively.
CHAPTER THREE
DESIGN AND DEVELOPMENT OF AMHARIC LANGUAGE
SPELLING CHECKER
3.1. Amharic Language Spelling Checking
3.1.1. Amharic Language Inflection
As stated in Getahun [26], Amharic parts of speech (POS) are categorized into different
classes, namely Noun, Verb, Adjective, Preposition, and Adverb. These classes can be
inflected for number, gender, case, definiteness, pronoun, tense, and person [5].
In this study we investigate how root words are inflected and develop rules based on the
investigation.
Inflection of nouns
Amharic nouns can be inflected for number, gender, case and definiteness. Nouns are
inflected by adding affixes and by reduplication [2, 18]. Plural nouns are formed
by adding one of two suffixes, “ዎች” or “ኦች”. By adding the suffix “ዎች” the word “በሬ” is
inflected to “በሬዎች”, and by adding the suffix “ኦች” the word “ዶክተር” is inflected to “ዶክተሮች”.
Examples are shown in Tables 3.1 and 3.2.
Table 3.1 Inflection of nouns by adding suffix “ኦች”
Singular (ነጠላ ቁጥር)    Root + suffix    Plural (ብዙ ቁጥር)
ዶክተር    ዶክተር-ኦች    ዶክተሮች
አያት    አያት-ኦች    አያቶች
ቤት    ቤት-ኦች    ቤቶች
ወንበር    ወንበር-ኦች    ወንበሮች
ፍየል    ፍየል-ኦች    ፍየሎች
በግ    በግ-ኦች    በጎች
እግር    እግር-ኦች    እግሮች
እስስት    እስስት-ኦች    እስስቶች
Table 3.2 Inflection of nouns by adding suffix “ዎች”
Singular (ነጠላ ቁጥር)    Root + suffix    Plural (ብዙ ቁጥር)
ገበሬ    ገበሬ-ዎች    ገበሬዎች
እርሻ    እርሻ-ዎች    እርሻዎች
ሸማኔ    ሸማኔ-ዎች    ሸማኔዎች
ነጋዴ    ነጋዴ-ዎች    ነጋዴዎች
በሬ    በሬ-ዎች    በሬዎች
ተማሪ    ተማሪ-ዎች    ተማሪዎች
Another technique for the derivation or inflection of nouns is reduplication, which is done
by repeating the word itself with some modification: the final character of the word is
converted from its sixth-order form into its fourth-order form, and the original word is
then repeated after it. Examples are shown in Table 3.3.
Table 3.3 Inflection of nouns by reduplication
Singular (ነጠላ)    Plural (ብዙ)
ግርድ    ግርዳግርድ
ጌጥ    ጌጣጌጥ
ትል    ትላትል
ብረት    ብረታብረት
ጥሬ    ጥራጥሬ
ሸቀጥ    ሸቀጣሸቀጥ
ጨርቅ    ጨርቃጨርቅ
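The suffix and reduplication patterns of Tables 3.1 to 3.3 can be sketched in Python by
working on Unicode code points. The sketch rests on an assumption: each basic consonant row
of the Unicode Ethiopic block occupies eight consecutive code points starting at U+1200, so
the order of a character is its offset within the row (offset 3 is the fourth order, offset 5
the sixth, offset 6 the seventh). The function names are hypothetical, and irregular nouns
and labialized characters are not handled.

```python
ETHIOPIC_START = 0x1200  # first code point of the Unicode Ethiopic block


def order_offset(char):
    """Offset of a character inside its 8-character consonant row (0 = first order)."""
    return (ord(char) - ETHIOPIC_START) % 8


def with_order(char, offset):
    """Return the character of the same consonant row at the given order offset."""
    return chr(ord(char) - order_offset(char) + offset)


def pluralize(noun):
    """Tables 3.1 and 3.2: merge "ኦች" into a final sixth-order character, else add "ዎች"."""
    last = noun[-1]
    if order_offset(last) == 5:  # sixth order, e.g. ር in ዶክተር
        return noun[:-1] + with_order(last, 6) + "ች"
    return noun + "ዎች"


def reduplicate(noun):
    """Table 3.3: final character in fourth order, followed by the original noun."""
    return noun[:-1] + with_order(noun[-1], 3) + noun


print(pluralize("ዶክተር"), pluralize("በሬ"), reduplicate("ጌጥ"))
```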
Amharic words can also be inflected for gender by adding “ኢት”: the word “በግ” can be
inflected to “በግ-ኢት”, and “ልጅ” can be inflected to “ልጅ-ኢት”. Another form of
noun inflection is based on case, which concerns the usage of the word in a sentence, such as
subject and object. It is done by adding suffixes such as “-ን”, “-ኤ” and “-ህ”. For example,
adding “-ን” to the word “ልጅ” gives “ልጁን”, and adding “-ኤ” gives “ልጅ-ኤ”, which is written “ልጄ”.
The last form of noun inflection is based on definiteness, which is done by adding the
suffixes “ኢቱ”, “ዉ”, “ኡ”, “ዋ” and “ይቱ” [26].
Inflection of Verbs
Verbs have affixes that show the subject and object of a sentence [5]. These affixes are “ሽ”,
“ች”, “ህ”, etc. Verbs are inflected for person, gender, number and tense. The affix that shows
second person, singular, feminine and past tense is “ሽ” in “ሄድሽ”; similarly, the affix that
shows second person, singular, masculine and past tense is “ህ” in “ሄድህ”.
As stated in [5], verbs are the most inflected part of speech in Amharic.
As shown in the first column of Tables 3.4, 3.5 and 3.6, verbs in the perfect
tense, or verbs that indicate third person singular masculine, are considered root words in
this study. All root verbs are inflected to the compound imperfect, gerund, contingent and
infinitive forms. The compound imperfect form is derived from the root word by adding affixes
such as ይ--አል, ት--አለች, ይ--አሉ, ን--አለን, ት--አላችሁ, ት--አለህ, ት--አለሽ, etc. The gerund form of a
verb is obtained by adding the suffix ኦ. The contingent and infinitive forms of a verb are
obtained by adding ይ-- and መ respectively, as shown in Table 3.4. Tables 3.5 and 3.6 depict the
affixes used in transitive verbs and in the negation of verbs. Table 3.7 shows the internal
inflection of words.
Table 3.4 Inflection of verbs
Root    Compound imperfect (ይ--አል)    Gerund (--ኦ)    Contingent (ይ--እ)    Infinitive (መ--እ)
ሄደ    ይሄዳል    ሄዶ    ይሂድ    መሄድ
ቆረጠ    ይቆርጣል    ቆርጦ    ይቁረጥ    መቁረጥ
ሮጠ    ይሮጣል    ሮጦ    ይሩጥ    መሮጥ
Table 3.5 Inflection of transitive verbs
Root    አስ---    ተ---
ገደለ    አስገደለ    ተገደለ
በላ    አስበላ    ተበላ
በላች    አስበላች    ተበላች
Table 3.6 Negative inflections of verbs
Root    አት---ም    አል---ም    አይ---ም    አን---ም
ተማረች    አትማርም    አልማርም    አይማርም    አንማርም
በላች    አትበላም    አልበላም    አይበላም    አንበላም
ገደለ    አትገድልም    አልገድልም    አይገድልም    አንገድልም
በላ    አትበላም    አልበላም    አይበላም    አንበላም
Table 3.7 Internal inflections
Root Inflection
ገጠመ ገጣጠመ
ቆረጠ ቆራረጠ
ሰበረ ሰባበረ
ቀመሰ ቀማመሰ
ቆመጠ ቆማመጠ
ከተፈ ከታተፈ
ገመጠ ገማመጠ
ደረበ ደራረበ
ከመረ ከማመረ
ሰነጠቀ ሰነጣጠቀ
ገነጠለ ገነጣጠለ
ገለጠ ገላለጠ
መረጠ መራረጠ
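The pattern in Table 3.7 inserts the fourth-order form of the second-to-last character in
front of that character. A minimal Python sketch of this rule, and of the reverse step a
spell checker could use to recover the root before dictionary lookup, is given below. It
assumes that each basic consonant row of the Unicode Ethiopic block occupies eight
consecutive code points starting at U+1200; the function names are hypothetical, and this is
not the implementation of the 86 internal inflection rules of this study.

```python
ETHIOPIC_START = 0x1200  # first code point of the Unicode Ethiopic block


def fourth_order(char):
    """Fourth-order form (offset 3) of the consonant row that `char` belongs to."""
    row_start = ord(char) - (ord(char) - ETHIOPIC_START) % 8
    return chr(row_start + 3)


def internal_inflection(root):
    """Insert the fourth-order form of the second-to-last character before it."""
    return root[:-2] + fourth_order(root[-2]) + root[-2:]


def strip_internal_inflection(word):
    """Undo the pattern; return None when the word does not show it."""
    if len(word) >= 4 and word[-3] != word[-2] and word[-3] == fourth_order(word[-2]):
        return word[:-3] + word[-2:]
    return None


print(internal_inflection("ቆረጠ"), strip_internal_inflection("ሰነጣጠቀ"))
```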
Inflection of Adjectives
Adjectives are inflected for number, case and definiteness, in the same way as nouns
are inflected for number, case and definiteness. For
example, by adding ኦች the words ጅል, ብልህ, ጠባብ and ጎበዝ are inflected to ጅሎች, ብልሆች, ጠባቦች and
ጎበዞች respectively.
3.1.2. Amharic spelling error patterns
In addition to the Amharic spelling error patterns presented by Shewangizaw [4] and
Daniel [27], compound words, abbreviations and mistyping are identified as sources of
spelling error variation for Amharic. These sources of error variation and the
trends of spelling errors for Amharic are presented below.
Compound words
The Amharic writing system uses different techniques for writing compound words, and there is
no standard for writing a compound word as two separate words or as a single word [5]. As a
result, we get Amharic words that have the same meaning but are written in different ways.
For example, it is not clear which of these forms is right to use: ሰዉ ሰራሽ or ሰዉ-ሰራሽ,
ወፍ ዘራሽ or ወፍ-ዘራሽ, እጅ አዙር or እጅ-አዙር, ወጥቤት or ወጥ-ቤት, አየርወለድ or አየር-ወለድ, አጥር ግቢ or
አጥር-ግቢ. So in our study we use the compound word writing system of Getahun [26], which
concatenates the two words using a hyphen. Examples: አጥር-ግቢ, ልብስ-ሰፊ, ሰዉ-ሰራሽ, እንጀራ-ጋጋሪ,
ሰርቶ-አዳሪ, አዉቆ-አበድ, ሰርጎ-ገብ, መንፈቀ-ሌሊት, ቤተ-ክርስተያን.
Abbreviations
In English, words are commonly written in abbreviated forms, for example Dr., Mr.,
etc. Similarly, Amharic allows a single word to be written in different abbreviated forms,
so this can be a source of Amharic spelling variation. For example, the phrase
ጠቅላይ ሚኒስትር can be abbreviated as ጠ/ሚ or ጠ/ሚኒስትር, ዶክተር can be written as ዶ/ር, and
ክፍለከተማ can be written as ክ/ከተማ. Therefore, these kinds of words should
be handled in a spellchecker application when a user enters any of them.
Syllographic redundancy
Most Amharic vocabulary originates from the Geez language [4]. However, Amharic does not
preserve Geez’s phonology although it takes over its symbols, so some characters now have
the same pronunciation while each of them keeps its own set of orders. Such
issues are inherent in Amharic symbol redundancy and need to be addressed in the spell
checking process. Example: “አለምፀሀይ” and “አለምጸሀይ” for “ዓለምፀሐይ”.
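One way to handle this redundancy is to normalize same-sounding characters to a single
representative before dictionary lookup. The Python sketch below is an assumption about how
this could be done, not the method of this study: it folds the rows ሐ and ኀ into ሀ, ሠ into ሰ,
ዐ into አ and ፀ into ጸ, and also folds the fourth-order forms ሃ and ኣ into ሀ and አ. The choice
of representatives and the names used are hypothetical.

```python
# First code point of each redundant consonant row, mapped to its representative row.
HOMOPHONE_ROWS = {
    0x1210: 0x1200,  # ሐ row -> ሀ row
    0x1280: 0x1200,  # ኀ row -> ሀ row
    0x1220: 0x1230,  # ሠ row -> ሰ row
    0x12D0: 0x12A0,  # ዐ row -> አ row
    0x1340: 0x1338,  # ፀ row -> ጸ row
}
# Fourth-order forms that sound like the first-order form of the same row.
SAME_SOUND_FORMS = {0x1203: 0x1200, 0x12A3: 0x12A0}  # ሃ -> ሀ, ኣ -> አ


def build_normalization_table():
    """Build a code point mapping usable with str.translate."""
    table = {}
    for source_row, target_row in HOMOPHONE_ROWS.items():
        for order in range(7):  # the seven basic orders of each row
            target = target_row + order
            table[source_row + order] = SAME_SOUND_FORMS.get(target, target)
    table.update(SAME_SOUND_FORMS)
    return table


NORMALIZATION_TABLE = build_normalization_table()


def normalize(word):
    """Replace redundant characters by their chosen representatives."""
    return word.translate(NORMALIZATION_TABLE)


print(normalize("ዓለምፀሐይ"), normalize("አለምፀሀይ"), normalize("አለምጸሀይ"))
```

Whether such folding is acceptable depends on the application, since it treats the spelling
variants as equally correct instead of flagging the non-standard ones.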
Glypheme misidentification
This source of errors is the visual similarity of some Amharic characters. Most of
the time, the characters ‘ው’ and ‘ዉ’, ‘ፖ’ and ‘ፓ’, ‘ዪ’ and ‘ዩ’, ‘ጕ’ and ‘ጒ’, and ‘ቁ’ and ‘ቍ’
are used interchangeably in the Amharic writing system. Because of this visual similarity,
users may simply choose the form that is easiest to write by hand or to type into a computer.
Writing “ነዉ” instead of “ነው” is an example of a glypheme misidentification error.
False Geezims
This type of source of variations in Amharic words occurs due to inserting the wrong letter,
mostly characters that have to be silent. Example: “ምልክት” vs. “ምልእክት”.
Assimilation and Alternations
It is common in Amharic that, ‘ም’ may be exchanged for ‘ን’ before ‘በ’, as in “ሽንብራ” vs
“ሽምብራ”. This is one source of spelling variation in Amharic writing system.
Foreign language transcription
Amharic has some words that are taken from other languages, very often technology terms.
Until some convention emerges, there will be conflicting ways of transcribing
those words. Example: “ኮምፒዩተር” vs. “ኮምፒውተር”.
Dialect variation
Regional dialects can also affect word formation at the basic level, where words are
more likely to be written following their spoken form; “ሆመጠጠ” vs “ኮመጠጠ”, “ሂጂ” vs “ሂጅ”,
“አይዶለም” vs “አይደለም”, and “ዓጤ” vs “ዓፄ” are some examples.
Mistyping
It is common to type misspelled words instead of the correct ones during document
processing. The two common reasons for mistyping Amharic words are that the number of keys
on a keyboard is smaller than the number of symbols, and that different Amharic word
processing tools have different keyboard layouts for inputting Amharic words. In phonetic
based input methods, mistyping comes from “shift-slip”, for example “ቴና” for “ጤና” [4].
3.1.3. Affix Rules Development
As described in Section 2.1.3.1, the drawback of storing all forms of words in the lexicon is that a lexicon containing all correct words could be extremely large. Such a lexicon needs more storage space and a longer search time, and it is practically impossible to list all correct words. To minimize this problem, we stored only root words in the lexicon. An input word from the input component is checked against the lexicon by considering both root and inflected forms through affix rules.
During affix rule development, the prefix, infix and suffix lists were collected manually from different documents.
The dictionary is built from 24649 words. The affix rules are built from 60 prefixes, 1752 suffixes, and 86 rules for internally inflected words. Samples of the prefixes, infixes and suffixes used in this work are listed in Appendix C.
The identified prefix, infix and suffix lists then need to be categorized so that they can be attached to each lexicon entry. According to [26], Amharic parts of speech (POS) can be categorized into five major classes, namely noun, verb, adjective, pronoun, and adverb. Nouns can be inflected for number, gender, case and definiteness. Verbs can be inflected for person, gender, number, mood, and tense. Adjectives are inflected for number, case, and definiteness. Based on this, we categorized the identified suffix lists, and each category was given a unique identifier.
After the suffix and prefix lists are categorized, there must be a rule that indicates how a given word takes a suffix. For example, the suffix “-ዎች” is allowed for the word “በሬ”. Hence, rules were developed to handle such cases.
3.1.4. Dictionary development
The Amharic root word dictionary is compiled from different sources. Amsalu Aklilu [28], the Concise Amharic Dictionary (Amharic to English and English to Amharic) [29], ጌታሁን አማረ [26] and ባየ ይማም [20] are taken as the base dictionaries, as they contain the part of speech for many words, compound words and phrases.
While developing the dictionary, this study follows these steps:
• Remove inflected words from the dictionary
• Remove phrases made of two or more words
• Add some verbs that are not available in the Amsalu Aklilu dictionary or in the Concise Amharic Dictionary (Amharic to English and English to Amharic)
• Add country names and common person names
• Normalize dictionary entries
• Append rules to each word in the dictionary, partially based on Amharic part of speech
• Produce a text file consisting of a list of Amharic words, one per line
As shown in Figure 3.1, words are listed one per row, each followed by the identifiers of the affix rules that apply to that word. The first line gives the approximate number of words. Any word in the dictionary is followed by a forward slash and zero or more flag identifiers. The output is a file (am_ET.dic) with the .dic extension, which is an input to the Amharic spell checker program.
As can be seen in Figure 3.1, the first entry, the number 6, indicates the estimated number of root words in the dictionary. The Amharic words starting from the second line are the root words. The slash after an Amharic word marks the end of the root word and the beginning of the rule identifiers. All characters after the slash are rule identifiers defined in the affix file. The rule identifier MK points to affix rules that attach verb prefixes such as ለ, በ, ከ, የ, ስለ, እንደ, ያል and እስከ.
6
ሱቅ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
ሰበረ/MKIMINPOHN
ዜማ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
ሰለለ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
ሰገረ/MKNIMINASAGAWANMWMOCWCPOAUAEETNNO2WMWN
ቢሮ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
Figure 3.1 Sample list of dictionary
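The dictionary format described above can be read with a few lines of code. The following is a minimal sketch, not the thesis implementation: the function names are hypothetical, and it assumes that every rule identifier is exactly two characters wide (as the identifier MK suggests), which the text does not state explicitly.

```python
def parse_dic_entry(line, flag_width=2):
    """Split 'word/FLAGS' into the word and a list of fixed-width flags.

    flag_width=2 is an assumption based on identifiers such as MK.
    """
    word, _, flags = line.strip().partition("/")
    return word, [flags[i:i + flag_width] for i in range(0, len(flags), flag_width)]


def parse_dic(text):
    """Return (declared_count, {word: flags}) for the contents of a .dic file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])  # first line: approximate number of entries
    entries = dict(parse_dic_entry(ln) for ln in lines[1:])
    return declared, entries


sample = "2\nሰበረ/MKIMINPOHN\nቢሮ\n"
count, entries = parse_dic(sample)
print(count, entries["ሰበረ"], entries["ቢሮ"])
```

A word without a slash simply gets an empty flag list, which matches the statement that a word is followed by zero or more flag identifiers.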
3.1.5. Lexicon lookup
A lexicon lookup implemented as a long linked list may require traversing the list all the way to its end. Checking every element for equality with a given input word is inefficient and slow, especially for a large lexicon. Hence, a hash table dictionary lookup method was implemented for the lexicon lookup process. In this work, we adopted the dictionary lookup algorithm developed by the author of Hunspell [30].
The hash table lookup first calculates a hash value for a given string. This value is obtained by manipulating the bytes of the string [6].
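To illustrate why a hash table avoids the full scan of a linked list, the sketch below hashes the UTF-8 bytes of a word to pick one bucket and searches only that bucket. The class name, the table size and the multiplier 31 are illustrative assumptions; this is not Hunspell's actual hash function.

```python
class HashLexicon:
    """Fixed-size hash table with chaining; keys are words."""

    def __init__(self, size=1024):
        self.buckets = [[] for _ in range(size)]

    def _hash(self, word):
        value = 0
        for byte in word.encode("utf-8"):  # manipulate the bytes of the string
            value = (value * 31 + byte) % len(self.buckets)
        return value

    def add(self, word):
        bucket = self.buckets[self._hash(word)]
        if word not in bucket:
            bucket.append(word)

    def __contains__(self, word):
        # Only one bucket is scanned, not the whole lexicon.
        return word in self.buckets[self._hash(word)]


lexicon = HashLexicon()
for root in ("ሱቅ", "ሰበረ", "ዜማ"):
    lexicon.add(root)
print("ሰበረ" in lexicon, "ሰበረረ" in lexicon)  # True False
```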
3.2. Design of Amharic Spelling Checker (AMSPCH)
The design of a spell checker has to incorporate the general features of a spell checker and language-specific components for the targeted language. Here we discuss the general and language-specific requirements of the designed Amharic spell checker, together with the issues addressed in the design of AMSPCH.
To perform spell checking, we first need to read the input words and present them as tokens; hence, we have introduced an input component. The other components of our spell checker are the normalization component, the error detection component, the morphological analyzer, and the error correction and suggestion component. All of these components are briefly discussed in Section 3.2.2.
3.2.1. Design Requirements
Lexicon lookup speed, the choice of an appropriate technique for detecting and correcting spelling errors, and storage requirements are the general factors that need to be considered in designing a spell checker. Besides these general requirements, one has to consider the language-specific features of the target language. In this case, our spell checker takes into account the following features of the Amharic language that affect the spell checking process.
Morphological variants of words
Amharic is one of the languages with rich morphology. As discussed earlier, one of the tasks in building a spell checker is developing a lexicon. There are two options for developing the lexicon: the first is to store all forms of words in the lexicon and retrieve them from this list, and the second is to keep only root words in the lexicon and create affix rules or an algorithm for validating all acceptable (inflected) words of the language. The first option has two problems: one is performance, and the other is obtaining all forms of words for the given language. Developing affix rules (prefix, infix, and suffix rules) avoids these problems in the spell checking process [6]. Therefore, this study uses the second option.
Encoding issue
Previously, Amharic electronic documents were mostly produced with incompatible software based on different encoding systems. However, software vendors for Amharic word processing have recently started to use Unicode, and Unicode appears to be the preferred encoding for Amharic documents. This study focuses on Amharic documents written using Unicode encoding.
3.2.2. Architecture of the Amharic Spell Checker
Figure 3.2 depicts the architecture of the Amharic spelling checker and presents its components.
Our spell checker is designed to check whether a given Amharic word is correctly typed and to give suggestions for incorrectly typed words. To achieve this goal, five components are introduced.
The components are:
• Input component,
• Normalization Component,
• Error Detection Component,
• Morphological Analyzer Component,
• Error Correction and Suggestion Component.
Of these five components, the Normalization and Morphological Analyzer components have language-specific features and must address the characteristics of the Amharic language that are related to the spell checking process.
Input Component
The input component is responsible for reading characters from OpenOffice and tokenizing them. When the user types a word, the input module reads the characters one by one from the OpenOffice word processor. If the user presses the space bar, types one of the punctuation marks shown in Table 3.8, or pastes an electronic document, the input component tokenizes the text. If the text contains visually or syllographically redundant characters, the input module replaces them with their predefined representatives. After tokenizing the text, it passes each input word to the Normalization component for further processing. The algorithm for this component is presented in Figure 3.3.
Table 3.8 Amharic punctuation marks
Word separator | Period | Comma | Colon | Semicolon | Preface colon | Question mark | Exclamation mark
፡ | ። | ፣ | ፥ | ፤ | ፦ | ? | !
Figure 3.2 Architecture of Amharic spell checker adopted from [3].
Begin
1: Make the token empty
2: Read a character
   2.1 If the character is one of the Amharic punctuation marks, call the Normalization module
       Else if the character is end of file, exit
       Else append the character to the token
3: Move the pointer to the next character, then go to step 2
Figure 3.3 Algorithm for Input component adopted from [3]
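The character-by-character algorithm above can be sketched as follows. This is an illustrative version only: the function name is hypothetical, the punctuation set assumes the Unicode Ethiopic code points for the marks named in Table 3.8, and instead of calling a normalization module it simply collects the finished tokens in a list.

```python
# Marks named in Table 3.8, written with their Unicode Ethiopic code points
# (an assumption of this sketch), plus ASCII '?' and '!'.
AMHARIC_PUNCTUATION = "፡።፣፤፥፦?!"


def tokenize(text):
    tokens, token = [], ""               # step 1: start with an empty token
    for ch in text:                      # steps 2 and 3: read one character at a time
        if ch in AMHARIC_PUNCTUATION or ch.isspace():
            if token:                    # a boundary closes the current token
                tokens.append(token)
            token = ""
        else:
            token += ch                  # otherwise append the character
    if token:                            # end of input: flush the last token
        tokens.append(token)
    return tokens


print(tokenize("ሰላም፡ዓለም። ጤና ይስጥልኝ!"))
```

White space is treated as a boundary as well, since electronic Amharic text usually separates words with spaces rather than with the word separator.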
Normalization Component
As discussed in Section 2.2.5, one source of spelling variation in Amharic is syllographic redundancy, that is, the presence of redundant characters that can be used interchangeably in Amharic words. A spell checker for such a language should address this issue by bringing such words into a common form. Hence, one use of this component is to apply a rule to input words that exhibit syllographic redundancy.
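A normalization rule of this kind can be expressed as a character translation table. The sketch below is an assumption-laden illustration: the choice of which character in each homophone series acts as the representative is not specified in the text and is picked arbitrarily here, and only the ሀ/ሐ/ኀ, ሰ/ሠ, አ/ዐ and ጸ/ፀ series are covered.

```python
# (variant base, representative base) code points of homophone series.
# Which member is the representative is an assumption of this sketch.
SERIES = [
    (0x1210, 0x1200),  # ሐ-series -> ሀ-series
    (0x1280, 0x1200),  # ኀ-series -> ሀ-series
    (0x1220, 0x1230),  # ሠ-series -> ሰ-series
    (0x12D0, 0x12A0),  # ዐ-series -> አ-series
    (0x1340, 0x1338),  # ፀ-series -> ጸ-series
]


def build_table():
    table = {}
    for variant, representative in SERIES:
        for order in range(7):  # the seven basic orders of each series
            table[variant + order] = representative + order
    # In the ሀ and አ families the first and fourth orders sound alike,
    # so the fourth order is folded into the first.
    for base in (0x1200, 0x1210, 0x1280, 0x12A0, 0x12D0):
        table[base + 3] = 0x1200 if base < 0x12A0 else 0x12A0
    return table


TABLE = build_table()


def normalize(word):
    return word.translate(TABLE)


variants = ["ዓለምፀሐይ", "አለምጸሀይ", "አለምፀሀይ"]
print({normalize(v) for v in variants})  # a single normalized form
```

With this table the three spellings given earlier for the same name collapse to one form, so a single lexicon entry can match all of them.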
Error Detection Component
This component accepts the decomposed word from the Morphological Analyzer component and checks whether the returned word exists in the lexicon. If it does not, the error detection component passes the non-word to the Error Correction component so that the user gets a list of suggestions.
The Amharic spell checker is a dictionary-based spell checker. Misspelled words are identified using a dictionary lookup algorithm. First, the token output by the Normalization module is looked up in the dictionary. If it exists, it is a root word and is treated as correctly spelled. Otherwise, it is passed to the morphological analyzer to check whether it is an inflected word. If the morphological analyzer strips affixes attached to a root word, the root word is passed back to the error detection module to check its existence in the dictionary [5].
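The two-step detection flow (direct lookup first, then affix stripping and a second lookup) can be sketched as below. The lexicon, prefix list and suffix list are tiny illustrative subsets, the function names are hypothetical, and the sketch ignores the per-word rule identifiers that restrict which affixes a given root may take.

```python
LEXICON = {"በሬ", "ሱቅ", "ቢሮ"}      # root words (illustrative subset)
PREFIXES = ["ለ", "በ", "ከ", "የ"]    # a few of the prefixes named in the text
SUFFIXES = ["ዎች"]                  # plural suffix from the affix rule example


def candidate_roots(word):
    """Yield the word with at most one prefix and one suffix stripped."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            yield rest[:len(rest) - len(suffix)]


def is_correct(word):
    if word in LEXICON:  # step 1: direct dictionary lookup
        return True
    # step 2: strip affixes and look the remaining root up again
    return any(root in LEXICON for root in candidate_roots(word))


for w in ("በሬ", "በሬዎች", "የበሬዎች", "በሬዎ"):
    print(w, is_correct(w))
```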
Morphological Analyzer Component
As described in Section 2.2.6, Amharic is a morphologically complex language whose basic units are mostly consonantal roots. As a result, all classes of words are highly inflected, and a single word carries a lot of information.
This has to be addressed when designing a spell checker for such a language. In other words, we need to accept all valid inflected words in addition to the root words.
Beyond spell checking, morphological analyzers are used for information retrieval, POS tagging, machine translation, etc. [4]. The task of our Morphological Analyzer component is to accept a word from the error detection component, decompose it into stem and affixes based on predefined Amharic word formation rules, and then pass the resulting stem and affixes to the error detection module.
This morphological analyzer is limited to inflectional morphology. It does, however, consider internal derivational morphology (for example, ፈለገ into ፈላለገ, ገጠመ into ገጣጠመ): such words are treated as internal inflections, and rules are applied to them.
We adopted the morphological analysis method used by Hunspell [30] for words flagged as misspelled, first developing word formation rules for the Amharic language. The details of these rules are presented next.
Input: word I_Word from the Error Detection component
Output: list of affixes and root words
Start
1. Scan the input word from right to left and from left to right to look for valid suffixes and prefixes
   For each valid suffix in I_Word, strip it and store the result in a buffer
   For each valid prefix in I_Word, strip it and store the result in a buffer
// pass the list of affixes and stems to the error detection module
Return root and affixes
Figure 3.4 Algorithm for Morphological Analysis adopted from [8]
Error Correction and Suggestion Component
After the input word is flagged as a non-word, the spelling error has to be corrected, and the user is given a list of suggested words to select from. The error correction and suggestion component was designed to accomplish this task. It takes a word from the Error Detection component, searches the lexicon for all possible corrections, and ranks the resulting list of suggestions.
In this work, Levenshtein edit distance is used for error correction and suggestion. For details, see Appendix F.
The Levenshtein edit distance algorithm works with a threshold value that sets the maximum distance for a word to be included in the suggestion list. Shewangizaw [4] found that single-error words account for 88.8% of the total errors.
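The suggestion step can be illustrated with a standard dynamic-programming Levenshtein distance and a distance threshold. This is a minimal sketch rather than the code in Appendix F: the function names are hypothetical, and the default threshold of 1 is an assumption motivated by the observation that most misspellings contain a single error.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, threshold=1):
    """Lexicon entries within the threshold, closest first."""
    scored = sorted((levenshtein(word, entry), entry) for entry in lexicon)
    return [entry for dist, entry in scored if dist <= threshold]


print(suggest("ቴና", ["ጤና", "ዜማ", "ቢሮ"]))  # ['ጤና']
```

The example uses the shift-slip error “ቴና” for “ጤና” mentioned earlier: only “ጤና” lies within distance 1, so it is the only suggestion.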
CHAPTER FOUR
EXPERIMENT, RESULT AND DISCUSSION OF AMHARIC SPELL
CHECKER
4.1. Introduction
This chapter describes the details of the experiment, based on the sources of error variation, the error detection and correction techniques, and the spelling error trends in Amharic documents.
4.2. Prototype
4.2.1. Input word processing
As stated in Section 3.2.2, the designed Amharic spell checker has an input component that takes a text file as input and applies tokenization and normalization to generate words for the error detection component.
Tokenization is the process of breaking up a given text into units called tokens. The tokens may be words, numbers or punctuation marks. Tokenization can occur at a number of different levels: paragraphs, sentences, words, syllables, or phonemes [31]. This process requires knowledge of the word boundaries or punctuation marks of the given text and of the encoding of the given language. In this work, only words in Unicode encoding are demarcated or tokenized.
As discussed in Section 3.2.2, Amharic has its own punctuation marks that demarcate words, sentences, etc. In electronic documents, however, white space rather than punctuation is used to demarcate Amharic words.
Therefore, tokenization of Amharic text considers all Amharic punctuation marks (i.e. word separator, period, comma, colon, semicolon, preface colon, question mark, and exclamation mark) as well as white space.
4.2.2. Implementation of spell checker in Open Office using Hunspell
As discussed in Section 2.2.6, Hunspell is the default spell checker for OpenOffice.org. It requires two files to define spell checking for a language. The first file is a lexicon containing the words of the language (Amharic words in our case), and the second is an affix file that defines the meaning of the special flags used in the lexicon. The affix file contains the prefix and suffix rules to be associated with the words in the lexicon.
As shown in figure 4.1, the lexicon file (am_ET.dic) contains a list of Amharic root words. The first line of the lexicon gives the approximate number of entries in the file. Each word may optionally be followed by a slash (“/”) and one or more flags, which represent prefix, infix and suffix rules.
An affix file (*.aff) may contain many optional attributes. For example, SET sets the character encoding of the affix and lexicon files. PFX and SFX define prefix and suffix classes, respectively, named with affix flags. The following example describes the structure of a Hunspell affix file.
Affix file:
SET UTF-8
1. SFX OC Y 27
2. SFX OC 0 ዎች [^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ]
3. SFX OC ህ ሆች ህ
4. SFX OC ል ሎች ል
4. SFX OC ቅ ቆች ቅ
-
-
5. SFX AA ፕ ፖች ፕ
Figure 4.1 sample affix rule
As shown in Figure 4.1, ህ, ል, ቅ and ፕ in the third column are the characters that are stripped before affixation. ህ, ል, ቅ and ፕ in the fifth column are conditions: the rule applies only if the word ends with that character. The 0 in the third column of line two indicates that nothing is stripped before the suffix is attached. ሆች, ሎች, ቆች and ፖች are the suffixes attached when the specified condition is fulfilled. [^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ] is a condition that checks that the last character is not one of the listed sixth-order characters of the Amharic script. Figure 4.1 thus shows how to define the rules used to derive the plural forms of nouns.
Amharic verbs are more highly inflected than other Amharic parts of speech. They are inflected for person, tense, gender, and case. Some verb prefixes and suffixes depend on each other. This dependency is controlled by the CIRCUMFIX command.
CIRCUMFIX XX
PFX EE Y 1
PFX EE 0 እን/XX
SFX TA Y 2
SFX TA 0 ለን/EEXX [ሃላማሳራሻቃባታቻናኛዛዣያዳጃጳጻፋፓ]
SFX TA ደ ዳለን/EEXX ደ
If ሄደ/TA and በላ/TA are in the dictionary, እንሄዳለን and እንበላለን are valid inflected words, whereas ሄዳለን, በላለን, እንሰብረ and እንበላ are invalid and are marked as misspelled.
Amharic verbs also take subject and/or object indicator suffixes. For the root word ገደለ, we get the inflected form ገደሉዋቸዉ by adding ዉ, which is the subject indicator, and ኣቸዉ, which is the object indicator.
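The way a suffix rule combines a stripped part, an added part and a condition can be sketched in a few lines. The code below is an illustration of the plural rules discussed above, not Hunspell's implementation: the function name and rule representation are hypothetical, and the example nouns ሱቅ, ክፍል, በሬ and ቤት are chosen for illustration.

```python
import re

# (strip, add, condition) triples modelled on the plural rules discussed above.
SIXTH_ORDER = "ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ"
PLURAL_RULES = [
    ("", "ዎች", "[^" + SIXTH_ORDER + "]"),  # "0" in the rule: nothing is stripped
    ("ህ", "ሆች", "ህ"),
    ("ል", "ሎች", "ል"),
    ("ቅ", "ቆች", "ቅ"),
    ("ፕ", "ፖች", "ፕ"),
]


def apply_suffix_rules(word, rules):
    """Return every form produced by a rule whose condition matches the word ending."""
    forms = []
    for strip, add, condition in rules:
        if re.search("(?:" + condition + ")$", word) and word.endswith(strip):
            stem = word[:len(word) - len(strip)]  # remove the stripped characters
            forms.append(stem + add)              # then attach the suffix
    return forms


for noun in ("ሱቅ", "ክፍል", "በሬ", "ቤት"):
    print(noun, apply_suffix_rules(noun, PLURAL_RULES))
```

The last noun produces no form here because the rule for words ending in ት is among the lines elided from the sample; a complete rule set would cover it.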
In this work, the rule identifiers and their corresponding affixes are listed in Appendix B. Because of their complexity, Amharic verbs need exhaustive rules. The output is a file named am_ET.aff, with the .aff extension, which is an input to the error detection and correction modules.
4.3. Experiment result and Discussion
This section presents the evaluation criteria for the prototype, followed by a description of how the training and testing data was prepared. The results obtained are then presented and discussed.
4.3.1. Evaluation Criteria
The system is evaluated to test its effectiveness. Different research works have proposed various criteria for evaluating a spell checker. Shewangizaw [4], Gaddisa [8] and Mekonnen [5] recommend error recall, precision and suggestion adequacy for the evaluation of a spell checker algorithm.
The performance of the system is evaluated using precision and recall. Precision can be seen as a measure of exactness, whereas recall is a measure of completeness. For information retrieval, precision and recall are defined in [5, 33] as follows. Precision is the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, and recall is the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).
For statistical classification tasks, precision and recall are defined in [5, 33] in the following way.
Precision for a class is the number of true positives divided by the total number of elements labeled as belonging to the positive class [32]:
$$\text{Precision} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Positive}}$$
Recall in this context is the number of true positives divided by the total number of elements that actually belong to the positive class [32]:
$$\text{Recall} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Negative}}$$
• True Positive (TP): the spell checker accepts a correctly spelled word as correct.
• False Positive (FP): the spell checker treats a misspelled word as correctly spelled.
• False Negative (FN): the spell checker flags a correctly spelled word as incorrect.
• True Negative (TN): the spell checker flags a misspelled word as misspelled.
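As a worked example of the two formulas, the sketch below computes precision and recall as percentages from the four counts. The function names are hypothetical; the counts are those reported for Experiment 2 (TP = 804, FN = 20, FP = 0, TN = 16).

```python
def precision(tp, fp):
    """Share of words accepted as correct that really are correct, in percent."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Share of correctly spelled words that the checker accepts, in percent."""
    return 100.0 * tp / (tp + fn)


# Counts reported for Experiment 2.
tp, fn, fp, tn = 804, 20, 0, 16
print(round(precision(tp, fp), 1), round(recall(tp, fn), 1))  # 100.0 97.6
```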
The dataset is taken from different regional reports, which were used to study error trends in Amharic texts. Five sets of test data have been collected from summarized reports of the Amhara National Regional State Science, Technology and Information Communication Commission (ANRS STICC), the Afar Region, and the Harari Region. For misspelled words, the intended valid Amharic word was supplied manually. The evaluation result is presented in table 4.8.
4.3.2. Experiment
The experiment was conducted to measure the effectiveness of the Amharic spell checker. As mentioned in Section 4.3.1, we used precision and recall to measure the accuracy of the Amharic spell checker. Five experiments (1, 2, 3, 4 and 5) were carried out to evaluate the accuracy of the system.
I. Experiment 1
For this experiment, the text was taken directly from the Amhara Science, Technology and Information Communication Commission 2009 annual report. The text was then checked against the lexicon to evaluate the accuracy of the system.
The data for this experiment has 199 Amharic words, of which 5 (2.5%) are misspelled. All of the misspelled words were detected by the system, but one correct word was also marked as misspelled.
All words in the sample data recognized as misspelled by the spell checker are automatically flagged. Figure 4.2 shows a screenshot of the output of the spelling checker on the sample data, and Table 4.1 shows the result of Experiment 1.
Figure 4.2 Sample text screen shot of Experiment 1
Results of Experiment 1
Applying the formulas above, the precision and recall are as shown below.
Table 4.1 Evaluation Result for Experiment 1
Results | Value
True positive (TP) | 193
False negative (FN) | 1
False positive (FP) | 0
True negative (TN) | 5
Precision | TP/(TP+FP)x100 = 193/(193+0)x100 = 100%
Recall | TP/(TP+FN)x100 = 193/(193+1)x100 = 99.4%
II. Experiment 2
To see how the system performs on more data, we increased the amount of text taken from the Amhara National Regional State Science, Technology and Information Communication Commission annual report. The sample data for this experiment has 840 Amharic words, of which 16 (1.9%) are misspelled. All of the misspelled words were detected by the system, but 20 (2.4%) correct words were marked as misspelled by the spelling checker. The evaluation result of Experiment 2 is shown in Table 4.2.
All words in the sample data recognized as misspelled by the spell checker are automatically flagged. Screenshots of the output of the spelling checker on the sample data are shown in Appendix D. As observed in Experiment 2, when the amount of test data increases, the number of false negatives also increases.
Table 4.2 Evaluation Result for Experiment 2
Results | Value
True positive (TP) | 804
False negative (FN) | 20
False positive (FP) | 0
True negative (TN) | 16
Precision | TP/(TP+FP)x100 = 804/(804+0)x100 = 100%
Recall | TP/(TP+FN)x100 = 804/(804+20)x100 = 97.6%
III. Experiment 3
For Experiment 3, the text was taken directly from the Afar Region ICT 2009 annual report. It consists of 181 Amharic words, of which 7 (3.9%) were detected as misspelled. In addition, three correct words were marked as misspelled.
The screenshot is presented in Figure 4.3, and the evaluation result is presented in Table 4.3.
Table 4.3 Evaluation Result for Experiment 3
Results | Value
True positive (TP) | 171
False negative (FN) | 3
False positive (FP) | 0
True negative (TN) | 7
Precision | TP/(TP+FP)x100 = 171/(171+0)x100 = 100%
Recall | TP/(TP+FN)x100 = 171/(171+7)x100 = 96.1%
Figure 4.3 Sample text screen shot of Experiment 3
IV. Experiment 4
As in Experiments 1, 2 and 3, the text was taken directly from a regional report, here the Harari Region ICT 2009 annual report. It consists of 94 Amharic words, of which 9 (9.5%) were detected as misspelled. In addition, 6 correct words were marked as misspelled by the system.
Figure 4.4 shows a screenshot of the output of the spelling checker on the sample data. The precision and recall evaluation result for this experiment is shown in Table 4.4 below.
Table 4.4 Evaluation Result for Experiment 4
Results | Value
True positive (TP) | 79
False negative (FN) | 6
False positive (FP) | 0
True negative (TN) | 9
Precision | TP/(TP+FP)x100 = 79/(79+0)x100 = 100%
Recall | TP/(TP+FN)x100 = 79/(79+9)x100 = 89.8%
Figure 4.4 Sample text screen shot of Experiment 4
V. Experiment 5
As in Experiments 1 to 4, we used precision and recall to measure the accuracy of the Amharic spell checker. To compare the output of the system with a manual check by an expert, the same data used for Experiment 4 was evaluated by a language expert. The expert counted 94 words in total, as in Experiment 4, and identified 7 (7.4%) invalid or misspelled words, two fewer than the system flagged in Experiment 4. He marked no correct word as misspelled, and two words remained unidentified by the expert.
Expert Evaluation
To compare the accuracy of the system, the same dataset was given to the language expert for evaluation. The expert identified 87 correct words and 7 misspelled words, and marked no correct word as misspelled.
The precision and recall evaluation result for this experiment is shown in Table 4.5 below.
Table 4.5 Evaluation Result for Experiment 5
Results | Expert | System
True positive (TP) | 87 | 79
False negative (FN) | 0 | 6
False positive (FP) | 2 | 0
True negative (TN) | 7 | 9
Precision, TP/(TP+FP)x100 | 87/(87+2)x100 = 97.75% | 79/(79+0)x100 = 100%
Recall, TP/(TP+FN)x100 | 87/(87+7)x100 = 92.55% | 79/(79+9)x100 = 89.8%
As can be seen in the table, there is a difference between the system and the expert. The reason for the difference is examined in Experiment 6.
4.3.3. Discussion
The above experiments (1, 2, 3, 4, and 5) are summarized in Table 4.6 below.
Table 4.6 Experiment result summary
Experiment | Total number of words | Total misspelled words | Total detected misspelled words | Total undetected misspelled words | Correct words marked as misspelled | Precision | Recall
1 | 199 | 5 | 5 | 0 | 1 | 100 | 99.4
2 | 840 | 16 | 16 | 0 | 20 | 100 | 97.6
3 | 181 | 7 | 7 | 0 | 3 | 100 | 96.1
4 | 94 | 9 | 9 | 0 | 6 | 100 | 89.8
5 | 94 | 9 | 7 | 2 | 0 | 97.75 | 92.55
The overall performance measure determines how accurate a spelling checker is. It is calculated using the following formula [33], and the results are shown in Table 4.7.
P = (Tp+Tn)/(Tp+Tn+Fp+Fn)
Table 4.7 Average performance calculated from overall performance of each Experiment
Experiment | TP | TN | FP | FN | Precision | Recall | Overall Performance
1 | 193 | 5 | 0 | 1 | 100 | 99.4 | 99.5
2 | 804 | 16 | 0 | 20 | 100 | 97.6 | 97.62
3 | 171 | 7 | 0 | 3 | 100 | 96.1 | 98.34
4 | 79 | 9 | 0 | 6 | 100 | 89.8 | 93.62
5 | 87 | 7 | 2 | 0 | 97.75 | 92.55 | 97.92
Average Performance | 97.4
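The overall performance column and its average can be reproduced from the counts in Table 4.7 with the formula P = (Tp+Tn)/(Tp+Tn+Fp+Fn). The sketch below is only a check of that arithmetic; the variable and function names are hypothetical.

```python
# Experiment: (TP, TN, FP, FN), taken from Table 4.7.
ROWS = {
    1: (193, 5, 0, 1),
    2: (804, 16, 0, 20),
    3: (171, 7, 0, 3),
    4: (79, 9, 0, 6),
    5: (87, 7, 2, 0),
}


def overall(tp, tn, fp, fn):
    """Share of all decisions that were right, in percent."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


scores = {exp: overall(*counts) for exp, counts in ROWS.items()}
average_all = sum(scores.values()) / len(scores)
average_system = sum(scores[e] for e in (1, 2, 3, 4)) / 4  # system runs only

for exp, score in scores.items():
    print(exp, round(score, 2))
print(round(average_all, 1), round(average_system, 2))  # 97.4 97.27
```

The second average covers only Experiments 1 to 4, which were run by the system, and corresponds to the 97.27% figure quoted for the system.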
Across all experiments, 1408 words were taken from different sources. Of these 1408 words, 46 are misspelled; 44 misspelled words were detected, 2 misspelled words went undetected by the language expert, and 30 correct words were marked as misspelled.
As shown in Table 4.7, the precision of the system is higher than that of the manual check. The average recall and precision of the system in Experiments 1, 2, 3 and 4 are 95.75 and 100, respectively, compared to 92.55 and 97.75 for Experiment 5, which was checked manually by the language expert. Based on these results, the system performs better. The overall performance of the system, averaged over Experiments 1 to 4, is 97.27%.
From the experiments, we observe that the Amharic spell checker lacks completeness, as indicated by the recall in all experiments.
We therefore investigated the reason for this lack of completeness by randomly selecting Experiment 4 from the experiments and extending the affix rules in the experiment below.
Experiment 6
We selected Experiment 4 at random. As presented in Experiment 4, the text was taken directly from the Harari Region ICT 2009 annual report and consists of 94 words.
Figure 4.5 below shows a screenshot of the output of the spelling checker after adding affix rules and words to the lexicon. The precision and recall evaluation result for this experiment is presented in Table 4.8 below.
Table 4.8 Evaluation Result for Experiment 6
Results | Value
True positive (TP) | 84
False negative (FN) | 1
False positive (FP) | 0
True negative (TN) | 9
Precision | TP/(TP+FP)x100 = 84/(84+0)x100 = 100%
Recall | TP/(TP+FN)x100 = 84/(84+1)x100 = 98.89%
The overall performance of the system based on Experiment 6 is
P = (Tp+Tn)/(Tp+Tn+Fp+Fn) = (84+9)/(84+9+0+1) x 100 = 98.93%
Compared to Experiment 4, Experiment 6 shows better performance.
As Table 4.8 shows, the number of false negatives was reduced from 6 to 1. Based on this experiment, the reasons for the lack of completeness are:
1. The affix rules defined during development are not exhaustive.
2. Because of the complexity of the Amharic language, not all words are included in the dictionary.
To enhance the performance of the system, these two problems need to be addressed.
Figure 4.5 Sample text screen shot of experiment 6
CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS
5.1. Conclusions
Document preparation is one of the main tasks in governmental and non-governmental organizations. Spelling errors may occur when people use text processing applications. Hence, text processing software has integrated spell checkers and grammar checkers for some languages. For Amharic, however, such tools are not integrated. Thus, it is common to find Amharic books and newsletters published with misspelled words. This research was carried out to design and develop a spell checker tool for Amharic texts. It involved studying the spelling errors that can occur in Amharic writing and developing an Amharic spell checker. In addition, we adopted word formation rules for the Amharic language that can be integrated with the lexicon used by the Amharic spell checker. This lexicon was compiled from ጌታሁን አማረ [26], ባየ ይማም [20], the Concise Amharic Dictionary, a list of names, and a list of countries.
We demonstrated the Amharic spell checker by integrating it into OpenOffice. The spell checker is integrated into the OpenOffice word processor in as-you-type mode, and its design depends on word formation rules and a lexicon. It is a word-level spell checker, specifically a non-word error detector. That is, it does not consider real-word errors, grammatical errors or white space errors. It is a customized version of the Hunspell spell checker, and its algorithms and architecture are inherently dependent on Hunspell.
In this work we added some features that were not addressed in previous works: internally inflected words and the repeated words noted by previous researchers [3, 4] are handled, and the dictionary does not require transliteration when a token is accepted for spell checking or when suggestion lists are generated. Using Unicode data directly is expected to increase spell checking performance by avoiding transliteration. Finally, we measured the performance of the system by running 5 experiments and calculating recall and precision. The overall performance of the system was 97.27%. Recommendations are given in Section 5.2.
5.2. Recommendations
The following recommendations are made for further research and improvement.
• Amharic documents contain real-word errors in addition to non-word spelling errors. Hence, there is a need for detection and correction of real-word errors that can occur in Amharic documents;
• Dialect variation, false Geezims, and assimilation and alternation are sources of spelling variation that are not handled in this thesis. If a method that handles these issues were added to our input component, the performance might improve;
• The performance of the spelling error detection and correction algorithm, which is edit distance, needs to be compared with other spelling error correction techniques;
• Integrating this work with other Amharic NLP applications, such as:
  • Amharic search engine applications
  • Amharic speech synthesis applications
• Automatic spelling error correction and suggestion.
REFERENCES
[1] Shewangizaw Gulilat, Design and Implementation of Spell Checker for Amharic. Addis Ababa, February 2009.
[2] ANRS Plan Commission, "Development Indicator of Amhara National State," p. 83, 2017.
  • 15. xiii TABLE 3. 2 INFLECTION OF NOUNS BY ADDING SUFFIXE “ዎች” ...................... 26 TABLE 3. 3 INFLECTION OF NOUNS BY REDUPLICATION.................................. 26 TABLE 3. 4 INFLECTION OF VERBS.......................................................................... 27 TABLE 3. 5 INFLECTION OF TRANSITIVE ............................................................... 27 TABLE 3. 6NEGATIVE INFLECTION OF VERBS...................................................... 28 TABLE 3. 7 INTERNAL INFLECTION ......................................................................... 28 TABLE 3. 8 AMHARIC PUNCTUATION MARKS...................................................... 37 TABLE 4. 1 EVALUATION RESULT FOR EXPERIMENT 1 ..................................... 47 TABLE 4. 2 EVALUATION RESULT FOR EXPERIMENT 2. .................................... 48 TABLE 4. 3 EVALUATION RESULT FOR EXPERIMENT 3 ..................................... 48 TABLE 4. 4 EVALUATION RESULT FOR EXPERIMENT 4 ..................................... 50 TABLE 4. 5 EVALUATION RESULT FOR EXPERIMENT 5 ..................................... 51 TABLE 4. 6 EXPERIMENT RESULT SUMMERY....................................................... 52 TABLE 4. 7 AVERAGE PERFORMANCE CALCULATED FROM OVERALL PERFORMANCE OF EACH EXPERIMENT ......................................................... 52 TABLE 4. 8 EVALUATION RESULT FOR EXPERIMENT 6 ..................................... 53
  • 16. 1 CHAPTER ONE
1. INTRODUCTION
1.1. Background
Amharic is the official language of the Federal Democratic Republic of Ethiopia, a country with a population of over 92.21 million. Amharic has 21.6 million native speakers and 4 million second-language speakers, and a further 3 million emigrants outside Ethiopia speak it, for a total of 28.6 million speakers [2]. Amharic language users prepare documents in the Amharic script on a daily basis, and spelling errors are one of the problems they face. To minimize spelling errors, spelling checker tools are used in text processing applications, and they have become an essential part of any text processing software. Different types of spelling checkers have been implemented in text processing tools for different languages. Most commercially available word processors include a spell checker, a grammar checker and even a word list lookup facility as an essential part for several languages such as English, French, Portuguese, Spanish and Arabic [3]. For most African languages no spell checker exists, and for those languages that do have one, its adequacy in actual use is questionable [4]. This is also true for Amharic, the official language of the Federal Democratic Republic of Ethiopia [4].
When people use a spelling checker during document preparation, they save money and time and produce documents of better and more acceptable quality. For example, the number of spelling mistakes in English newspapers has dropped considerably through the use of text processing tools with spelling correctors [5]. Amharic language users are not benefiting from spelling checking and correcting tools, because text processing tools are not integrated with Amharic spelling checkers and spelling
  • 17. 2 correctors. As a result, excessive effort and manpower are required to minimize misspelled words in written documents (newspapers, books, reports, plans and other publications). A spell checker is therefore important for saving time and money and for producing quality documents. It also reduces the dangerous consequences of mistyped electronic texts in areas such as courts, health, the military and other related cases.
1.2. Motivation
Different spell checking and correcting techniques have been developed and implemented for languages such as English, Arabic and Bangla. However, a spell checker and corrector developed for one language cannot be applied directly to another, because spell checkers depend on the characteristics of the language. Hence, specific spell checkers are available for English and some other Latin-script-based languages [6]. Existing word processing tools support language-specific utilities such as grammar checking, vocabulary, lexicon and translation for many languages. The absence of a spell checker and corrector for Amharic, however, has made document preparation difficult: editing and correcting documents requires excessive effort, document quality is reduced and time is wasted. As a result, it is common to see spelling errors in Amharic newspapers and published documents. For example, Figure 1.1 is taken from a published document [7] that has up to 9 misspelled words on a single page.
  • 18. 3 Figure 1.1. Sample Amharic text taken from a published document
To reduce typing errors, some research work has been done using Hunspell for uninflected and inflected Amharic words in a Linux OS environment [1, 6]. However, the previous works did not consider internally inflected and repeated words, for example ገጣጠመ, ሰባበረ, ቆራረጠ, ጌጣጌጥ and so on.
  • 19. 4 1.3. Statement of the Problem
Most government and private sector organizations in the Federal Democratic Republic of Ethiopia use the Amharic script for document preparation, and a large number of documents are prepared in day-to-day activities. One of the resulting problems is verifying and editing documents, whether written by the worker or by someone else. For English, computers have considerably reduced this problem, since they automatically detect and correct spelling as well as grammatical mistakes. Because of this, writers not only save a considerable amount of time and money but have also started producing relatively better documents. The number of spelling mistakes in English newspapers has decreased considerably because of the use of automatic spelling correctors [5].
At present there is no usable Amharic spell checker integrated with any text processing tool. As described in [4], the reasons include the lack of standardization and the complex morphology of Amharic, the absence of clearly defined spelling rules, and the existence of more than one character for the same sound. There is no Amharic spell checker tool or software with the following features:
1. An Amharic spelling checker that handles internal inflection,
2. An Amharic spelling checker that can check spelling on the fly,
3. A usable Amharic spell checker application,
4. A spell checker program with exhaustive rules that also cover repeated words.
Therefore, it is indispensable to develop a spelling checker that satisfies the above criteria and to integrate it into the Open Office word processor. Additionally, language-specific problems such as the lack of standardization can be addressed temporarily by using available resources, such as dictionaries and Amharic books, in the development of the project.
  • 20. 5 Due to this problem, Amharic language users prefer English to Amharic. Shewangizaw [4] and Mekonnen [5] developed Amharic language spelling checkers, and Gaddisa [8] developed a morphology-based Afaan Oromo spelling checker. All of these works [1,5,8] used the Hunspell tool, but internal inflection was not considered and the rules were not exhaustive; for example, compound words such as ጋሻ-አጃግሬ and እራስ-አገዝ were not covered. Neither Mekonnen [5] nor Shewangizaw [4] considered internal inflection, repeated words or exhaustive rules (compound words); these were left as future work. Building on the works mentioned above, this research therefore tries to cover compound words exhaustively, to incorporate internal inflection and repeated words, and finally to design and implement an Amharic language spelling checker (AMSPCH). Thus, this study tries to address the following questions:
- What is a suitable tool and algorithm for designing the system?
- What is the performance of the system?
1.4. Objective of the Study
The following are the general and specific objectives of the study.
1.5. General Objective
The general objective of the study is to design and develop an automatic Amharic language spelling checker for the Open Office word processor.
  • 21. 6 1.5.1. Specific Objectives
To achieve the general objective, the following specific objectives are accomplished:
- Review related documents to understand the concept, identify the gap, and study the structure of Amharic words and their derivations,
- Develop an Amharic root word dictionary,
- Explore the rules for forming Amharic words from a root word,
- Design and develop the prototype,
- Evaluate the performance of the system.
1.6. Scope and limitation of the Study
1.6.1. Scope of the Study
The study collects and analyzes related documents, and designs and implements an Amharic language spelling checker that considers internal inflection of words, repeated words and compound words. Finally, the checker is integrated into the Open Office word processor and the performance of the system is studied. Real-word error checking and correction, that is, error checking and correction using contextual information, is outside the scope of the study.
1.6.2. Limitation of the Study
In this study, spelling error correction techniques used for other languages were investigated. However, due to time constraints, automatic suggestion and correction of words was not developed separately; instead, the Levenshtein edit distance error correction technique implemented in Hunspell is adopted. A further limitation, inherited from Hunspell, is that affix rules work only for the first 65535 Unicode characters.
  • 22. 7 1.7. Significance of the Study
Government organizations, students, journalists, teachers and essentially anyone who uses Amharic to prepare documents will benefit from this study. The work is significant in different areas, some of which are listed below:
- For publishers, in the preparation of Amharic books, journals, newspapers and so on,
- For teaching and learning, in the preparation of lecture notes, handouts, reports and assignments,
- For business organizations, in the preparation of their routine and regular reports, plans and so on,
- For governmental organizations, in the preparation of rules, regulations and so on,
- For anyone who writes a sensitive report (for courts, governments or agreements), in avoiding or reducing dangerous consequences.
1.8. Research Methodology
This study is experiment based and uses the following methodology and tools.
1.8.1. Literature Review
For a proper understanding of the problem and successful completion of this study, relevant global and local literature, such as journal articles, conference papers, reports, books, manuals and internet resources, was reviewed. The study builds on previous research works and literature related to Amharic language spelling checkers. We also reviewed different types of spelling checker tools to identify their pros and cons.
  • 23. 8 1.8.2. Data Collection and Preparation
Since there is no ready-made Amharic root word dictionary for the Amharic language spelling checker (AMSPCH), different Amharic electronic documents were collected and studied to analyze the errors encountered in them, which helps to characterize the different types of spelling errors. The word list of the Amharic spell checker was built by combining an Amharic dictionary with lists of common Amharic names, Ethiopian person names, common places in Ethiopia, abbreviations and some countries of the world, collected from different books and research works.
1.8.3. Implementation tools
For this study we use the following off-the-shelf components:
- Amharic Unicode fonts, used to type Amharic text.
- Open Office word processor: although an Amharic spell checker can be integrated into a closed, proprietary word processor such as Microsoft Word, we chose the Open Office word processor because its tools and code are accessible.
- Hunspell tools: Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding.
- Cygwin terminal tool, used for interfacing.
- Word counter Python code, used to count the number of words in the dataset; it is shown in Appendix H.
1.8.4. Performance Evaluation techniques
To measure the performance of the new system, recall and precision were taken as the major criteria. The test data was collected from annual report documents of different regions. Valid uninflected and inflected Amharic words were used in addition to misspelled Amharic words.
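The recall and precision criteria named above can be illustrated with a small Python sketch. The formulas follow the common definitions used for spell checker evaluation (error recall, lexical recall and precision over flagged words); the thesis's own evaluation formulas are not shown in this section, so the function name `evaluate_checker`, its parameters and the toy data are assumptions made purely for illustration.

```python
def evaluate_checker(all_words, misspelled, flagged):
    """Compute spell-checker metrics from three sets of test words.

    all_words  -- every distinct word in the test data (hypothetical input)
    misspelled -- the words that really are misspelled
    flagged    -- the words the checker reported as misspelled
    """
    valid = all_words - misspelled
    true_flags = len(flagged & misspelled)   # errors correctly flagged
    false_flags = len(flagged & valid)       # valid words wrongly flagged
    accepted_valid = len(valid - flagged)    # valid words correctly accepted

    return {
        # share of real errors that the checker caught
        "error_recall": true_flags / len(misspelled) if misspelled else 0.0,
        # share of valid words that the checker accepted
        "lexical_recall": accepted_valid / len(valid) if valid else 0.0,
        # share of flagged words that are real errors
        "precision": true_flags / (true_flags + false_flags) if flagged else 0.0,
    }


if __name__ == "__main__":
    # Toy data only; these are not results from the thesis experiments.
    words = {"w1", "w2", "w3", "x1", "x2"}   # w* valid, x* misspelled
    print(evaluate_checker(words, {"x1", "x2"}, {"x1", "w3"}))
```

Working with sets keeps the sketch short but counts each distinct word once; an evaluation over running text would count tokens instead.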
  • 24. 9 1.8.5. Organization of the thesis
The remainder of the thesis is organized as follows. This introductory chapter gave an overview of the background and statement of the problem, the objectives, the scope and limitations, the significance of the study and the methodology used to conduct it. Chapter 2 reviews the literature and related works: it provides background information on how spell checkers work and on the types of spelling errors, and describes spell checkers developed for languages such as Nepali, Bangla, Arabic and Amharic. Chapter 3 provides information about the Amharic language and its writing system, the design requirements for the Amharic spell checker and the architecture of the designed spell checker. Chapter 4 deals with the experiments conducted to evaluate the performance of the spell checker and discusses the results obtained. The last chapter, Chapter 5, presents the overall conclusions drawn from the studies reported in this thesis. Finally, recommendations are given and areas open to future research are identified in that chapter.
  • 25. 10 CHAPTER TWO
LITERATURE REVIEW AND RELATED WORKS
This chapter discusses the theoretical concepts and types of spelling checkers, the core functionalities of a spelling checking system, related works on spelling checking and, finally, the approaches used in developing spelling checker systems.
2.1. Literature Review
2.1.1. Amharic text spell checker
A spell checker is a tool that checks the spelling of the words in a text file, validates whether they are rightly or wrongly spelled and, when it has doubts about the spelling of a word, suggests possible alternatives [8]. A spell checker operates on a single word at a time. It is either dictionary based or rule based. A dictionary-based spell checker can be designed in two ways. In the first case, the dictionary contains all root words and their inflected forms. This design is not suitable for languages with rich morphology such as Amharic and Arabic: Amharic is highly inflected, and a spell checker should be able to handle this. An Amharic spell checker is easy to develop using uninflected word collections; however, as stated in [4], this approach has lower performance and high memory consumption. In the second case, the dictionary contains only root words. This design has better performance and lower memory consumption.
Stemming is important in a spelling checker for developing a root word dictionary from an existing electronic dictionary. It is the process of reducing the morphological variants of a word to a common form, particularly by removing prefixes and suffixes. The affixes and the dictionary of Amharic words can be stored as Ethiopic script or Unicode data.
  • 26. 11 2.1.2. Types of Spelling Errors
There are two types of spelling errors: real-word spelling errors and non-word spelling errors [9].
- Real-word spelling errors
In a real-word spelling error, a word is correctly spelled but not contextually correct [10]. That is, it is impossible to decide whether the word is wrong without some contextual information. Spelling errors that result in a token that is a correctly spelled word, though not the one the user intended [11], are real-word spelling errors.
- Non-word spelling errors
Non-word spelling errors occur when the user misspells or mistypes a word so that the result is not a valid word [12]. In our research work, we focus on non-word spelling errors. As stated in [13], non-word errors are mainly classified into typographic and cognitive errors.
Typographic Errors
Typographic errors occur when the writer knows the correct spelling of the word but mistypes it by mistake (for example, ሞራረደ vs. ሞረራደ). These errors are mostly related to the keyboard shift key. As stated in [1, 5], typographic errors are classified into four major types: substitution, deletion, insertion and transposition errors. Each of these can occur as a multi-error or a single-error misspelling. Multi-error misspellings contain more than one instance of error, whereas single-error misspellings contain a single instance of an error in the given word. As stated in [14], the majority (80%) of wrong spellings fall into one of the following four categories:
  • 27. 12
- Single letter insertion, e.g. typing ነኢትዮጵያ for ኢትዮጵያ
- Single letter deletion, e.g. typing ኢዮጵያ for ኢትዮጵያ
- Single letter substitution, e.g. typing ኢተዮጵያ for ኢትዮጵያ
- Transposition of two adjacent letters, e.g. typing ኢዮትጵያ for ኢትዮጵያ
In most cases typographic errors are related to keyboard adjacencies, and the most common typographic errors are substitution errors. This error type is mainly caused by replacing a letter with another letter whose key is adjacent to the correct letter's key on the keyboard. As shown in Kukich's study [15], 58% of the errors involved adjacent typewriter keys for the English language.
Cognitive errors
Cognitive errors, also called orthographic errors [13], occur when the writer does not know or has forgotten the correct spelling of a word. It is assumed that in the case of cognitive errors the misspelling results from mistaking the pronunciation of the correct word, especially for foreign words used in Amharic (e.g., ኢንፎርሸሚን -> ኢንፎርሜሽን, ኮርኮሬሽን -> ኮርፖሬሽን).
2.1.3. Core functionalities of spell checkers
Spelling error detection and spelling error correction are the two core functionalities of a spell checker. Error detection verifies the validity of a word in the language, while error correction suggests corrections for the misspelled word [16]. According to [14], spelling error correction is either interactive or automatic. An interactive spell checker can suggest more than one alternative correction for each error, and the user selects one of the suggestions as the replacement. In automatic correction, the spell checker decides on the single best correction and the misspelled word is automatically replaced with it.
  • 28. 13 Automatic error correction is required for speech processing and Natural Language Processing (NLP) related systems where human intervention is not possible [14]. The spell checking process can generally be divided into three steps: detecting errors, finding corrections and ranking corrections. Detection and correction are discussed above; ranking is the listing of the suggested corrections in decreasing order of their likelihood of being the intended word.
2.1.3.1. Spelling Error Detection
Spelling error detection verifies the validity of a word in the language. It includes identifying and flagging misspelled words using different detection algorithms. The two main approaches for non-word error detection are dictionary lookup and n-gram analysis [17].
Dictionary Lookup Technique
The dictionary lookup technique checks whether every word of the input text is present in the dictionary. If the word is present in the dictionary, it is a correct word; otherwise it is an error or misspelled word. The most common technique for gaining fast access to a dictionary is the use of a hash table. To look up an input string, one simply computes its hash address and retrieves the word stored at that address in the pre-constructed hash table. If the word stored at the hash address is different from the input string or is null, a misspelling is indicated [18].
The challenges of this approach are that a lexicon containing all correct words could be extremely large, requiring more space and making search inefficient, and that for morphologically complex languages it is practically impossible to list all correct words. So, instead of storing every word as it is in the lexicon, rules can be applied to reduce a given word to its root word. This can be done by storing only root words in the lexicon together with prefix, infix and suffix information.
Then, rules can be applied to the root words of the lexicon, using the prefix, infix and suffix information, to generate derived words.
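The root-word lookup described above can be sketched in Python as follows. This is a minimal illustration, not the thesis's implementation: the root lexicon `ROOTS` and the two suffix rules in `SUFFIX_RULES` are small hypothetical samples chosen for the example (they are not taken from the thesis's dictionary or affix file), and a Python `set` stands in for the hash table.

```python
# Hypothetical root lexicon; a Python set is backed by a hash table.
ROOTS = {"ቤት", "ገበሬ"}

# Hypothetical suffix rules as (surface ending, ending restored on the root).
# Example: "ቤቶች" ends in "ቶች"; replacing it with "ት" recovers the root "ቤት".
SUFFIX_RULES = [("ቶች", "ት"), ("ዎች", "")]


def candidate_roots(word):
    """Yield the word itself, then every root obtainable by undoing one rule."""
    yield word
    for surface, restore in SUFFIX_RULES:
        if word.endswith(surface):
            yield word[: -len(surface)] + restore


def is_correct(word):
    """Accept a word if it, or one of its candidate roots, is in the lexicon."""
    return any(root in ROOTS for root in candidate_roots(word))


if __name__ == "__main__":
    for w in ["ቤት", "ቤቶች", "ገበሬዎች", "ቤቶቸ"]:
        print(w, is_correct(w))
```

Because every rule is tried on every word, this sketch also accepts some invalid combinations (for example a root joined with a suffix it never takes). Hunspell avoids this by attaching flags to each dictionary entry that name the affix rules the entry may use.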
  • 29. 14 N-gram Analysis Technique
The n-gram analysis or independent spelling error detection method does not use a word list or lexicon; instead it uses statistical means to detect misspelled words [4]. The method takes a large corpus of text in the desired language and computes character n-grams from it. A character n-gram is a sequence of characters, where n is the number of letters in the sequence. One-, two- and three-letter n-grams are often referred to as unigrams, bigrams and trigrams, respectively. For example, a trigram analysis of the word ኢትዮጵያ gives the 3-gram set {ኢትዮ, ትዮጵ, ዮጵያ}. With this technique, strings that contain unusual n-grams can be identified as possible spelling errors. N-gram techniques usually require a large corpus or lexicon as training data so that an n-gram table of possible letter combinations can be compiled. According to [15], the n-gram analysis technique is very useful for detecting errors in machine-generated texts such as those produced by OCR. Its main advantage is that it works without a lexicon. For human-generated errors, however, most spell checkers rely on dictionary lookup for error detection, and some applications use a hybrid of the two methods. When using the dictionary lookup technique, care is needed regarding the lexicon size and the use of an efficient lookup algorithm.
2.1.3.2. Spelling error correction
Non-word spelling error correction is the process of detecting incorrectly spelled words in a text and providing suggestions for them. The spell checker can suggest one or more corrections for each error, and the user selects the best word from the list to replace the misspelled word. Non-word spelling error correction can be done without considering contextual information, which is called isolated word error correction [14].
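The character n-gram analysis described earlier in this section can be sketched in Python. The first function reproduces the trigram example for ኢትዮጵያ; the other two show, under simplifying assumptions, how a table of n-grams seen in a corpus could be used to flag words containing unseen n-grams. The function names and the one-word "corpus" are hypothetical, and a practical system would use n-gram frequencies from a large corpus rather than a plain seen/unseen test.

```python
def char_ngrams(word, n):
    """Return the character n-grams of a word, in order of occurrence."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_ngram_table(corpus_words, n=2):
    """Collect every character n-gram that occurs in the corpus words."""
    table = set()
    for word in corpus_words:
        table.update(char_ngrams(word, n))
    return table


def has_unseen_ngram(word, table, n=2):
    """Flag a word as suspicious if any of its n-grams is absent from the table."""
    return any(gram not in table for gram in char_ngrams(word, n))


if __name__ == "__main__":
    print(char_ngrams("ኢትዮጵያ", 3))
    table = build_ngram_table(["ኢትዮጵያ"], n=2)   # toy one-word corpus
    print(has_unseen_ngram("ኢዮትጵያ", table))     # transposed letters
```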
  • 30. 15 The isolated word error correction approach is very helpful for handling non-word spelling errors. In this approach, knowledge about error patterns is very useful. Most misspellings are within one or two characters in length of the correct word, so while searching for the correct spelling we do not usually need to look at words whose length differs by more than two characters. Kukich [15] also mentioned that the number of errors occurring at the beginning of a word is minimal. As the probability of an error in the first letter of a word is low, error correction can be sped up by concentrating on the remaining letters of the word. Generally, isolated word error correction techniques can be divided into the following subcategories [14]:
1. Edit distance techniques
2. Similarity key techniques
3. Probabilistic techniques
4. N-gram based techniques
5. Phonetics based techniques
Minimum Edit Distance Technique
Edit distance is a highly effective technique for generating alternatives for wrongly spelled words. In this approach the word containing the spelling mistake is compared to every word in the dictionary, and operations such as insertion, deletion, substitution and transposition are performed to turn the word into each dictionary word. The total number of such operations is referred to as the distance. The minimum edit distance is the minimum number of operations (insertions, deletions and substitutions) required to transform one text string into another [19]. In its original form, a minimum edit distance algorithm requires a comparison between the misspelled string and each of the n words in the dictionary [15]. After comparison, the words with the minimum edit distance are chosen as the correct alternatives. Minimum edit distance can be computed with different algorithms, including the Levenshtein, Hamming and Longest Common Subsequence algorithms.
Similarity key technique
  • 31. 16 The similarity key technique maps every string into a key such that similarly spelled strings have similar keys. Thus, when the key is computed for a misspelled string, it provides a pointer to all similarly spelled words in the lexicon [4].
Rule Based Technique
Rule-based techniques are algorithms that attempt to represent knowledge of common spelling error patterns in the form of rules for transforming misspellings into valid words. The candidate generation process consists of applying all applicable rules to a misspelled string and retaining every valid dictionary word that results [20].
Probabilistic Techniques
Two types of probabilities have been exploited in probabilistic techniques. Transition probabilities represent the probability that a given letter will be followed by another given letter. Confusion probabilities are estimates of how often a given letter is mistaken for, or substituted by, another given letter. Confusion probabilities are source dependent: because different optical character recognition (OCR) devices use different techniques and features to recognize characters, each device has a unique confusion probability distribution.
N-gram Based Techniques
Letter n-grams, including trigrams, bigrams and unigrams, have been used in a variety of ways in text recognition and spelling correction techniques. They have been used by OCR correctors to capture the lexical syntax of a dictionary and to suggest legal corrections.
Phonetics based techniques
These techniques work on the phonetics of the misspelled string. The goal is to find the dictionary word that is phonetically closest to the misspelling [14].
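The minimum edit distance technique described earlier in this section can be sketched in Python with the standard Levenshtein dynamic programming recurrence (insertion, deletion and substitution, each with cost 1). The `suggest` helper, its `max_distance` parameter and the two-word lexicon are assumptions made for illustration; this is not the suggestion code used inside Hunspell.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2):
    """Rank lexicon words by edit distance and keep those within max_distance."""
    scored = sorted((levenshtein(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


if __name__ == "__main__":
    lexicon = ["ኢትዮጵያ", "ኮርፖሬሽን"]            # toy lexicon
    print(levenshtein("ኢዮጵያ", "ኢትዮጵያ"))     # single letter deletion
    print(suggest("ኢዮጵያ", lexicon))
```

With these three operations a transposition of two adjacent letters costs 2; variants that count a transposition as a single operation (Damerau-Levenshtein) would assign it a distance of 1.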
  • 32. 17 2.1.4. Spelling checker tools
Ispell, Aspell, MySpell and Hunspell are some of the open source spell checker tools integrated with open source word processors such as LibreOffice and Open Office [5].
Ispell is a spelling checker for UNIX that supports most Western languages (English (United Kingdom), English (United States), French, German and Spanish). It offers several interfaces, including a programmatic interface for use by editors such as Emacs (a popular text editor used mainly on Unix-based systems by programmers, scientists, engineers, students and system administrators). Ispell only suggests corrections that are based on a Levenshtein distance; it does not attempt to guess more distant corrections based on English pronunciation rules. The generalized affix description system introduced by Ispell has been imitated by other spelling checkers such as MySpell [2]. Like most computerized spelling checkers, Ispell works by reading an input file word by word, stopping when a word is not found in its dictionary. Ispell then attempts to generate a list of possible corrections and presents the incorrect word and any suggestions to the user, who can choose a correction, replace the word with a new one, leave it unchanged or add it to the dictionary [5].
Another open source spelling checker tool is Aspell, a spell checker program designed to replace Ispell. Its primary advantage over Ispell and other existing spell checkers is its suggestion of possible replacements for a misspelled word. Aspell can also spell check UTF-8 encoded documents without the use of an additional dictionary, and it supports multiple dictionaries at once, which Ispell does not.
MySpell is a spell checker based on Ispell. MySpell is used by OpenOffice.org and Firefox/Mozilla and works on both Windows and Linux [5].
Hunspell is the next generation of MySpell. It has been improved to support additional features for different languages, especially Hungarian, as well as other languages such as German and Turkish [21]. In general, Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character
  • 33. 18 encoding. Hunspell is an attractive spell checker for many languages, such as Amharic and Arabic, because of the following features [5]:
- Unicode support,
- Morphological analysis and stemming,
- Support for complex compounding,
- Support for language-specific features,
- Handling of conditional affixes, circumfixes, forbidden words, pseudo roots and homonyms,
- Free and open source software.
Ispell, Aspell, MySpell and Hunspell all use a dictionary file (.dic) and an affix file (.aff). The dictionary file (.dic) is a list of words with their corresponding affix rules. The affix file describes each of the prefix-, infix- and suffix-based rules. An affix is a linguistic element added to a word to produce an inflected or derived form. An affix can be placed at the beginning (prefix), middle (infix) or end (suffix) of the root or stem of a word [21]. Accordingly, the affix rules used in the spell checkers mentioned above are prefix, infix or suffix rules.
An Amharic electronic text spell checker should be able to analyze the rich morphological structure of the Amharic language and support Unicode encoding. The Hungarian spell checker, which is based on Hunspell, is capable of analyzing the complex morphological nature of a language and supports Unicode encoding. Therefore, the Hunspell spelling checker tool is used in our study.
2.2. Related works
Different language-specific spell checkers have been developed to address spelling errors that arise, because of the specific nature of each language, when documents are prepared with word processors. This section presents some of those language-specific spell checkers. In addition, it looks at the sources of spelling error variation for the Amharic language.
2.2.1. Nepali Spell Checker

Keyboard adjacencies, shift key characters, phonetic similarity, and visual similarity are indicated as the main causes of spelling mistakes in the Nepali writing system [22]. As shown in Figure 2.1, the Nepali spell checker has three components: a Morphological Preprocessing Module, a Lexicon Lookup/Error Detection Module and a Suggestion Module. Each module can be easily incorporated to develop a new spell checker for other languages and can also be used to devise new techniques and procedures for the Nepali language.

Figure 2.1 Architecture for Nepali spell checker [6]

Lexicon
Because the size of the lexicon is an important factor for the efficiency of a spell checker, only Nepali root words are stored.

Error Detection Module
The error detection module looks up the input word in the lexicon. A token (a Nepali word) is the input to this module. The module searches for the input word in the lexicon; if the word is not found, it is sent to the morphological preprocessing module. The module also accepts the word broken down by the morphological preprocessing module and searches for it in the lexicon. If the root word is not found in the lexicon, a spelling error is detected, and the word is sent to the suggestion module for correction.

Morphological Preprocessing
In this module, morphologically complex words are broken down into root words, which are then searched for in the lexicon. The researchers used morphological rules in order to reduce the size of the lexicon. The morphological preprocessing module uses a Nepali Porter stemmer to break morphologically complex words down into roots and affixes.

Spelling Error Correction and Suggestion Module
The suggestion module receives a token when a spelling error is detected. For spelling error correction and suggestion, it uses the edit distance algorithm (more specifically, the Levenshtein edit distance algorithm).

Evaluation
The researchers used four evaluation metrics for their spell checker: lexical recall (the percentage of valid words correctly accepted), error recall (the percentage of invalid words correctly flagged), precision (the percentage of flagged words that are correctly flagged), and suggestion adequacy (how adequate the correct suggestion is).

2.2.2. Spelling Checker for Afaan Oromo Language

As stated in Gaddisa [8], Afaan Oromo belongs to the Cushitic language family. It is an official language of Oromiya regional state and has a very rich morphology. The system is designed as a dictionary look-up with morphological rules. The morphological rules for Afaan Oromo address word categories and their possible inflections, derivation and compounding.
The architecture has eight components: Tokenizer, Knowledge base, Error detection, Morphological analyzer, Error correction, Morphological generator, Suggestion ranker and Word assembler. The system uses English (Latin) characters, and its word inflection differs from Amharic word inflection. The Levenshtein edit distance algorithm is used to rank the suggestions, and the suggested word with the shortest distance to the misspelled word is considered the best suggestion.

The research used accuracy, performance, precision, and recall to evaluate the spell checker. Accuracy measures how well the prototype produces suggestions for the generated errors. Performance measures the efficiency of the prototype in terms of the time it takes to generate the correct suggested word. Precision and recall measure the number of correct suggestions out of the total number of spelling suggestions [8].

2.2.3. Spell Checker for Bangla

According to [23], the phonetic similarity of Bangla characters and the difference between grapheme representation and phonetic utterance are the most common reasons for spelling errors in written Bangla. To produce good suggestions for these spelling errors, methods based on edit distance and fuzzy string matching have been developed for the language.

The work of Naushad UzZaman and Mumit Khan of BRAC University [24] presents a Double Metaphone encoding algorithm for Bangla that spell checkers can use to improve the quality of suggestions for misspelled words. The researchers showed how this encoding system captures the complex rules of Bangla and the dialectal pronunciation differences that traditional edit-distance methods cannot handle. They compared the proposed Double Metaphone algorithm with edit-distance-based methods in producing suggestions for misspelled words.
2.2.4. Spell Checker for Arabic language

In 2003, Khaled Shaalan and Amin Allam [25] of Cairo University developed an Arabic morphological analyzer. In addition, they devised techniques for spelling error detection and correction for Arabic by investigating common spelling errors in Arabic text writing. The researchers analyzed and classified the common spelling errors in written Arabic words as follows:

• Reading errors: these occur when the writer copies a word from a written document and confuses characters that are visually similar in the language.
• Hearing errors: these occur when the writer is taking dictation and recognizes one character as another, for example because of pronunciation differences.
• Morphological errors: errors in this category may come from non-native speakers of Arabic or from writers who are not well educated.
• Editing errors: these are the errors most common in other languages [4], such as insertions, deletions, substitutions, and transpositions.

Their study focused mainly on correcting spelling errors in isolated words. They proposed spelling correction methods categorized as ‘add missing character’, ‘replace incorrect character’, ‘remove excessive character’, and ‘add a space’ to split a misspelled word into two or more words. To add a missing character, the spell checker inserts a character at every possible position; if the modified word matches a word in the lexicon, the new word is added to the list of candidates. Similarly, to replace an incorrect character, the tool replaces every character with one of its neighbors according to some rule, and any new word found in the lexicon is added to the list. To split words, the tool inserts a space at every possible position, and the newly formed words are added to the suggestion list when they are found in the lexicon.

2.2.5. Spelling Checker for Amharic Language
Several researchers have studied Amharic spell checking (AMSPCH). Among these works, a Hunspell-based typographic Amharic spell checker was developed by Getaneh Woldeyesus at the Graduate School of Telecommunications and Information Technology, Ethiopia. This work mainly focuses on how Hunspell, an open source spell checker and morphological analyzer library originally developed for Hungarian, can be used for typographic spell checking of uninflected Amharic words. The research used the Hunspell model as a solution for detecting and correcting Amharic spelling errors [4].

The study first generates typographic errors for Amharic. The error generation technique selects words that have three or more characters from the lexicon and then randomly selects a position at which to start the error generation. In the implementation, Hunspell was modified by removing Hungarian-specific function calls, removing capitalization checking, and truncating affixation rules. In addition, a new word list was generated from the original word list to exclude inflected words.

The research used accuracy, performance, precision, and recall to evaluate the proposed spell checker for Amharic. Accuracy measures how well the prototype produces suggestions for the generated errors. Performance measures the efficiency of the prototype in terms of the time it takes to generate the correct suggested word. Precision and error recall measure the number of correct suggestions out of the total number of spelling suggestions [4].

The work was done on uninflected Amharic words, and the errors were generated randomly. First, randomly generated errors do not reflect the actual trends of spelling errors in Amharic text writing. Second, Amharic is a Semitic language with complex morphology, which implies that morphological analysis must be considered when developing Amharic spell checkers.

For these reasons, much more effort is needed to study the spelling errors that actually occur in Amharic text writing and to develop spell checking for inflected Amharic words, especially internally inflected words.
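The random error generation described above for [4] can be illustrated with a minimal sketch. The source states only that words of three or more characters are selected and that the error position is chosen randomly. The choice of a single-character substitution, the function name make_typo, the toy lexicon and the seed are all assumptions made for illustration, not the procedure of [4].

```python
import random
from typing import Optional, Sequence


def make_typo(word: str, alphabet: Sequence[str], rng: random.Random) -> Optional[str]:
    """Return `word` with one randomly chosen character substituted.

    Words shorter than three characters are skipped (None), following the
    selection criterion described for [4].
    """
    if len(word) < 3:
        return None
    position = rng.randrange(len(word))  # random error position
    # Assumption: the generated error is a substitution by a different character.
    choices = [ch for ch in alphabet if ch != word[position]]
    return word[:position] + rng.choice(choices) + word[position + 1:]


# Hypothetical toy lexicon; the alphabet is simply every character it contains.
lexicon = ["ሰበረ", "ሰለለ", "ቢሮ", "ዶክተር"]
alphabet = sorted(set("".join(lexicon)))
rng = random.Random(2017)  # fixed seed so the run is repeatable
errors = {word: make_typo(word, alphabet, rng) for word in lexicon}
```

Each generated form differs from its source word in exactly one position, which is the kind of uniformly random error that the critique above argues does not reflect real Amharic spelling error trends.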
On the other hand, [5] describes Amharic words and their inflections (inflection of nouns, verbs and adjectives), and Shewangizaw [4] identified and studied Amharic error patterns and listed the affixes of Amharic words for each part of speech. The works of Mekonnen [5], Gaddisa [8] and Shewangizaw [4] were implemented on the framework of Hunspell, the default OpenOffice spell checker. The spell checker of Shewangizaw [4] checks Amharic text written in another word processor only after it is manually copied and pasted into OpenOffice. Its dictionary file and affix file are written in Latin script, with transliteration components that convert Amharic text to Latin and Latin back to Amharic, and it does not consider internal inflection. In the study of Mekonnen [5], root words are stored in the dictionary and prefixes and suffixes in the affix rules, without transliteration. The following were not addressed in either [4] or [5]:

• Management of internal inflection. For example, ቆራረጠ is derived from the root word ቆረጠ: ራ is added inside the word ቆረጠ before the character ረ.
• Real-word error checking and correction based on contextual information.
• Auto-correction, removal of extra spaces and removal of repeated words.
• Font-independent spell checking. The previous works depend on Unicode and do not support other Amharic fonts.
• Exhaustive prefix and suffix rules.

This study therefore covers internal inflection of words, compound words, repeated words, prefixes and suffixes exhaustively.
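The internal inflection named in the first limitation above (ቆረጠ to ቆራረጠ) can be sketched as follows. This is a minimal illustration, not the rule set of this study. It assumes the pattern "insert the fourth-order form of the second-to-last character before it", and it relies on the layout of the Unicode Ethiopic block, where each regular consonant row occupies eight consecutive code points starting at U+1200. The function names are hypothetical.

```python
from typing import Optional

ETHIOPIC_START = 0x1200  # first code point of the Unicode Ethiopic block
ROW_SIZE = 8             # each regular consonant row spans 8 code points
FOURTH_ORDER = 3         # zero-based offset of the fourth order in a row


def fourth_order(char: str) -> str:
    """Return the fourth-order form of an Ethiopic syllable, e.g. ረ -> ራ."""
    offset = ord(char) - ETHIOPIC_START
    row_start = ETHIOPIC_START + (offset // ROW_SIZE) * ROW_SIZE
    return chr(row_start + FOURTH_ORDER)


def internal_inflection(root: str) -> str:
    """Insert the fourth-order form of the second-to-last character before it."""
    if len(root) < 2:
        return root
    return root[:-2] + fourth_order(root[-2]) + root[-2:]


def strip_internal_inflection(word: str) -> Optional[str]:
    """Undo the pattern; return the candidate root, or None if it does not apply."""
    if len(word) >= 3 and word[-3] == fourth_order(word[-2]):
        return word[:-3] + word[-2:]
    return None
```

For example, internal_inflection("ቆረጠ") yields ቆራረጠ, and strip_internal_inflection("ቆራረጠ") recovers ቆረጠ, which a spell checker could then look up in the root lexicon. Labialized rows of the Ethiopic block do not follow the regular order layout and are outside this sketch.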
CHAPTER THREE
DESIGN AND DEVELOPMENT OF AMHARIC LANGUAGE SPELLING CHECKER

3.1. Amharic Language Spelling Checking

3.1.1. Amharic Language Inflection

As stated in Getahun [26], Amharic parts of speech (POS) are categorized into different classes, namely Noun, Verb, Adjective, Preposition, and Adverb. Words in these classes can be inflected for number, gender, case, definiteness, pronoun, tense, and person [5]. In this study we investigate how root words are inflected and develop rules based on the investigation.

Inflection of nouns

Amharic nouns can be inflected for number, gender, case and definiteness. Nouns are inflected by adding affixes and by reduplication [2, 18]. The plural is formed by adding one of two suffixes, “ዎች” or “ኦች”. With the suffix “ዎች” the word “በሬ” is inflected to “በሬዎች”, and with the suffix “ኦች” the word “ዶክተር” is inflected to “ዶክተሮች”. Examples are shown in Tables 3.1 and 3.2.

Table 3.1 Inflection of nouns by adding suffix “ኦች”
ነጠላ ቁጥር (singular) | Root + suffix | ብዙ ቁጥር (plural)
ዶክተር | ዶክተር-ኦች | ዶክተሮች
አያት | አያት-ኦች | አያቶች
ቤት | ቤት-ኦች | ቤቶች
ወንበር | ወንበር-ኦች | ወንበሮች
ፍየል | ፍየል-ኦች | ፍየሎች
በግ | በግ-ኦች | በጎች
እግር | እግር-ኦች | እግሮች
እስስት | እስስት-ኦች | እስስቶች

Table 3.2 Inflection of nouns by adding suffix “ዎች”
ነጠላ ቁጥር (singular) | Root + suffix | ብዙ ቁጥር (plural)
ገበሬ | ገበሬ-ዎች | ገበሬዎች
እርሻ | እርሻ-ዎች | እርሻዎች
ሸማኔ | ሸማኔ-ዎች | ሸማኔዎች
ነጋዴ | ነጋዴ-ዎች | ነጋዴዎች
በሬ | በሬ-ዎች | በሬዎች
ተማሪ | ተማሪ-ዎች | ተማሪዎች

Another technique for deriving or inflecting nouns is reduplication, in which the word is repeated with some modification: the last character of the word, usually a sixth-order form, is converted to its fourth-order form, and the original word is then appended. Examples are shown in Table 3.3.

Table 3.3 Inflection of nouns by reduplication
ነጠላ (singular) | ብዙ (plural)
ግርድ | ግርዳግርድ
ጌጥ | ጌጣጌጥ
ትል | ትላትል
ብረት | ብረታብረት
ጥሬ | ጥራጥሬ
ሸቀጥ | ሸቀጣሸቀጥ
ጨርቅ | ጨርቃጨርቅ

Amharic nouns can also be inflected for gender by adding “ኢት”: the word “በግ” is inflected to “በግ-ኢት” and “ልጅ” to “ልጅ-ኢት”. Another form of noun inflection is based on case, which concerns the role of the word in a sentence, such as subject or object. It is done by adding suffixes such as “-ን”, “-ኤ” and “-ህ”. For example, adding “-ን” to the word “ልጅ” gives “ልጁን”, and adding “-ኤ” (“ልጅ-ኤ”) gives “ልጄ”. The last form of noun inflection is based on definiteness, which is marked by the suffixes “ኢቱ”, “ዉ”, “ኡ”, “ዋ” and “ይቱ” [26].

Inflection of Verbs

Verbs have affixes that show the subject and object of a sentence [5]. These affixes include “ሽ”, “ች” and “ህ”. Verbs are inflected for person, gender, number and tense. The affix “ሽ” in “ሄድሽ” marks second person, singular, feminine and past tense; similarly, the affix “ህ” in “ሄድህ” marks second person, singular, masculine and past tense.

As stated in [5], verbs are the most inflected part of speech in Amharic. As shown in the first column of Tables 3.4, 3.5 and 3.6, verbs in the perfect tense, that is, verbs that indicate third person singular masculine, are considered root words in this study. All root verbs are inflected to the compound imperfect, gerund, contingent and infinitive forms. The compound imperfect is derived from the root word by adding affixes such as ይ--አል, ት--አለች, ይ--አሉ, ን--አለን, ት--አላችሁ, ት--አለህ and ት--አለሽ. The gerund form of a verb is obtained by adding the suffix ኦ. The contingent and infinitive forms are obtained by adding ይ-- and መ respectively, as shown in Table 3.4. Tables 3.5 and 3.6 show the affixes used in transitive verbs and in the negation of verbs. Table 3.7 shows the internal inflection of words.

Table 3.4 Inflection of verbs
Root | ይ------አል | ------ኦ | ይ-------እ | መ-----እ
ሄደ | ይሄዳል | ሄዶ | ይሂድ | መሄድ
ቆረጠ | ይቆርጣል | ቆርጦ | ይቁረጥ | መቁረጥ
ሮጠ | ይሮጣል | ሮጦ | ይሩጥ | መሮጥ

Table 3.5 Inflection of transitive verbs
Root | አስ--- | ተ---
ገደለ | አስገደለ | ተገደለ
በላ | አስበላ | ተበላ
በላች | አስበላች | ተበላች

Table 3.6 Negative inflections of verbs
Root | አት------ም | አል------ም | አይ-------ም | አን-----ም
ተማረች | አትማርም | አልማርም | አይማርም | አንማርም
በላች | አትበላም | አልበላም | አይበላም | አንበላም
ገደለ | አትገድልም | አልገድልም | አይገድልም | አንገድልም
በላ | አትበላም | አልበላም | አይበላም | አንበላም

Table 3.7 Internal inflections
Root | Inflection
ገጠመ | ገጣጠመ
ቆረጠ | ቆራረጠ
ሰበረ | ሰባበረ
ቀመሰ | ቀማመሰ
ቆመጠ | ቆማመጠ
ከተፈ | ከታተፈ
ገመጠ | ገማመጠ
ደረበ | ደራረበ
ከመረ | ከማመረ
ሰነጠቀ | ሰነጣጠቀ
ገነጠለ | ገነጣጠለ
ገለጠ | ገላለጠ
መረጠ | መራረጠ

Inflection of Adjectives

Adjectives are inflected for number, case and definiteness, in the same way as nouns. For example, by adding ኦች the words ጅል, ብልህ, ጠባብ and ጎበዝ are inflected to ጅሎች, ብልሆች, ጠባቦች and ጎበዞች respectively.

3.1.2. Amharic spelling error patterns

In addition to the Amharic spelling error patterns presented by Shewangizaw [4] and Daniel [27], compound words, abbreviations and mistyping are identified as sources of spelling error variation for Amharic. These sources of error variation, together with the observed trends of spelling errors in Amharic, are presented below.

Compound words

The Amharic writing system uses different techniques for writing compound words, and there is no standard for writing a compound as two separate words or as a single word [5]. As a result, the same compound word appears in several written forms, and it is not clear which form is correct. Examples: ሰዉ ሰራሽ and ሰዉ-ሰራሽ; ወፍ ዘራሽ and ወፍ-ዘራሽ; እጅ አዙር and እጅ-አዙር; ወጥቤት and ወጥ-ቤት; አየርወለድ and አየር-ወለድ; አጥር ግቢ and አጥር-ግቢ. In this study we follow the compound word writing system of Getahun [26], which joins the two words with a hyphen. Examples: አጥር-ግቢ, ልብስ-ሰፊ, ሰዉ-ሰራሽ, እንጀራ-ጋጋሪ, ሰርቶ-አዳሪ, አዉቆ-አበድ, ሰርጎ-ገብ, መንፈቀ-ሌሊት, ቤተ-ክርስቲያን.

Abbreviations
In English, words are commonly written in abbreviated forms, for example Dr., Mr., etc. Similarly, Amharic allows a single word to be written in different abbreviated forms, which can be a source of Amharic spelling error variation. For example, the phrase ጠቅላይ ሚኒስትር can be abbreviated as ጠ/ሚ or ጠ/ሚኒስትር, ዶክተር can be written as ዶ/ር, and ክፍለከተማ can be written as ክ/ከተማ. A spell checker application should therefore handle such words when a user enters any of these forms.

Syllographic redundancy
Most Amharic vocabulary originates from the Geez language [4]. However, Amharic took over the Geez symbols without preserving Geez phonology for some of the characters. As a result, some characters share the same pronunciation, and each of them has its own set of orders. Such issues are inherent in the redundancy of Amharic symbols and need to be addressed in the spell checking process. Example: “አለምፀሀይ” and “አለምጸሀይ” for “ዓለምፀሐይ”.

Glypheme misidentification
This source of errors arises from the visual similarity of some Amharic characters. Most of the time, the characters ‘ው’ and ‘ዉ’, ‘ፖ’ and ‘ፓ’, ‘ዪ’ and ‘ዩ’, ‘ጕ’ and ‘ጒ’, and ‘ቁ’ and ‘ቍ’ are used interchangeably in the Amharic writing system. Because the characters look similar, users may simply choose the form that is easiest to write by hand or to type into a computer. Writing “ነዉ” instead of “ነው” is an example of a glypheme misidentification error.

False Geezims
This source of variation in Amharic words arises from inserting a wrong letter, mostly a character that should be silent. Example: “ምልክት” vs. “ምልእክት”.

Assimilation and Alternations
It is common in Amharic for ‘ም’ to be exchanged for ‘ን’ before ‘በ’, as in “ሽንብራ” vs “ሽምብራ”. This is one source of spelling variation in the Amharic writing system.
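For the syllographic redundancy described above, one way a spell checker can bring variant spellings to a common form is to map every redundant character to a single representative. The sketch below is a minimal illustration under stated assumptions. The choice of representatives (the ሀ, ሰ, አ and ጸ rows) and the two same-sound pairs are assumptions, since the source does not list its mapping. The code points come from the Unicode Ethiopic block, and the names are hypothetical.

```python
# Assumed row pairs: start of a redundant consonant row -> start of the row
# chosen as its representative (code points in the Unicode Ethiopic block).
REDUNDANT_ROWS = {
    0x1210: 0x1200,  # ሐ row -> ሀ row
    0x1280: 0x1200,  # ኀ row -> ሀ row
    0x1220: 0x1230,  # ሠ row -> ሰ row
    0x12D0: 0x12A0,  # ዐ row -> አ row
    0x1340: 0x1338,  # ፀ row -> ጸ row
}
# Assumed same-sound pairs inside the representative rows: ሃ -> ሀ, ኣ -> አ.
SAME_SOUND = {0x1203: 0x1200, 0x12A3: 0x12A0}


def build_table() -> dict:
    """Build a translation table for str.translate from the pairs above."""
    table = dict(SAME_SOUND)
    for source_row, target_row in REDUNDANT_ROWS.items():
        for order in range(7):  # the seven basic orders of each row
            target = target_row + order
            table[source_row + order] = SAME_SOUND.get(target, target)
    return table


NORMALIZATION_TABLE = build_table()


def normalize(word: str) -> str:
    """Replace redundant characters by their assumed representatives."""
    return word.translate(NORMALIZATION_TABLE)
```

With this table, the three spellings “አለምፀሀይ”, “አለምጸሀይ” and “ዓለምፀሐይ” from the example above normalize to the same string, so a single lexicon entry can cover all of them.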
Foreign language transcription
Amharic has some words taken from other languages, very often technology terms. Until some convention emerges, there will be conflicting ways of transcribing those words. Example: “ኮምፒዩተር” vs. “ኮምፒውተር”.

Dialect variation
Regional dialects can also affect word formation at the basic level, where words are more likely to be written following their spoken form. Examples include “ሆመጠጠ” vs “ኮመጠጠ”, “ሂጂ” vs “ሂጅ”, “አይዶለም” vs “አይደለም” and “ዓጤ” vs “ዓፄ”.

Mistyping
It is common to type misspelled words instead of the correct ones during document processing. The two common reasons for mistyping Amharic words are that a keyboard has fewer keys than there are symbols, and that different Amharic word processing tools use different keyboard layouts for Amharic input. In phonetic-based input methods, mistyping comes from “shift-slip”, for example “ቴና” for “ጤና” [4].

3.1.3. Affix Rules Development

As described in Section 2.1.3.1, the drawback of storing all forms of words in the lexicon is that a lexicon containing all correct words could be extremely large. Such a lexicon needs more space and is inefficient to search, and it is practically impossible to list all correct words. To minimize this problem, we store only root words in the lexicon. The input word from the input component is checked against the lexicon by considering both root and inflected words, using the affix rules developed for this purpose.
During affix rule development, prefix, infix and suffix lists were collected manually from different documents. The dictionary is built from 24649 words, and the affix rules are built from 60 prefixes, 1752 suffixes and 86 rules for internally inflected words. Sample prefixes, infixes and suffixes used in this work are listed in Appendix C. The identified prefix, infix and suffix lists then need to be categorized so that they can be integrated with each lexicon entry.

According to [26], Amharic POS can be categorized into five major classes, namely Noun, Verb, Adjective, Pronoun, and Adverb. Nouns can be inflected for number, gender, case and definiteness. Verbs can be inflected for person, gender, number, mood, and tense. Adjectives are inflected for number, case, and definiteness. Based on this, we categorized the identified suffix lists and gave each category a unique identifier.

After the suffix and prefix lists are categorized, there should be a rule that indicates how a given word takes a suffix. For example, the suffix “-ዎች” is allowed for the word በሬ. Rules were therefore developed to handle such cases.

3.1.4. Dictionary development

The Amharic root word dictionary is compiled from different sources. Amsalu Aklilu [28], the Concise Amharic Dictionary (Amharic to English and English to Amharic) [29], ጌታሁን አማረ (Getahun Amare) [26] and ባየ ይማም (Baye Yimam) [20] are taken as the base dictionary because they contain the part of speech for many words, compound words and phrases. The dictionary was developed in the following steps:

• Remove inflected words from the dictionary.
• Remove phrases made of two or more words.
• Add some verbs that are not available in the Amsalu Aklilu dictionary or in the Concise Amharic Dictionary.
• Add country names and common personal names.
• Normalize dictionary entries.
• Append rules to each word in the dictionary, partially based on the Amharic part of speech.
• Produce a text file consisting of a list of Amharic words, one per line.

As shown in Figure 3.1, words are listed one per row, each followed by the identifiers of the affix rules that apply to that word. The first line holds the approximate number of words. Each word in the dictionary is followed by a forward slash and zero or more flag identifiers. The output is a file (am_ET.dic) with the .dic extension, which is an input to the Amharic spell checker program. As can be seen in Figure 3.1, the first entry, the number 6, indicates the estimated number of root words in the dictionary. The Amharic words from the second line onward are the root words. The slash after each Amharic word marks the end of the root word and the beginning of the rule identifiers. All characters after the slash are rule identifiers defined in the affix file. The rule identifier MK points to affix rules that attach verb prefixes such as ለ, በ, ከ, የ, ስለ, እንደ, ያል and እስከ.

    6
    ሱቅ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
    ሰበረ/MKIMINPOHN
    ዜማ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
    ሰለለ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN
    ሰገረ/MKNIMINASAGAWANMWMOCWCPOAUAEETNNO2WMWN
    ቢሮ/MKNIMINASAGAWANMWMOCWCPOAUAEACWMWN

Figure 3.1 Sample list of dictionary

3.1.5. Lexicon lookup

A lexicon lookup implemented over a long linked list may have to traverse the list all the way to the end. Checking every element for equality with a given input word is inefficient and slow, especially for a large lexicon. Hence, a hash table dictionary lookup method was implemented for the lexicon lookup process. In this work, we adopted the dictionary lookup algorithm developed by the author of Hunspell [30]. The hash table lookup first calculates a hash function for a given string; the hash value is obtained by manipulating the bytes of the string [6].

3.2. Design of Amharic Spelling Checker (AMSPCH)

The design of a spell checker has to incorporate the general features of a spell checker together with language-specific components for the targeted language. In this chapter, we discuss the general and language-specific requirements of the designed Amharic spell checker, and the issues and requirements addressed in the design of AMSPCH. To perform spell checking, we first need to read words and present them as tokens; hence, we have introduced an input component. The other components of our spell checker are the normalization component, the error detection component, the morphological analyzer, and the error correction and suggestion component. All of these components are briefly discussed in Section 3.2.2.
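The dictionary file layout of Section 3.1.4 and the hash-based lookup of Section 3.1.5 can be sketched together. This is a minimal illustration rather than Hunspell's own implementation: a Python dict stands in for the hash table, and Hunspell's hash function is not reproduced. The sketch assumes that every rule identifier is two characters long, as the identifiers MK and OC suggest. The name load_dic and the shortened sample entries are hypothetical.

```python
from typing import Dict, List, Tuple

# Shortened sample in the am_ET.dic layout: a word count, then word/flags lines.
SAMPLE_DIC = """3
ሱቅ/MKOC
ሰበረ/MKIMINPOHN
ቢሮ
"""


def load_dic(text: str, flag_width: int = 2) -> Tuple[int, Dict[str, List[str]]]:
    """Parse .dic text into (declared word count, {root word: [rule ids]})."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    declared_count = int(lines[0])  # first line: approximate number of words
    lexicon: Dict[str, List[str]] = {}
    for entry in lines[1:]:
        word, _, flags = entry.partition("/")  # flags is "" when no slash
        # Assumption: each rule identifier is flag_width characters wide.
        lexicon[word] = [
            flags[i:i + flag_width] for i in range(0, len(flags), flag_width)
        ]
    return declared_count, lexicon


count, lexicon = load_dic(SAMPLE_DIC)
is_root = "ሰበረ" in lexicon  # average constant-time hash lookup
```

A membership test on the dict avoids the end-to-end scan of a linked list, which is the efficiency argument made in Section 3.1.5.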
3.2.1. Design Requirements

Lexicon lookup speed, the selection of an appropriate technique for detecting and correcting spelling errors, and storage requirements are the general factors to consider in designing a spell checker. Besides these general requirements, one has to consider the language-specific features of a spell checker. Our spell checker takes into account the following features of the Amharic language that affect the spell checking process.

Morphological variants of words
Amharic is one of the languages with rich morphology. As discussed earlier, one of the tasks in building a spell checker is developing a lexicon. There are two options: the first is to store all forms of words in the lexicon and retrieve them from this list; the second is to keep only root words in the lexicon and create affix rules or an algorithm for validating all acceptable (inflected) words of the language. The first option has two problems: performance, and the difficulty of obtaining all forms of words for the given language. Developing affix rules (prefix, infix, and suffix rules) is a solution to these problems in the spell checking process [6]. This study therefore uses the second option.

Encoding issue
Previously, Amharic electronic documents were produced with mostly incompatible software based on different encoding systems. However, software vendors for Amharic word processing have recently started to use Unicode, and Unicode appears to be the preferred choice for representing Amharic documents. This study focuses on Amharic documents written using Unicode encoding.

3.2.2. Architecture of the Amharic Spell Checker
Figure 3.2 depicts the architecture of the Amharic spelling checker and presents its components. Our spell checker is designed to check whether a given Amharic word is correctly typed and to give suggestions for incorrectly typed words. To achieve this goal, five components are introduced:

• Input Component
• Normalization Component
• Error Detection Component
• Morphological Analyzer Component
• Error Correction and Suggestion Component

Of these five components, the Normalization and Morphological Analyzer components have language-specific features and should address the Amharic-specific characters that are related to the spell checking process.

[Figure: Amharic Spelling Checker Architecture]
Figure 3.2 Architecture of Amharic spell checker adopted from [3]

Input Component

The input component is responsible for reading characters from OpenOffice and tokenizing them. When the user enters a word, the input module reads characters one by one from the OpenOffice word processor. If the user presses the space bar or one of the punctuation marks shown in Table 3.8, or pastes an electronic document, the input component tokenizes the text. If the text contains visually similar or syllographically redundant characters, the input module replaces them with their predefined representatives. After tokenizing the text, it passes the input word to the Normalization component for further processing. The algorithm for this component is presented in Figure 3.3.

Table 3.8 Amharic punctuation marks
Name | Mark
Word separator | :
Period | ::
Comma | ፣
Colon | ÷
Semicolon | ፤
Preface colon | :-
Question mark | ?
Exclamation mark | !

Begin
1: Make token empty
2: Read a character
   2.1 If the character is one of the Amharic punctuation marks
           call the Normalization module
       Else if the character is end of file
           Exit
       Else
           append the read character to the token
3: Move pointer to the next character, then go to step 2
Figure 3.3 Algorithm for Input component adopted from [3]

Normalization Component

As discussed in Section 3.1.2, one source of error variation in Amharic is syllographic redundancy, that is, the presence of redundant characters that can be used interchangeably in Amharic words. A spell checker for such a language should be able to address this issue by bringing such words into a common form. Hence, one use of this component is to apply a rule to input words that have the syllographic redundancy problem.

Error Detection Component

This component accepts the decomposed word from the Morphological Analyzer component and checks whether the returned word exists in the lexicon. If it does not, the error detection component passes the non-word to the Error Correction component so that the user gets a list of suggestions.

The Amharic spell checker is a dictionary-based spell checker, and misspelled words are identified with a dictionary lookup algorithm. First, the token output by the Normalization module is looked up in the dictionary. If it exists, it is a root word and is treated as correctly spelled. Otherwise, it is passed to the morphological analyzer to check whether it is an inflected word. If the morphological analyzer strips affixes added to a root word, the root word is passed back to the error detection module to check its existence in the dictionary [5].

Morphological Analyzer Component

As described in 2.2.6, Amharic is a morphologically complex language whose basic units are mostly consonantal roots. Because of this complexity, all classes of words are highly inflected, and a single word carries a lot of information. This has to be addressed when designing a spell checker for such a language. In other words, we need to accept all valid inflected words in addition to the root words. Besides spell checking, morphological analyzers are used for information retrieval, POS tagging, machine translation, etc. [4].

The task of our Morphological Analyzer component is to accept a word from the error detection component, decompose it into stem and affixes based on predefined Amharic word formation rules, and pass the resulting stem and affixes back to the error detection module. This morphological analyzer is limited to inflectional morphology. It does, however, handle internal derivational patterns (for example ፈለገ into ፈላለገ, ገጠመ into ገጣጠመ): such words are treated as internal inflection, and rules are applied to them. We adopted the morphological analysis methods used by Hunspell [30] for words flagged as possibly misspelled, first developing word formation rules for the Amharic language. The algorithm is presented next.

Input: word I_Word from the Error Detection component
Output: list of affixes and root words
Start
1. Scan the input word from right to left and from left to right to look for valid suffixes and prefixes
   For each valid suffix in I_Word, strip it and store the result in a buffer
   For each valid prefix in I_Word, strip it and store the result in a buffer
   // pass the list of affixes and stems to the error detection module
   Return root and affix
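The affix stripping in the algorithm above can be sketched as follows. This is a simplified illustration, not the Hunspell-based analyzer of this study. It uses only a handful of the prefixes named in the text and the plural suffix “ዎች”, it folds the lexicon check into the analysis so that accidental matches (such as the first character of በሬ looking like the prefix በ) are rejected, and it ignores suffixes that change the final character of the root. The names analyze, PREFIXES and SUFFIXES are hypothetical.

```python
from typing import List, Set, Tuple

PREFIXES = ["ለ", "በ", "ከ", "የ", "ስለ", "እንደ"]  # sample prefixes named in the text
SUFFIXES = ["ዎች"]                              # plural suffix from Table 3.2


def analyze(word: str, lexicon: Set[str]) -> List[Tuple[str, str, str]]:
    """Return every (prefix, root, suffix) split whose root is in the lexicon."""
    analyses = []
    for prefix in [""] + PREFIXES:       # "" means no prefix is stripped
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:   # "" means no suffix is stripped
            if not rest.endswith(suffix):
                continue
            root = rest[:len(rest) - len(suffix)]
            if root in lexicon:
                analyses.append((prefix, root, suffix))
    return analyses
```

For a lexicon containing በሬ, analyze("ለበሬዎች", lexicon) returns the single split (ለ, በሬ, ዎች), while a misspelling such as በሬዎቸ yields an empty list and would be passed on to the suggestion component.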
Error Correction and Suggestion Component

After the input word is flagged as a non-word, the spelling error has to be corrected, and the user is given a list of suggested words to select from. The error correction and suggestion component was designed to accomplish this task. It takes a word from the Error Detection component, searches the lexicon for all possible corrections as suggestions, and ranks the resulting list of words. In this work, the Levenshtein edit distance is used for error correction and suggestion; for details, see Appendix F. The Levenshtein edit distance algorithm is applied with a threshold value that sets the maximum distance at which a word is still offered as a suggestion. Shewangizaw [4] found that words containing a single error contribute 88.8% of the total errors.

Figure 3.4 Algorithm for Morphological Analysis adopted from [8]
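For the error correction and suggestion component described above, the thresholded Levenshtein ranking can be sketched as below. This is a minimal illustration and not the listing in Appendix F. The function names, the default threshold of 2 and the toy lexicon in the note are assumptions.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(word: str, lexicon: Iterable[str], max_distance: int = 2) -> List[str]:
    """Return lexicon words within max_distance of word, nearest first."""
    scored = sorted((levenshtein(word, candidate), candidate) for candidate in lexicon)
    return [candidate for distance, candidate in scored if distance <= max_distance]
```

With the lexicon ["ሰበረ", "ሰለለ", "ሰገረ", "ቢሮ"] and a threshold of 1, the non-word ቢሮዉ gets the single suggestion ቢሮ. A low threshold is consistent with the observation from [4] that most misspelled words contain a single error.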
  • 56. 41 CHAPTER FOUR EXPERIMENT, RESULT AND DISCUSSION OF AMHARIC SPELL CHECKER 4.1. Introduction This section describes the detail of the experiment based on source of error variation, error detection and correction techniques, and spelling error trends in Amharic documents. 4.2. Prototype 4.2.1. Input word processing As stated in section 3.2.2, the designed Amharic spell checker has an input component which takes a text file as an input and applies Tokenization and Normalization to generate words for error detection component. Tokenization is the process of breaking up the given text into units called tokens. The tokens may be words or number or punctuation mark. It can occur at a number of different levels: paragraphs, sentences, words, syllables, or phonemes [31]. This process needs word
boundaries or punctuation marks of the given text and the encoding of the given language. In this work, only words in Unicode encoding are tokenized. As discussed in section 3.2.2, Amharic has its own punctuation marks that demarcate words, sentences, etc. In electronic documents, however, white space is used instead of punctuation marks to demarcate Amharic words. Therefore, tokenization of Amharic text is done by considering all Amharic punctuation marks (i.e. word separator, period, comma, colon, semicolon, preface colon, question mark, and exclamation mark) and white space.

4.2.2. Implementation of spell checker in OpenOffice using Hunspell
As discussed in section 2.2.6, Hunspell is the default spell checker for OpenOffice.org. It requires two files to define spell checking for a language. The first is a lexicon containing the words of the language (Amharic words in our case), and the second is an affix file that defines the meaning of the special flags used in the lexicon. The affix file contains the prefix and suffix rules to be associated with the words in the lexicon. As shown in figure 4.1, a lexicon file (am_ET.dic) contains a list of Amharic root words. The first line of the lexicon contains the approximate number of entries in the file. Each word may optionally be followed by a slash ("/") and one or more flags, which represent prefix, infix and suffix rules. An affix file (*.aff) may contain many optional attributes. For example, SET sets the character encoding of the affix and lexicon files. PFX and SFX define prefix and suffix classes, respectively, named with affix flags. The following example describes the structure of a Hunspell affix file.

Affix file:
SET UTF-8
1. SFX OC Y 27
2. SFX OC 0 ዎች [^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ]
3. SFX OC ህ ሆች ህ
4. SFX OC ል ሎች ል
5. SFX OC ቅ ቆች ቅ
- -
6. SFX AA ፕ ፖች
Figure 4.1 Sample affix rule

As shown in figure 4.1, ህ, ል, ቅ and ፕ in the third column are the characters that are stripped before affixation. ህ, ል, ቅ and ፕ in the fifth column are conditions: the rule applies only if the word ends in that character. The 0 in the third column of line two indicates that nothing is stripped when ዎች is affixed. ሆች, ሎች, ቆች and ፖች are the affixes that are attached if the specified condition is fulfilled. [^ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ] is a condition that checks that the last character is not one of the listed sixth-order characters of the Amharic script. Figure 4.1 thus shows how to define the rules used to derive the plural forms of nouns.

Amharic verbs are more highly inflected than other Amharic parts of speech. They are inflected for person, tense, gender, and case. Some verb prefixes and suffixes depend on each other. This dependency is controlled by the CIRCUMFIX command.

CIRCUMFIX XX
PFX EE Y 1
PFX EE 0 እን/XX
SFX TA Y 2
SFX TA 0 ለን/EEXX [ሃላማሳራሻቃባታቻናኛዛዣያዳጃጳጻፋፓ]
SFX TA ደ ዳለን/EEXX ደ

If ሄደ/TA and በላ/TA are in the dictionary, እንሄዳለን and እንበላለን are valid inflected words, whereas ሄዳለን, በላለን, እንሰብረ and እንበላ are invalid and are marked as misspelled words.

Amharic verbs also take subject and/or object indicator suffixes. For the root word ገደለ, we get the inflected form ገደሉዋቸዉ by adding ዉ, which is a subject indicator, and ኣቸዉ, which is an object indicator.
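The strip/add/condition reading of a suffix rule described for the noun plurals can be sketched in a few lines. This is only an illustration of the rule semantics, not Hunspell's own code: the names `SuffixRule`, `apply_suffix` and `pluralize` are hypothetical, the sample nouns are chosen for the example, and taking the first matching rule is a simplifying assumption. Hunspell itself works in the opposite direction when checking, removing candidate affixes from the input and looking up the remaining stem in the lexicon.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SuffixRule:
    strip: str      # text removed from the end of the word; "0" means strip nothing
    add: str        # suffix appended after stripping
    condition: str  # character or bracket class that the word ending must match


def apply_suffix(word: str, rule: SuffixRule) -> Optional[str]:
    """Return the affixed form, or None when the rule does not apply."""
    if re.search(f"(?:{rule.condition})$", word) is None:
        return None
    if rule.strip != "0":
        if not word.endswith(rule.strip):
            return None
        word = word[: -len(rule.strip)]
    return word + rule.add


def pluralize(word: str, rules: Iterable[SuffixRule]) -> Optional[str]:
    """Apply the first rule whose condition matches (a simplifying assumption)."""
    for rule in rules:
        result = apply_suffix(word, rule)
        if result is not None:
            return result
    return None


# Characters listed in the negated condition of the sample affix file.
CONDITION_CHARS = "ህልምሽቅብትችንኝእክውዝዥጵጽፍፕ"
RULES = [
    SuffixRule("0", "ዎች", f"[^{CONDITION_CHARS}]"),
    SuffixRule("ል", "ሎች", "ል"),
    SuffixRule("ቅ", "ቆች", "ቅ"),
]

print(pluralize("ገበሬ", RULES))  # word ending outside the listed characters
print(pluralize("ቃል", RULES))   # word ending in ል
```

The bracket condition from the affix file can be passed to Python's `re` module unchanged because it already has the form of a regular-expression character class.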
In this work the rule identifiers and their corresponding affixes are listed in Appendix B. Because of their complexity, Amharic verbs need exhaustive rules. The output is a file named am_ET.aff. It is an input to the error detection and correction modules.

4.3. Experiment result and Discussion
This section presents the evaluation criteria for the prototype, describes how the training and test data were prepared, and presents and discusses the results obtained.

4.3.1. Evaluation Criteria
The system is evaluated to test its effectiveness. Different research works have proposed various criteria for the evaluation of a spell checker. Shewangizaw [4], Gaddisa [8] and Mekonnen [5] recommend error recall, precision and suggestion adequacy for the evaluation of a spell checker algorithm. The performance of the system is evaluated using precision and recall. Precision can be seen as a measure of exactness, whereas recall is a measure of completeness.

For information retrieval, precision and recall are defined in [5, 33] as follows. Precision is the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, and recall is the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).

For statistical classification tasks, precision and recall are defined in [5, 33] in the following way. Precision for a class is the number of true positives divided by the total number of elements labeled as belonging to the positive class [32].
The formula for calculating precision is

$$\text{Precision} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Positive}}$$

Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class [32]. The formula for calculating recall is

$$\text{Recall} = \frac{\sum \text{True Positive}}{\sum \text{True Positive} + \sum \text{False Negative}}$$

• True Positive: the spell checker identifies correctly spelled words as correct.
• False Positive: the spell checker treats misspelled words as correctly spelled words.
• False Negative: correctly spelled words are flagged by the spell checker as incorrect.
• True Negative: the spell checker identifies misspelled words as misspelled.

The dataset is taken from reports of different regions, which were also used to study error trends in Amharic texts. Five sets of test data have been collected from summarized reports of the Amhara National Regional State Science, Technology and Information Communication Commission (ANRS STICC), Afar Region, and Harari Region. For misspelled words, the intended valid Amharic word was given manually. The evaluation result is presented in table 4.8.

4.3.2. Experiment
The experiment has been conducted to measure the effectiveness of the Amharic language spell checker. As mentioned in section 4.3.1, we used precision and recall to measure the accuracy of the Amharic language spell checker. Five experiments (1, 2, 3, 4 and 5) were done to evaluate the accuracy of the system.

Experiment 1
For this experiment, the text was taken directly from the Amhara Science, Technology and Information Communication Commission 2009 annual report. The text was then checked against the lexicon to evaluate the accuracy of the system. The data for this experiment has 199 Amharic words, of which 5 (2.5%) are misspelled; all of the misspelled words are detected by the system. However, one correct word is marked as misspelled by the system. All words in the sample data recognized as misspelled by the spell checker are automatically flagged. Figure 4.2 shows a screenshot taken from the output of the spelling checker tested using the sample data, and table 4.1 shows the result of experiment 1.
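The figures quoted for Experiment 1 can be turned into the four counts and the measures of section 4.3.1 with a short script. It is a sketch under the thesis's convention that the positive class is "correctly spelled"; the names `Counts` and `counts_from_report` are hypothetical helpers introduced for the example, and `overall` implements the overall performance formula used in section 4.3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # correctly spelled words accepted by the checker
    fp: int  # misspelled words accepted as correct
    fn: int  # correctly spelled words flagged as misspelled
    tn: int  # misspelled words flagged as misspelled


def counts_from_report(total_words: int, misspelled: int,
                       detected: int, correct_flagged: int) -> Counts:
    """Derive the four counts from the figures quoted in an experiment description."""
    return Counts(
        tp=total_words - misspelled - correct_flagged,
        fp=misspelled - detected,
        fn=correct_flagged,
        tn=detected,
    )


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp)


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def overall(c: Counts) -> float:
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


# Experiment 1: 199 words, 5 misspelled, all 5 detected, 1 correct word flagged.
exp1 = counts_from_report(total_words=199, misspelled=5, detected=5, correct_flagged=1)
print(exp1)
print(f"precision={precision(exp1):.4f} recall={recall(exp1):.4f} overall={overall(exp1):.4f}")
```

For these inputs the counts are TP = 193, FP = 0, FN = 1 and TN = 5, so recall is 193/194 and the overall measure is 198/199.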
Figure 4.2 Sample text screen shot of Experiment 1

Results of Experiment 1
Using the formulas above, the precision and recall results are as follows.

Table 4.1 Evaluation Result for Experiment 1
Result | Count
True positive (TP) | 193
False negative (FN) | 1
False positive (FP) | 0
True negative (TN) | 5
Precision = TP/(TP+FP) x 100 = 193/(193+0) x 100 = 100%
Recall = TP/(TP+FN) x 100 = 193/(193+1) x 100 = 99.4%

Experiment 2
To see the performance of the system on more data, we increased the amount of text taken from the Amhara National Regional State Science, Technology and Information Communication Commission annual report. The sample data for this experiment has 840 Amharic words, of which 16 (1.9%) are misspelled; all of the misspelled words are detected by the system. However, 20 (2.4%) correct words are marked as misspelled by the spelling checker. The evaluation result of experiment 2 is shown in table 4.2. All words in the sample data recognized as misspelled by the spell checker are automatically flagged. The screenshots taken from the output of the spelling checker tested using the sample data are shown in Appendix D. As observed in experiment 2, when the amount of test data increases, the number of false negatives also increases.

Table 4.2 Evaluation Result for Experiment 2
Result | Count
True positive (TP) | 804
False negative (FN) | 20
False positive (FP) | 0
True negative (TN) | 16
Precision = TP/(TP+FP) x 100 = 804/(804+0) x 100 = 100%
Recall = TP/(TP+FN) x 100 = 804/(804+20) x 100 = 97.6%

Experiment 3
For experiment 3, the text was taken directly from the Afar Region ICT 2009 annual report. It consists of 181 Amharic words, of which 7 (3.9%) are detected as misspelled. However, three correct words are marked as misspelled. The screenshot is presented in figure 4.3. The evaluation result is presented in table 4.3.

Table 4.3 Evaluation Result for Experiment 3
Result | Count
True positive (TP) | 171
False negative (FN) | 3
False positive (FP) | 0
True negative (TN) | 7
Precision = TP/(TP+FP) x 100 = 171/(171+0) x 100 = 100%
Recall = TP/(TP+FN) x 100 = 171/(171+7) x 100 = 96.1%

Figure 4.3 Sample text screen shot of Experiment 3

Experiment 4
As in Experiments 1, 2 and 3, the text was taken directly from an annual report, here the Harari Region ICT 2009 annual report. It consists of 94 Amharic words, of which 9 (9.5%) are detected as misspelled, and six correct words are marked as misspelled by the system.
Figure 4.4 shows the screenshot taken from the output of the spelling checker tested using the sample data. The precision and recall evaluation result for the experiment is shown in table 4.4 below.

Table 4.4 Evaluation Result for experiment 4
Result | Count
True positive (TP) | 79
False negative (FN) | 6
False positive (FP) | 0
True negative (TN) | 9
Precision = TP/(TP+FP) x 100 = 79/(79+0) x 100 = 100%
Recall = TP/(TP+FN) x 100 = 79/(79+9) x 100 = 89.8%

Figure 4.4 Sample text screen shot of experiment 4

Experiment 5
As in experiments 1, 2 and 3, we used precision and recall to measure the accuracy of the Amharic spell checker. To compare the result calculated from the output of the system with a manual check by an expert, the same data used for Experiment 4 was evaluated by a language expert. He identified 94 words in total, as in experiment 4, and 7 (7.4%) invalid or misspelled words, two fewer than the system found in Experiment 4. He marked no correct words as misspelled, and two words were left unidentified by the expert.

Expert Evaluation
To compare the accuracy of the system, the same dataset was given to the language expert for evaluation. The expert identified 87 correct words, 7 misspelled words and 0 correct words incorrectly marked as misspelled. The precision and recall evaluation result for the experiment is shown in table 4.5 below.

Table 4.5 Evaluation Result for experiment 5
Result | Expert | System
True positive (TP) | 87 | 79
False negative (FN) | 0 | 6
False positive (FP) | 2 | 0
True negative (TN) | 7 | 9
Expert precision = TP/(TP+FP) x 100 = 87/(87+2) x 100 = 97.75%
Expert recall = TP/(TP+FN) x 100 = 87/(87+7) x 100 = 92.55%
System precision = TP/(TP+FP) x 100 = 79/(79+0) x 100 = 100%
System recall = TP/(TP+FN) x 100 = 79/(79+9) x 100 = 89.8%

As can be seen in the table, there is a difference between the system and the expert. The reason for the difference is examined in experiment 6.

4.3.3. Discussion
The above experiments (1, 2, 3, 4, and 5) are summarized in table 4.6 below.
Table 4.6 Experiment result summary
Experiment | Total number of words | Total misspelled words | Total detected misspelled words | Total undetected misspelled words | Correct words marked as misspelled | Precision | Recall
1 | 199 | 5 | 5 | 0 | 1 | 100 | 99.4
2 | 840 | 16 | 16 | 0 | 20 | 100 | 97.6
3 | 181 | 7 | 7 | 0 | 3 | 100 | 96.1
4 | 94 | 9 | 9 | 0 | 6 | 100 | 89.8
5 | 94 | 9 | 7 | 2 | 0 | 97.75 | 92.55

The overall performance measure determines how accurate a spelling checker is. It is calculated using the following formula [33], and the results are shown in table 4.7.

$$P = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}$$

Table 4.7 Average performance calculated from overall performance of each Experiment
Experiment | TP | TN | FP | FN | Precision | Recall | Overall Performance
1 | 193 | 5 | 0 | 1 | 100 | 99.4 | 99.5
2 | 804 | 16 | 0 | 20 | 100 | 97.6 | 97.62
3 | 171 | 7 | 0 | 3 | 100 | 96.1 | 98.34
4 | 79 | 9 | 0 | 6 | 100 | 89.8 | 93.62
5 | 87 | 7 | 2 | 0 | 97.75 | 92.55 | 97.92
Average Performance | 97.4

Across all experiments, 1408 words were taken from different sources. Of these 1408 words, 46 are misspelled; 44 misspelled words are detected, 2 misspelled words were left undetected by the language expert, and 30 correct words are marked as misspelled.

As shown in Table 4.7, the precision of the system is higher than that of the manual check. The average recall and precision of the system in Experiments 1, 2, 3 and 4 are 95.75 and 100, respectively, compared with 92.55 and 97.75 for Experiment 5, which was checked manually by the language expert. Based on this result, the system performs better. The overall performance of the system is 97.27%.

The experiments show that the Amharic spell checker lacks completeness, as indicated by the recall in all experiments. To find the reason for this lack of completeness, we selected experiment 4 at random from the experiments and tried to exhaust the affix rules in the experiment below.

Experiment 6
Experiment 4 was selected at random. As presented in experiment 4, the text was taken directly from the Harari Region ICT 2009 annual report and consists of 94 words. Figure 4.5 below shows the screenshot taken from the output of the spelling checker after affix rules and words were added to the lexicon. The precision and recall evaluation result for the experiment is presented in table 4.8 below.

Table 4.8 Evaluation Result for experiment 6
Result | Count
True positive (TP) | 84
False negative (FN) | 1
False positive (FP) | 0
True negative (TN) | 9
Precision = TP/(TP+FP) x 100 = 84/(84+0) x 100 = 100%
Recall = TP/(TP+FN) x 100 = 84/(84+1) x 100 = 98.82%

The overall performance of the system based on experiment 6 is

$$P = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} = \frac{84 + 9}{84 + 9 + 0 + 1} = 98.93\%$$

Compared to experiment 4, experiment 6 shows better performance. As can be seen in table 4.8, the number of false negatives is reduced from 6 to 1. Based on this experiment, the reasons for the lack of completeness are:
1. The affix rules defined in the development are not exhaustive.
2. Because of the complexity of the Amharic language, not all words are included in the dictionaries.
So, to enhance the performance of the system, the problems mentioned above should be addressed exhaustively.

Figure 4.5 Sample text screen shot of experiment 6

CHAPTER FIVE
CONCLUSIONS AND RECOMMENDATIONS

5.1. Conclusions
Document preparation is one of the main tasks in governmental and non-governmental organizations. Spelling errors may occur when people use text processing applications. Hence, text processing software has integrated spell checkers and grammar checkers for some languages. For Amharic, however, such tools are not integrated. Thus, it is common to find Amharic books and newsletters that are published with misspelled words.

This research was done to design and develop a spell checker tool for Amharic texts. It involved a study of the spelling errors that can occur in Amharic writing and the development of an Amharic spell checker. In addition, we adopted word formation rules for the Amharic language, which are integrated with the lexicon used by the Amharic spell checker. This lexicon was compiled from ጌታሁን አማረ [26], ባየ ይማም [20], a concise Amharic dictionary, a list of names, and a list of countries. We demonstrated the Amharic spell checker by integrating it into OpenOffice.

The Amharic electronic text spell checker is integrated into the OpenOffice word processor in as-you-type mode, with a word formation and lexicon dependent design. It is a word-level spell checker, specifically a non-word error detector. That is, it does not consider real-word errors, grammatical errors or white space. It is a customized version of the Hunspell spell checker, and its algorithms and architecture are inherently dependent on Hunspell.

In this work we added some new features that are not addressed in previous works. These are support for internally inflected words and repeated words, which were noted by previous researchers [3, 4], and a dictionary that does not require transliteration when a token is accepted for spell checking or when suggestion lists are generated. The use of Unicode data is expected to improve spell checking performance by avoiding transliteration.
Finally, we measured the performance of the system by conducting 5 experiments and calculating recall and precision. The overall performance of the system is 97.27%. Recommendations are given in section 5.2.

5.2. Recommendations
The following recommendations are made for further research and improvement.
• Amharic documents display real-word errors in addition to non-word spelling errors. Hence, there is a need for detection and correction of the real-word errors that can occur in Amharic documents;
• Dialectal variations, false geezims, assimilation and alternations are sources of error variation that are not handled in this thesis work. If a method that handles these issues is added to the input component, the performance might be better;
• The performance of the spelling error detection and correction algorithm, which is edit distance, needs to be compared with other identified spelling error correction techniques;
• Integrating this work with other Amharic NLP works, such as:
    ◦ Amharic search engine applications
    ◦ Amharic speech synthesis applications
• Automatic spelling error correction and suggestion.

REFERENCE
[1] Shewangizaw Gulilat, Design and Implementation of Spell Checker for Amharic. Addis Ababa, February 2009.
[2] ANRS Plan Commission, "Development Indicator of Amhara National State," p. 83, 2017.