1.7 Chapters summary
Chapter Two gives a literature review of tagging and discusses taggers and
different tagging stra...
Chapter Two
Literature Review
2.1 Corpora in European languages
In European languages and some other languages, there are m...
texts, some of them in several manuscripts, which adds up to a total of 289 texts
and close to three million word forms. Thes...
parses showing rough syntactic and semantic information - a bank of
linguistic trees. They also annotate text with part-of...
A second element that makes analysis of Arabic more complex than other languages
is the fact that the language is usually ...
(Figure 2-1), which she enclosed in her paper [16], it seems that the corpus is not well built,
since there are in that short pass...
It is worth mentioning here, that mistakes in manually tagged corpora are very
unfavorable, since these corpora are consid...


Freeman, from the Department of Near Eastern Studies at the University of
Michigan [11], reported that he is attempting t...
Chapter Three
Design
3.1 Tagsets and the adopted Arabic tagset
3.1.1 Tagsets
As mentioned in section 2.1, tagging requir...
Tag    Description          Example
VBP    Base present         Take
VB     Infinitive           To take
VBD    Past                 Took
VBG    Present participle   Tak...
3.1.2 The adopted tagset

This section describes the tagset adopted for our work. The tagset is based on the
Khoja tagset...
• Singular, masculine, accusative, common noun such as ktab “book” in the sentence
‘>x* alwld ktaba’ “the boy took a book”...
The linguistic attributes of nouns, adjectives, and numerals that have been used in
this tagset are:

(i) Gender: M [masculine], F [feminine]
(ii) Number: Sg [singular], Pl [plural], Du [dual]
(iii) Person: 1 [first], 2...
3.2 Corpora used for this work
Early in our work, we were faced with the unavailability of corpora for MSA text.
Even the on...
[Figure 3-4, Preliminary steps for tagging: original corpus -> review for errors, typing mistakes, etc. (manual) -> convert to Brill format (manual) -> transliterate (C progr...)]
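The transliteration step above maps Arabic script to ASCII. The transliterated examples later in the thesis (e.g. ">x* alwld", "aldktwrp") follow a Buckwalter-style scheme; the sketch below assumes a small fragment of such a table, inferred from those examples rather than taken from the thesis's actual C program:

```python
# Partial Buckwalter-style character table (an assumption: a handful of
# mappings inferred from the thesis's examples, e.g. alif -> "a", not the
# actual table used by the thesis's C program).
ARABIC_TO_ASCII = {
    "\u0627": "a",   # alif
    "\u0644": "l",   # lam
    "\u0648": "w",   # waw
    "\u062f": "d",   # dal
    "\u0623": ">",   # alif with hamza above
    "\u062e": "x",   # kha
    "\u0630": "*",   # dhal
}

def transliterate(text):
    """Replace each Arabic character with its ASCII code; leave anything
    not in the table (spaces, punctuation) unchanged."""
    return "".join(ARABIC_TO_ASCII.get(ch, ch) for ch in text)

print(transliterate("\u0627\u0644\u0648\u0644\u062f"))  # alwld ("the boy")
print(transliterate("\u0623\u062e\u0630"))              # >x* ("took")
```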
annotator to make it better resemble the truth. Each transformation has two components:
a rewrite rule and a triggering en...
process needs an initially annotated text. The input to the initial state annotator is an
untagged corpus, a running text,...
rules: each unknown word is first tagged with a default tag and then the lexical rules are
applied in order.
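The mechanics just described, an initial-state annotator followed by ordered transformations, each pairing a rewrite rule with a triggering environment, can be sketched in a few lines. The tags and the trigger below are Brill's classic English illustration (change NN to VB after TO), not one of the thesis's Arabic rules:

```python
def initial_tag(words, lexicon, default="NN"):
    """Initial-state annotator: known words get their lexicon tag; unknown
    words get a default tag. (A simplification: the thesis replaces this
    step with a Lex-based analyzer of Arabic morphology.)"""
    return [(w, lexicon.get(w, default)) for w in words]

def apply_transformations(tagged, rules):
    """Apply transformation rules in order. Each rule pairs a rewrite
    (from_tag -> to_tag) with a triggering environment, modeled here as a
    predicate over the tagged sequence and a position."""
    tagged = list(tagged)
    for from_tag, to_tag, trigger in rules:
        for i, (word, tag) in enumerate(tagged):
            if tag == from_tag and trigger(tagged, i):
                tagged[i] = (word, to_tag)
    return tagged

# Brill's classic example of a triggering environment:
# change NN to VB when the previous word is tagged TO.
rules = [("NN", "VB", lambda seq, i: i > 0 and seq[i - 1][1] == "TO")]
print(apply_transformations(initial_tag(["to", "run"], {"to": "TO"}), rules))
# [('to', 'TO'), ('run', 'VB')]
```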
There is one ...
[Figure 3-5, Lexical rule learning: unannotated corpus -> initial tagging -> annotated corpus; lexical rules are applied by the lexical tagger in a loop ("finished rules?" no/yes), then the context rules ...]
3.4 Testing strategies
Testing was done using the method of cross validation. Taking into consideration that
we do not have ...
Chapter Four
Implementation and Testing
4.1 Corpus
The corpus used for this study is part of a corpus of about 160,000 words ...
[Figure 4-1(a): sample sentences from the corpus in Arabic script, illegible in this copy]
‫ انامشرريوااالر زارقاال اضرراابال رراءاال رريفاالهجارفر ايفامياكررزااا...
results are obtained and/or enough time is spent on this point. At present
a truth corpus of over 38,000 words is reac...
4.2 Tagset
The tagset used in this work is a modified version of the tagset designed by Khoja,
fully described in [16] and...
b- Using different tags for the different plural forms, and hence the indication of
plural nouns is given the subtags PlbM...
For verbs, the modifications include:
Using distinct tags for defective verbs (الأفعال الناقصة) to capture the action the...
Making more distinctions is left for future work after studying more deeply the
need for such refinement.
It should be kep...
In this section we give a list of the resulting rules and explain how they are
interpreted, and the actual lexical and con...
learning and the third (about 13,000 words) for evaluation,
and the average of the accuracy for the three tests is taken
a...
No.  Rule                   Meaning                                                             Comments
1.   al haspref 2 NASgFGD   if a word has a prefix of two letters "al" then tag it as NASgFGD   "al" is a s...
12.  NCSgMGI t fhassuf 1 VPSg3F
     b deletepref 1 PPr_NCSgMGI
     (meanings, partly truncated: ... followed by "al" for definiteness; any word starting with "ll...)
No.  Rule                          Meaning
1.   NASgFGD d fhassuf 1 NCSgMGD   If a word tagged as NASgFGD ends with "d", tag it as NCSgMGD
2.   NCSg...
No.  Rule                               Meaning
1.   NCSgFAI NCSgFGI PREV1OR2TAG PPr    Change a tag from NCSgFAI to NCSgFGI if one of the two previous w...
11.  NASgFGD NCSgFGD PREVTAG NCSgMGI    Change a tag from NASgFGD to NCSgFGD if the previous word is tagged NCSgMGI (Arabic example illegible in this copy)
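The contextual rules above follow Brill's rule format: an old tag, a new tag, a trigger type, and a trigger argument. A minimal interpreter for the two trigger types that appear in these tables (PREVTAG and PREV1OR2TAG) might look like the following; this is a sketch of the semantics in the Meaning column, not the thesis's C implementation:

```python
def apply_context_rule(tags, rule):
    """Interpret one contextual rule string, e.g.
    "NASgFGD NCSgFGD PREVTAG NCSgMGI": retag NASgFGD as NCSgFGD when the
    previous tag is NCSgMGI. Only PREVTAG and PREV1OR2TAG are handled;
    changes are applied left to right, so later positions see earlier
    corrections."""
    from_tag, to_tag, trigger, arg = rule.split()
    out = list(tags)
    for i, tag in enumerate(out):
        if tag != from_tag:
            continue
        if trigger == "PREVTAG" and i >= 1 and out[i - 1] == arg:
            out[i] = to_tag
        elif trigger == "PREV1OR2TAG" and arg in out[max(0, i - 2):i]:
            out[i] = to_tag
    return out

print(apply_context_rule(["NCSgMGI", "NASgFGD"],
                         "NASgFGD NCSgFGD PREVTAG NCSgMGI"))
# ['NCSgMGI', 'NCSgFGD']
```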
Chapter Five
Results and discussion

5.1 Results
Below are the results for the performed tests. Each table illustrates a g...
Test     Training size (words)   Test size (words)
Test7    31422                   6261
Test8    31467                   6216
Test9    31634                   6049
Average  -                       -

No.
lexica...
Error type   Meaning
1            Interpreting a title as a common noun
2            Mistagging a broken plural
3            I...
(error types run through 15; the remaining meanings are truncated in this copy)
Words:  wzyr, wEql, mtHdvyn, w>hm, myna, >n, w>n, mst$ar, waDHp, tktml, aldktwrp, >Hyana, >bwabha
Tags (column alignment lost):  NTSgMGI, PC_NP, NAPlmAI, PC_NASgMGI, RF, PA, PC_P...
kmHwr:   PPr_NCSgMI / NCSgMI        (error type 8)
bSnaEp:  PPr_NCSgFI / PPr_NCSgMD    (error types 4, 13)
stqwm:   PA_VISg3F / PA_VISg3M      (error type 13)
Tyran, kEaml, rqmyn, wtzyd, >... (remaining rows truncated)
3. Distinction between sound masculine plural and dual nouns is not easy for
unknown nouns in Genitive and Accusative case...
words, for lexical and contextual learning respectively. Had we had a ready
corpus to work with, matters would have been differen...
ambiguous words comprise a maximum of 3% of the test corpora, and we do not
know the performance accuracy for the rest of ...
Transcript of "part of speech tagger for ARABIC TEXT"

Academy of Graduate Studies
Tripoli - Libya

PART OF SPEECH TAGGING OF ARABIC TEXT

By
Massaoud Abuzed Abolqasem Abuzed

March 2006
Abstract

Part of speech tagging is an important area of research in natural language processing. Although it has been well studied in several Indo-European languages, it is still not very well investigated with respect to Arabic. In this thesis, the Brill tagger and a modified version of the Khoja tagset, along with a corpus prepared for this purpose, are applied to tag Modern Standard Arabic (henceforth MSA) text.

The Brill tagger is a well-known public domain part of speech tagger, originally designed for tagging English text by implementing a machine learning approach through the method of transformation rules. It has been adapted to other languages, such as German and Hungarian, by many researchers. Some modifications had to be made to the learner and tagger, which are written partly in Perl and partly in C and run under the Unix/Linux operating system. The main change was made to the initial state tagger, which is used by both learner and tagger: a program was written using the lexical analyzer Lex to capture Arabic morphological structures, and then interfaced with both learner and tagger.

The tagset used in this work is a revised version of that introduced by Khoja. The revision included changing some of the tags for linguistic considerations and introducing new tags to make the set more powerful, or to make up for limitations in the original tagset that hinder tagging some words. The corpus was obtained from two Jordanian magazines and had to go through a series of editing steps.

A collection of lexical rules and contextual rules is produced by the learning system and applied to Arabic text. The tagging accuracy of the resulting tagged text is measured to be approximately 84% for both known and unknown words. This result may seem low, but taking into consideration the complexity of the language, the richness of the tagset, the fact that this is the first work that encompasses such a tagset for Arabic, and the fact that we did not have a reference corpus to base our work on, we consider the results very promising.
Acknowledgements

I would like to express my gratitude to:
Associate Professor Mohamed Arteimi, my academic supervisor, who guided me through this research and gave me his valuable advice.
The Department of Computer Science in the Academy of Graduate Studies, and personally Dr. Abdussalam Elmusrati, for his encouragement and help.
The Academy of Graduate Studies, and Dr. Saleh Ibrahim, for his encouragement by sponsoring this research through an academic scholarship.
And my family and friends, for their support and endurance.
List of Tables

Table (4-1)  A list of lexical rules
Table (4-2)  Examples of misleading lexical rules
Table (4-3)  A list of contextual rules
Table (5-1)  Accuracy for the original tagset
Table (5-2)  Accuracy for the complete modified tagset
Table (5-3)  Accuracy for the complete modified tagset with enlarged training corpora
Table (5-4)  Accuracy for the ungrammatized modified tagset
Table (5-5)  Types of errors
Table (5-6)  A sample of errors in grammatized tests
Table (5-7)  Percentage error for each error type in the grammatized tests
Table (5-8)  A sample of errors in ungrammatized tests
List of Figures and Illustrations

Figure (2-1)  Copy of the manually tagged excerpt sought by Khoja
Figure (3-1)  Example of a general classification tagset
Figure (3-2)  Example of a detailed tagset for verbs
Figure (3-3)  The entire Penn Treebank tagset
Figure (3-4)  Preliminary steps for tagging
Figure (3-5)  Lexical rule learning
Figure (3-6)  Context rule learning
Figure (3-7)  Tagging
Figure (4-1)  (a) A sentence from the corpus; (b) A transliteration of a sentence from the corpus
Figure (4-2)  Tagged and detransliterated sentence from the corpus
Figure (4-3)  Tags of plurals
Figure (4-4)  Tags of defective verbs
Contents

Abstract
Acknowledgements
List of Tables
List of Figures and Illustrations
Contents
Chapter One: Introduction
  1.1 Background
  1.2 Part-Of-Speech Tagging Methods
  1.3 Machine learning in POS tagging
    1.3.1 N-gram and Markov models
    1.3.2 Neural Networks
    1.3.3 Vector-based clustering
    1.3.4 Transformation-Based Learning
  1.4 Aims and objectives
  1.5 Tools used in this work
    1.5.1 Corpus
    1.5.2 Tagset
    1.5.3 Tagger
  1.6 Testing strategy
  1.7 Chapters summary
Chapter Two: Literature Review
  2.1 Corpora in European languages
    2.1.1 General Corpora
    2.1.2 Historical Corpora
    2.1.3 Annotated Corpora
  2.2 Arabic corpora
  2.4 Arabic taggers
  2.5 Definition of training and testing texts
Chapter Three: Design
  3.1 Tagsets and the adopted Arabic tagset
    3.1.1 Tagsets
    3.1.2 The adopted tagset
  3.2 Corpora used for this work
  3.3 The Brill system
    3.3.1 Learner
    3.3.2 Tagger
  3.4 Testing strategies
Chapter Four: Implementation and Testing
  4.1 Corpus
  4.2 Tagset
    4.2.1 Nouns
    4.2.2 Verbs
    4.2.3 Particles
  4.3 The program
  4.4 Rules
    4.4.1 Lexical Rules
    4.4.2 Contextual Rules
  4.5 Testing
Chapter Five: Results and discussion
  5.1 Results
  5.2 Examples of errors in tagging
  5.3 Discussion
  5.4 Evaluation
  5.5 Accomplishments
Chapter Six: Conclusions and Future work
  6.1 Conclusion
  6.2 Future work
References
APPENDIX A  Sample tagged sentences as compared to the truth corpus
APPENDIX B  The complete tagset (Tagset2)
APPENDIX C  The Lex file used for initial state tagger
Chapter One
Introduction

1.1 Background
It is very hard, or even impossible, to manually encode all the information needed to describe a human language that is necessary to build a system that will annotate text with structural descriptions [9]. Such work would need a lot of information concerning the type of grammar to be used, plus a great deal of morphological, lexical, and syntactic information about the language itself, all encoded in an algorithmic way for the intended system to handle. This is not an easy task; it would consume a lot of time and probably require a group of language experts. Even if achieved, it would be language specific and could not be applied to different languages.

For this reason, language processing has recently been tackled with different approaches. One of the fastest-growing approaches works through machine learning techniques. These techniques start with samples of manually annotated text, which should be reviewed very carefully to make sure they resemble the truth for the given language. Then, a learning system is applied to that text to figure out the cues for annotating the given words with the given annotation. These cues are then converted either to statistical information stating the probabilities of assigning a given annotation to a certain word according to its lexical structure and/or its location in the context, or to a collection of rules stating when and why to assign a given annotation to the word. Afterwards, another system, the tagger, is given raw text to be annotated, and goes through the text assigning annotations to the words according to the accompanying cues (probability figures, or rules).

Clearly the use of rules obtained from a learning system is preferable to the use of probability figures for the following two reasons:
1- Rules are easy to understand and can reflect directly the human understanding of the language.
2- Rules can be manipulated through changing, omitting, or adding some rules when doing so would enhance the annotation ability of the system.
For these reasons, we have chosen to use a rule-based machine learning system for our work.
Part-of-speech (POS) tagging means taking a text written in a human language and identifying its lexical and/or syntactic structure by assigning to each word/token in the text the correct part of speech, such as noun, verb, adjective, or adverb. Furthermore, the tags give, in many cases, additional features, such as number (singular/plural), tense, and gender, thus changing the raw (unannotated) text into an annotated or tagged corpus. This process of tagging requires a set of tags that classify words according to their lexical and syntactic meanings. This set is referred to as a tagset. Part-of-speech tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years [25].

The use of corpora has become an important issue in Language Engineering (LE), the field that deals with all the different ways of handling natural languages computationally. There are many ways to deal with corpora. These include the use of single-language corpora, which are annotated to reflect some information about the language structure, and parallel corpora, i.e. corpora of the same text written in two or more different languages, where at least one of the corpora is annotated, to help annotate the other corpora, or to help extract some information from them. Both kinds are valuable sources of linguistic metaknowledge, which forms the basis of techniques such as tokenization, POS tagging, and morphological and syntactic analysis, which in turn can be used to develop LE applications [9].

An annotated corpus is a corpus that has had some level of linguistic detail added to the raw data. For example, the Penn Treebank [41] is an annotated corpus, because it contains the linguistic structure and part-of-speech tags for the words in the corpus. A tagged corpus is more useful than an untagged corpus because there is more information there than in the raw text alone. Once a corpus is tagged, it can be used to extract information from it. This can then be used for creating dictionaries and grammars of languages using real language data. Tagged corpora are also useful for detailed quantitative analysis of text [22]. Other applications of part-of-speech tagging include speech recognition [14], enhancing input methods [6], machine translation [24], and discovering errors in OCR files [20].
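A tagged corpus of the kind described here is commonly stored one token at a time as word/TAG pairs (the "Brill format" mentioned later in Chapter Four uses this convention). The sketch below reads one such line; the Arabic tags shown are illustrative choices shaped like the tags used later in the thesis, not tags taken from its corpus:

```python
def parse_tagged(line):
    """Split a "word/TAG word/TAG ..." line into (word, tag) pairs.
    rsplit("/", 1) keeps any "/" inside the word itself intact."""
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

# Illustrative tags (hypothetical) on the thesis's transliterated example
# sentence ">x* alwld ktaba" ("the boy took a book").
print(parse_tagged(">x*/VPSg3M alwld/NCSgMGD ktaba/NCSgMAI"))
```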
1.2 Part-Of-Speech Tagging Methods
It has recently become clear that extracting linguistic information from a sample text corpus automatically can be an extremely powerful method for building accurate natural language processing systems [9]. There are several part-of-speech taggers that are widely used for Indo-European languages, all of which are trained and retrainable on text corpora. Structural ambiguity can be greatly reduced by adding empirically derived probabilities to grammar rules and by computing statistical measures of lexical association. Word sense disambiguation can, in some cases, be done with high accuracy when all information is derived automatically from corpora. An effort has recently been undertaken to create automated machine translation systems, where the linguistic information needed for translation is extracted automatically from aligned text corpora [22]. These are just some of the recent applications of corpus-based techniques in natural language processing. Along with great research advances, the infrastructure is in place for this line of research to grow even stronger. With on-line corpora, the use of corpus-based natural language processing is growing, producing better performance, and becoming more readily available. There is a worldwide trend to annotate large corpora with linguistic information, including parts of speech.

Many techniques have been used to tag English and other European language corpora, such as:
1- Rule-based technique: used by Greene and Rubin in 1970 to tag the Brown corpus. They designed the tagger TAGGIT [13], which used context-frame rules to select the appropriate tag for each word. It achieved an accuracy of 77%. More recently, interest in rule-based taggers has re-emerged with Eric Brill's tagger, which used another type of rules called transformation rules (Section 3.3) and achieved an accuracy of 97.5%.
2- Hidden Markov models: used in the 1980s to select the appropriate tag. Examples of such taggers are:
i. CLAWS [12], which was developed at Lancaster University and achieved an accuracy of 97%
ii. The Xerox tagger [38], developed by Doug Cutting, which achieved an accuracy of 96%
3- Hybrid taggers: these use a combination of both statistical and rule-based methods. This approach achieved an accuracy of 98% as reported by Tapanainen and Voutilainen [31], who used both techniques separately, then aligned the output.

1.3 Machine learning in POS tagging
Machine learning deals with acquiring knowledge from an environment in a computational manner, in order to improve performance. Many factors have contributed over the past couple of decades to the blending of ML and NLP. These factors include the ever expanding availability of large corpora, more powerful computing resources, and a greater demand for natural language based applications [27]. This has led to the use of many machine learning techniques in natural language processing, and in particular in part-of-speech tagging [34]. Since the method we are using in our work belongs to these techniques, we shall give here a more detailed idea of these methods.

1.3.1 N-gram and Markov models
A Markov model of a sequence of states or symbols (e.g. words or part-of-speech tags) is used to estimate the probability or "likelihood" of a symbol sequence. It can be used for disambiguation, e.g. for choosing the most likely tag for an ambiguous word in a given context, by estimating the probability of every candidate sequence. A Markov model applies the simplifying assumption that the probability or "likelihood" of a long sequence or chain of symbols can be estimated in terms of its parts or n-grams. Hidden Markov Models (HMMs) [18] are a variant of Markov models including two layers of states: a visible layer corresponding to input symbols (e.g. words) and a hidden layer learnt by the system, corresponding to broader categories (e.g. word classes). Markov or n-gram models have been widely used for part-of-speech tagging, following their successful use in tagging the LOB Corpus [19].
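The bigram (first-order Markov) assumption just described can be made concrete in a few lines: the likelihood of a tag sequence is the product of the transition probabilities between adjacent tags. The transition table below holds made-up toy numbers for illustration, not estimates from any corpus:

```python
def bigram_score(tags, transitions):
    """Likelihood of a tag sequence under a bigram (first-order Markov)
    model: the product of P(t_i | t_{i-1}), with "<s>" as the start state.
    Unseen transitions get probability 0.0 (no smoothing in this sketch)."""
    p = 1.0
    prev = "<s>"
    for t in tags:
        p *= transitions.get((prev, t), 0.0)
        prev = t
    return p

# Toy transition probabilities (made-up numbers, for illustration only).
trans = {("<s>", "NOUN"): 0.4, ("<s>", "VERB"): 0.6,
         ("VERB", "NOUN"): 0.7, ("NOUN", "NOUN"): 0.2}

# Under this toy table a verb-initial reading outscores a noun-initial one,
# loosely mirroring the common verb-first order of Arabic sentences.
print(bigram_score(["VERB", "NOUN"], trans))  # 0.6 * 0.7
print(bigram_score(["NOUN", "NOUN"], trans))  # 0.4 * 0.2
```

A full HMM tagger would combine these transition probabilities with per-word emission probabilities and pick the best sequence with the Viterbi algorithm; this sketch shows only the scoring step.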
1.3.2 Neural Networks
Neural networks (NNs) have been widely explored in Artificial Intelligence and have been studied for many years in the hope of achieving human-like performance in many fields. There are many rules used in the learning process of neural networks. The type of learning in a neural network is determined by the manner in which the parameters change. This can happen with or without the intervention of a supervisor; hence, neural networks are divided into three groups: supervised learning, unsupervised learning, and reinforcement learning.

Neural networks typically consist of multiple layers of nodes, where the lowest layer is the input layer, the highest is the output layer, and the layers in between are the hidden layers. Nodes of adjacent layers are connected via weighted links. The weights on these links are manipulated using a special function, so that the given input produces the desired output. When this stage is reached, the given weights on the links are recorded, or learnt, as the proper values for the given input to produce the desired output. In part of speech tagging applications, the input consists of all the information the system has about the parts of speech of the current word, i.e. all its possible tags, the tags of a certain number (p) of the preceding words, and the tags of another number (f) of the following words. The output of the network would be the appropriate tag of that word in this context, and the weights on the links would be adapted accordingly. When the learning process is done, the tagger will have a huge number of weights, along with their tag sequences, to be applied to tag new texts.

1.3.3 Vector-based clustering
This approach uses co-occurrence statistics to construct vectors that represent word classes or meanings by virtue of their direction in multi-dimensional word-collocation space. For example, Atwell [4] annotated each word in a sample from the LOB Corpus with a vector of neighboring word types; words with similar vectors were clustered into word classes. A method for calculating semantic word vectors is to use random labeling of words in narrow context windows to calculate semantic context vectors for each
word type in the text data. Incorporating linguistic information in the context vectors can enhance the results.

1.3.4 Transformation-Based Learning
Brill has developed a symbolic machine learning method called Transformation-Based Learning (TBL) [7,8,9]. Given a tagged training corpus, Transformation-Based Learning produces a sequence of rules that serves as a model of the training data. To derive the appropriate tags, each rule may be applied to each instance in an untagged corpus in a specific order. TBL relies heavily on a large annotated training corpus and reasonable default heuristics to get things started. It learns rules that are clearly coupled to human understanding of a natural language, and allows rules to be easily acquired for different domains or genres. There is a gap between an initial semantic network generated from input data and a semantic network representing profound knowledge, from which a knowledge database can be constructed. By using transformation rules, the semantic analysis method is based on pattern matching with a semantic network. A transformation rule description language allows users to manipulate their knowledge base and to define rules.

1.4 Aims and objectives
The main purpose of this research work is to produce a system that can correctly tag Arabic words with high accuracy, utilizing a set of available tools after modifying them to suit our purposes. These tools are the corpus, the tagset, and the tagger.

1.5 Tools used in this work
1.5.1 Corpus
Most of the research on tagging for other languages has pretagged standard corpora to work on and to test the performance of the systems against. But for Arabic the case is different: no standard corpora are available. This doubles the burden on anyone who wants to work on this subject; instead of concentrating on the tagger, one has to shift part of one's attention to preparing a large enough truth corpus tagged with the chosen tagset, a task which is tedious and time consuming. The lack of an easily available standard tagged Arabic corpus was the motivation for this work.

At the beginning of this study, the researcher intended to work on morphological analysis of Arabic by machine learning, but while reviewing the literature he discovered the unavailability of a dependable tagged corpus, which is one of the basic requirements for such a study. He found that most researchers in the field complain of this problem. So he decided to start from scratch and work in the direction of providing such a corpus. For this purpose, the researcher started with a raw corpus and made some revisions and a series of automatic taggings and manual corrections until the study reached satisfactory results. Because of time limitations, the size of the corpus reached is moderate and not as large as one would wish. The corpus used for this study is derived from a raw corpus whose data are articles from two Jordanian journals, Aldustur and Aldustur Aleqtesady, but it had to go through extensive preprocessing, which will be explained in detail in Chapter Four.

1.5.2 Tagset
We adopted the Khoja detailed tagset, a morphosyntactic tagset that is very rich and comprehensive for Arabic, and hence hard to deal with, whether manually or automatically. The original tagset consists of 177 tags, and this number is heavily increased by the fact that we do not use a stemmer for the tagging system, so a further group of composite tags is introduced to make up for composite words. These tags can be composites of two, three, or even four basic tags. This tagset was revised by introducing new tags and refining some of the original tags. That included distinguishing between plural forms (beneficial for morphological studies) and recognizing defective verbs (beneficial for syntactic studies). This modification raised the number of basic tags to 319. The complete new tagset is shown in Appendix B. Another subset of the resulting tagset is introduced by removing case information, thus gaining two advantages: decreasing the size of the tagset and, more importantly, getting rid of some complexity, leading to better accuracy as will be seen in Chapter Five.
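Because no stemmer is used, a composite word carries an underscore-joined composite tag (for example PPr_NCSgMGI, a prepositional prefix joined to a noun tag, as seen in the error tables of Chapter Five). Recovering the basic tags is then a simple split; a sketch:

```python
def split_composite(tag):
    """Split an underscore-joined composite tag into its basic tags.
    Composite tags in this scheme join two to four basic tags; a basic
    tag comes back unchanged as a one-element list."""
    return tag.split("_")

print(split_composite("PPr_NCSgMGI"))  # ['PPr', 'NCSgMGI']
print(split_composite("NCSgMGI"))      # ['NCSgMGI']
```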
  15. 15. Another set of tests was performed on the original tagset as well, where we noticed that very little gain in accuracy was achieved by modifying the tagset. But it should be kept in mind that the main purpose of modifying the tagset was not shooting for better accuracy, rather it was looking for clarity of tags and having more features for some of the tags. In fact it was expected to lose part of the accuracy for this reason, and we were willing to sacrifice it. 1.5.3 Tagger The tagger used for this study is the Brill tagger, which will be introduced in detail in Section 3.2, a tagger that is based on the transformation rule method. This tagger was originally designed for tagging English text, and had been adopted by many researchers for other languages like Hungarian [23], and German [28,33]. The reasons for choosing this tagger are: 1. The source code is available, and written mostly in a common language (C), which makes the modification possible. 2. It is based on transformation rules, which makes it possible to adapt to other languages. 3. The use of transformation rules also makes it easy to understand the underlying reasons behind choosing certain tags (see Section 4.4), and easy to modify the rules and/or omit some of them if needed. This is in contrast to using statistical taggers (Section 2.2), where information is converted into a huge set of numbers, representing the probabilities of choosing a specific tag for each word. A lot of work has to be done for adapting the tagger to our purposes, which includes: 1. Manually tagged Arabic corpus has to be prepared, since we have to start from scratch. This corpus is then enlarged in many steps. 2. Since the original system is written for Unix, and makes use of some of the facilities thereof, we first attempted to convert it to the DOS environment, being more common to us, and in our academic environment here. A lot of work was done in this direction but many problems were encountered. 
The latest and hardest of these was the fact that Turbo C under DOS did not deal with extended RAM
explicitly, as is the case for C under Unix. So at last we decided to switch to Unix, a task that also faced many obstacles in the beginning but worked out smoothly in the end. However, we still have an ambition, even after the completion of this project, to switch back and produce a working DOS/Windows version.
3. The original code mixes C in most of its parts with Perl in some others, especially the lexical learner, which we had to work on. Perl was a new language for the researcher, so some effort went first into learning as much of Perl as was needed, and then into making an efficient change to the learner so that it uses the program generated by Lex for the lexical analysis of the corpus. What took most of our time and effort here is the fact that exactly the same changes had to be made in both the learner, which is written in Perl, and the tagger, written in C.

1.6 Testing strategy

Testing was done using cross validation. Because no standard reference (truth) corpus was available, we had to settle for a rather small corpus for this purpose. The corpus we prepared for learning was divided into three parts, and three tests were performed, each using two thirds of the whole corpus for training and the remaining third for testing, rotating the parts each time; the average of the results was then taken. At this stage we used a total corpus of 38,000 words, so every test involves about 25,000 words for training and 13,000 words for testing. This whole experiment was done three times: once for the original tagset, once for the modified tagset, and once for the tagset without grammatical information. That means three sets of corpora and three learning/tagging systems, each using the appropriate tagset, were prepared. The rather small size of the corpus is justified by the lack of a standard tagged corpus.
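The three-fold cross validation described above can be sketched as follows. This is our own illustration, not code from the Brill system; `train` and `evaluate` are placeholder callables standing in for the learner and for accuracy measurement.

```python
# Sketch of the three-fold cross validation described above (our own
# illustration, not part of the Brill system). `train` builds a model from
# a training portion; `evaluate` returns the tagging accuracy on a test
# portion -- both are placeholder callables.

def three_fold_accuracy(corpus, train, evaluate):
    """Train on two thirds, test on the remaining third, rotate, average."""
    n = len(corpus)
    parts = [corpus[:n // 3], corpus[n // 3:2 * n // 3], corpus[2 * n // 3:]]
    accuracies = []
    for held_out in range(3):
        test_part = parts[held_out]
        train_part = [item for i, part in enumerate(parts)
                      if i != held_out for item in part]
        model = train(train_part)
        accuracies.append(evaluate(model, test_part))
    return sum(accuracies) / 3.0
```

With a 38,000-word corpus this gives roughly 25,000 training and 13,000 test words per fold, matching the figures above.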
This is the best we could reach with the time and effort available, and we think we achieved very promising results that can be enhanced by many improvements, including enlarging the learning corpus. This work is probably the first real step toward a standard Arabic corpus tagged with a rich and comprehensive tagset, not forgetting the contribution of Khoja, who provided the baseline for our work.
1.7 Chapters summary

Chapter Two gives a literature review of tagging and discusses taggers and different tagging strategies, concentrating on the efforts made for Arabic, in terms of the three parts of a tagging system: corpora, tagsets, and taggers. Chapter Three describes the original tools chosen for this work, namely the Khoja tagset and the Brill tagger, giving a detailed idea of their form and the way they are designed; it then describes the strategy used for testing. Chapter Four explains our contribution in modifying the tagset, preparing the corpus, and adapting the tagger to fit our needs. Chapter Five gives the tests and results of our experiments: first the average accuracies of each of the three performed tests, then a discussion of the types of errors encountered, their causes, and suggested solutions. Chapter Six gives the conclusion of the work and suggests future expansions.
Chapter Two
Literature Review

2.1 Corpora in European languages

In European and some other languages there are many famous, standard corpora available to researchers, either for extracting information of interest to their fields of study or as references for testing their tagging strategies. Below is a list of just a few examples of such corpora:

2.1.1 General Corpora
 The Brown Corpus: a corpus of written American English, with a corresponding British corpus, the Lancaster-Oslo/Bergen corpus (LOB) [19], a corpus of written British English. The Brown corpus was compiled in the 1960s, while its British counterpart was compiled in the 1970s. Both consist of around one million tokens (i.e. words, counted every time they appear). The Brown corpus was used in seminal linguistic and psycholinguistic research involving word frequency, and continues to be used today. It comes as raw text, tagged, and parsed.
 BNC: The British National Corpus (BNC) [40,42] is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. Because of its large size and its sampling of both written and spoken language, the BNC is very good for research involving lexical frequency: words with very low frequency are more likely to occur in a 100 million word corpus than in a 1 million word corpus.
 The Amsterdam Corpus (AC): This corpus [30] was compiled in the beginning of the 1980s by a group of scholars directed by Anthonij Dees and resulted in the Atlas des formes linguistiques des textes littéraires de l'ancien français. The electronic version of the AC was provided by Piet van Reenen (Free University of Amsterdam). It contains about 200 different
texts, some of them in several manuscripts, which adds up to a total of 289 texts and close to three million word forms. These forms have been manually annotated with 225 numeric tags encoding part-of-speech and other morphological categories (e.g. “566” for verb, future tense, 3rd person, plural).

2.1.2 Historical Corpora
 Helsinki Corpus: The Helsinki Corpus of English Texts: Diachronic and Dialectal [39] is a computerized collection of extracts of continuous text. The corpus contains a diachronic part covering the period from c. 750 to c. 1700 and a dialect part based on transcripts of interviews with speakers of British rural dialects from the 1970s. The aim of the corpus is to promote and facilitate the diachronic and dialectal study of English, as well as to offer computerized material to those interested in the development and varieties of the language. The uses for such a corpus are fairly obvious: it serves diachronic research, whether one is interested in lexical frequency, semantics, syntax, etc. This corpus also has a parsed version.

2.1.3 Annotated Corpora
 Celex: lexical databases of English, Dutch, and German [40]. This corpus contains ASCII versions of CELEX, developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. It contains detailed information on orthography, phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), morphology (derivational and compositional structure, inflectional paradigms), syntax (word class, word class-specific subcategorizations, argument structures), and word frequency (summed word and lemma counts, based on recent and representative text corpora). It is thus useful for various types of linguistic and psycholinguistic research.
 The Penn Treebank.
The Penn Treebank Project [41] annotates naturally occurring text for linguistic structure. Most notably, they produce skeletal
parses showing rough syntactic and semantic information - a bank of linguistic trees. They also annotate text with part-of-speech tags, and, for the Switchboard corpus of telephone conversations, with dysfluency annotation. The Penn Treebank project has annotated the Switchboard Corpus, the Wall Street Journal Corpus, the Chinese Journal Project, the Brown Corpus, and the Helsinki Corpus (among others). It is very useful for syntactic research, or any research involving the syntactic/semantic relationships between words. Tgrep (tree-grep) is a useful tool for use with this corpus.
 There are corpora available for many other languages. Examples include: American English [35], German [29], Hungarian [23,26], Swedish [5,25], and Hebrew [15]. More information about all these corpora and others can easily be found on the Internet.

2.2 Arabic corpora

A number of electronic Arabic text corpora have been compiled [32], but these corpora are raw, which means that their exploration remains problematic. Some analyses conducted on these corpora involve very limited data. Others have developed proficient word-form analyzers, such as the analyzer by the Xerox European Research Centre, but the question remains whether these analyzers provide an adequate solution for the exploration of Arabic tagged corpora. In order to explore corpora in an efficient and economically reliable way, some preliminary operations ought to be made [32]. As is generally known, analyzing Arabic corpora is more complex than analyzing other corpora, for three main reasons. In the first place, the Arabic language is very polysemic, much more so than, for example, Dutch. In Dutch, one way to create new words is by joining two words together to obtain a compound. These new words are very widespread, but are also identifiable by a computer in a simple way, i.e.
by defining a word as a string of characters between two blanks. In Arabic, new meanings for words are often created by expanding the older meaning of an existing word. This means that the external morphological form of the word does not change, even though the word carries a new meaning.
A second element that makes analysis of Arabic more complex than that of other languages is the fact that the language is usually not vocalized, which means that the degree of ambiguity of words as separate units is much greater than, e.g., in English or Dutch. Words in their raw form can belong to different grammatical categories, as seen in the string of characters "ktb". This string stands for the verb "kataba" (to write) as well as for the plural "kutub" (books). This complicates searching for words in a corpus of texts. In the third place, the problem is complicated by the fact that in Arabic a number of prefixes and suffixes are directly attached to the word, which makes searching by computer even more complex. For example, the string of characters "fhm" can stand for the verb "fahima" (understood), but it can also stand for the particle and suffix "fahum" (since they) or for the particle and verb "fahamma" (then he considered). These facts and others are behind the lack of tagged Arabic corpora. One of the researchers in this line [11] noticed: “the frustrating reality was that the NLP experts with experience in dealing with European languages and scripts deemed the problem [of providing tagged corpora and taggers for Arabic] trivial and therefore not worth wasting time on. While the available Arabic language experts had no computer experience and deemed the problem impossible to solve and therefore not worth wasting time on it”. This is true to some extent, but what is certainly true is that there are very few available corpora for the Arabic language. Some large corpora do exist, but unfortunately they are not free, and although some of them are marked up with XML or SGML tags, none of them are POS tagged [32,37]. There are some efforts towards the preparation of a POS-tagged corpus for Arabic, but they are still in their early and testing stages. One of these works is that of Shereen Khoja [16,17].
Although her work has some limitations and deficiencies, as will be explained below, it is probably the first step towards building an Arabic POS-tagged corpus. She introduced two tagsets: one very small, containing only five classes or basic tags (noun, verb, particle, residual, punctuation), and the other very comprehensive and appropriate for Arabic, containing more detailed tags (e.g. singular, masculine, definite common noun). She used the first tagset to manually tag 50,000 words of Arabic newspaper text. This type of tagging is obviously of little use, but she also tagged 1,700 words with the second tagset [37]. I sent many email messages to Miss Khoja hoping to get a copy of her tagged corpus and benefit from it, but unfortunately I did not receive any response. However, from the small excerpt (see
Figure 2-1) she enclosed in her paper [16], it seems that the corpus is not well built, since that short passage contains many mistagged items. Mistakes include the following (refer to Section 3.2.2 and to Appendix B for the meanings of the tags):
1. Mistagging adjectives as nouns; for example ‫ الشريفين‬is tagged as NCDuMGD instead of NADuMGD. There are many instances of this error.
2. Case information for nouns seems almost random. For example ‫الر ين اا‬ ‫مبناسرة االور ا‬ is tagged as PPr-NCSgFGI NCSgMAD NCSgMND instead of PPr-NCSgFGI NCSgMGD NASgMGD, and ‫ أعري االكرااالير‬is tagged as VPSg3M NCSgMND NCSgMAD instead of VPSg3M NCSgMND NASgMND.
3. Tagging singular as plural; for example ‫ لبالده‬is tagged as PPr_NCPlFGI_NPrPSg3M instead of PPr_NCSgFGI_NPrPSg3M.

Figure 2-1: copy of the manually tagged excerpt cited by Khoja

4. Tagging feminine as masculine; for example ‫ عر اأ كر االهاراي‬is tagged as PPr NCSgFNI NCPlMND instead of PPr NASgMGI NCPlFGD.
These are just a few examples of the mistakes found in the 48-word passage. Note also that some of the words cited in the above examples contain more than one type of mistake.
It is worth mentioning here that mistakes in manually tagged corpora are very unfavorable, since these corpora are considered to represent the truth and are used as guidelines for learning systems. If they are not carefully built, the whole system is a failure, regardless of how high the reported accuracy may be. We used the same detailed Khoja tagset to tag about 38,000 words, and have three versions of this corpus: one tagged with the original detailed tagset as proposed in [16], the second tagged with a modified version of that tagset, as explained in Section 4.2 and presented in Appendix B, and the third tagged with the modified version with the grammatical information removed. We do not claim perfection, but we think that our work, besides being much larger, is also much more accurate in applying the tagset to real Arabic text.

2.3 Arabic Taggers

Very few people have worked towards building a complete tagger for Arabic. The following cases, though the list is not complete, are among the best examples:
 Abuliel [1]: in his paper he described some preparatory steps for building an Arabic POS tagger. Rule-based techniques were used for finding phrases, analyzing the affixes of words, and discovering proper nouns. The tagset used in this work is not specified, and no results are reported concerning the overall performance of a tagging system.
 Alshalabi et al. [3] dealt with vowelized Arabic text and considered recognizing nouns only. This work showed how to discover nouns in the text but does not reach the stage of tagging. The fact that the system is constrained to vowelized text makes it deficient. Although they talked about part-of-speech tagging and gave a survey of taggers, they did not really do any tagging, nor did they give any tagset for this purpose.
They reported 95.4% accuracy, which is a good performance rate, but we should keep in mind that the system is constrained to completely vowelized words, which minimizes ambiguity, and that it is restricted to discovering nouns, which simplifies the classification task.
 Maloney and Niv [21] also worked with names only, in their name-recognizing system called TAGARAB.
 Freeman, from the department of Near Eastern studies at the University of Michigan [11], reported that he is attempting to adapt the Brill tagger to Arabic. He designed his own tagset for this purpose, started to do some morphological analysis, and explained the hurdles he encountered in that work. According to his paper he did not reach the stage of tagging, so he reports no accuracy rate.
 Khoja: the title of her paper [17] may lead one to conclude that she has a complete tagger. That misled us at the beginning of our work, but after carefully studying the paper we concluded that she had only done some preliminary work in this direction and was still working on the tagger. This was confirmed by consulting her website [37], where she declares: “As far as I know, a POS tagger has yet to be developed for Arabic, which is why I am developing one myself.”

2.4 Definition of training and testing texts

A corpus of over 38,000 words was prepared. Three versions of this corpus are available: one tagged with the original Khoja tagset, the second with a modified tagset as explained in Section 4.1, and the third tagged with a subset of the modified tagset which excludes the grammatical information, as explained in Section 4.5. Each of these corpora is divided into three equal portions, and cross validation is done three times, using a different two thirds of the corpus for training and the remaining third for testing each time. The average of the three tests is taken as the estimated performance accuracy of the tagger; nine tests are done in this way. In addition to these tests, three other tests were performed on the corpus tagged with the complete modified tagset, this time to test the effect of enlarging the corpus on the accuracy of the tagger. For these, about five sixths of the corpus are used for training and the remaining sixth for testing in each test, and the average is taken as an estimate of the overall accuracy.
Chapter Three
Design

3.1 Tagsets and the adapted Arabic tagset

3.1.1 Tagsets

As mentioned in Section 2.1, tagging requires a set of tags that classify words according to their lexical and syntactic functions, i.e. a tagset. Tagsets vary in size: some systems use fewer than 20 tags, while others use over 400. The larger the tagset, the more information is carried in each tag. For example, we may have a basic tagset which divides the words into a very small set of classes, as in Figure 3-1 below. We may refine this tagging by classifying nouns into singular and plural, verbs into present and past, and so on, as shown in Figure 3-2, which lists a subset of a refined tagset showing the different tags that belong to the general class verb in English. This can be refined further, as shown in Figure 3-3, which gives a complete list of the Penn Treebank tagset [41].

Tag  Description        Tag  Description
NN   Noun               JJ   Adjective
NNP  Proper noun        CC   Coord. conjunction
DT   Determiner         CD   Cardinal number
IN   Preposition        PRP  Personal pronoun
VB   Verb               RB   Adverb
-R   Comparative        -S   Superlative
-$   Possessive

Figure 3-1: example of a general classification tagset.
Tag  Description         Example
VBP  Base present        take
VB   Infinitive          to take
VBD  Past                took
VBG  Present participle  taking
VBN  Past participle     taken
VBZ  Present 3sg         takes
MD   Modal               can, will

Figure 3-2: example of a detailed tagset for verbs.

Figure 3-3: the entire Penn Treebank tagset
3.1.2 The adopted tagset

This section describes the tagset adopted for our work. The tagset is based on the Khoja tagset, as mentioned earlier. We introduce the tagset as described by its designer [20]. The modifications that are specific to our work are marked with an asterisk (*), and are discussed in detail in Section 4.2. The original tagset (Tagset1) contains 177 tags: 103 nouns, 57 verbs, 9 particles, 7 residual, and 1 punctuation. We derived two other tagsets: Tagset2, a modified version of Tagset1 containing 319 tags, and Tagset3, a simplified version of Tagset2 which excludes grammatical information, with 189 tags. The complete modified tagset (Tagset2) is given in Appendix B. A full description of each of the tags, with examples of Arabic words that take those tags, now follows. This description is based on that given by Khoja. The five main categories for words are:
1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
*5. punc [punctuation]
Note that category number 5 is preceded by an asterisk (*). This indicates a modification in the name of the category, or a completely new category (or subcategory), as shall be seen in subsequent examples. The residual category contains foreign words, mathematical formulae, and numbers. The punctuation category contains all punctuation symbols, both Arabic and foreign, such as (? ! ، ؟).
The subcategories of noun are:
1.1 C [common]
1.2 P [proper]
1.3 Pr [pronoun]
1.4 Nu [numeral]
1.5 A [adjective]
*1.6 T [title]
Adjectives are nouns that describe the aspects of an object. Adjectives inherit the properties of nouns, so they take “nunation” when indefinite and can take the definite article when definite. For example, alwld alSgyr “the small boy” contains the adjective Sgyr “small”. This adjective can take the definite article, as in ‘darasa alwaladu alSagyr’ “the small boy studied”, and it can also have “nunation”, as in ‘hasan Sgyr’ “Hassan is small”.
Examples of these subcategories include:
• Singular, masculine, accusative, common noun, such as ktab “book” in the sentence ‘>x* alwld ktaba’ “the boy took a book”.
• Singular, masculine, genitive, common noun, such as ktab “book” in the sentence ‘drst mn ktab’ “I studied from a book”.
• Singular, feminine, nominative, common noun, such as mdrsp “school” in the sentence ‘h*h mdrsp’ “this is a school”.
Note here and in subsequent examples that vocalization does not appear in the transliteration, because we do not assume we are dealing with vocalized text.
The subcategories of the pronoun are:
1.3.1 P [personal]
1.3.2 R [relative]
1.3.3 D [demonstrative]
The personal pronouns can be detached words such as ‘hw’ “he”, or attached to a word in the form of a clitic. The attached pronouns can be attached to nouns to indicate possession, to verbs as direct objects, or to prepositions, as in fyh “in it”. Some examples of pronouns include:
• Third person, singular, masculine, personal pronoun, such as hw “him”.
• Singular, feminine, demonstrative pronoun, such as h*h “this”.
The subcategories of the relative pronoun are:
1.3.2.1 S [specific]
1.3.2.2 C [common]
Examples of relative pronouns include:
• Dual, feminine, specific, relative pronoun, such as alltan “who”.
• Plural, masculine, specific, relative pronoun, such as al*yn “who”.
• Common, relative pronoun, such as ‘mn’ “who”.
The subcategories of the numeral are:
1.4.1 Ca [cardinal]
1.4.2 O [ordinal]
*1.4.3 Na [numerical adjective]
We preferred omitting subcategory 1.4.3 and adding the related tags to normal adjectives. This kind of adjective, however, is not very common, and we did not encounter any in the corpus we used. Examples of numerals include:
• Singular, masculine, nominative, indefinite cardinal number, such as ‘>rbEp’ “four”.
• Singular, masculine, nominative, definite ordinal number, such as ‘alrabE’ “the fourth”.
  29. 29. The linguistic attributes of nouns, adjectives, and numerals, that have been used in this tagset are: (i) Gender: M [masculine] (ii) Number: Sg [single] F [feminine] N [neuter] * Plm [masculine sound plural] * Plf [feminine sound plural] *Plb [broken plural] Du [dual] (iii) Person: 1 [first] 2 [second] 3 [third] (iv) Case: N [nominative] A [accusative] G [genitive] (v) Definiteness: D [definite] I [indefinite] Verbs are categorised into three main parts: 1. P [perfect] 2. I[imperfect] Iv [imperative] The definition of perfect verbs not only includes (i) the equivalent of English past tense verbs (i.e. to describe acts completed in some past time) but also (ii) describes acts which at the moment of speaking have already been completed and remain in a state of completion, (iii) describes a past act that often took place or still takes place (i.e. commentators are agreed (have agreed and still agree)), (iv) describes an act which is just completed at the moment by the very act of speaking it (I sell you this), and (v) describes acts which is certain to occur that it can be described as having already taken place (mostly used in promises, treaties and so on) [16]. The imperfect does not in itself express any idea of time; it merely indicates a begun, incomplete, or enduring existence either in present, past or future time. While the imperative verbs order or ask for something to be done in the future. Examples of verbs include: • First person, singular, neuter, perfect verb ‘ksrt’ )‫“(كسرت‬I broke”. • First person, singular, neuter, indicative, imperfect verb ‘>ksr’ (‫“ أكسر‬I break” ‫)أكس‬ ِ • Second person, singular, masculine, imperative verb ‘aksr’ (‫“ اكسر‬Break!” ‫)اكس‬ ِ The verbal attributes that have been used in our tagset are: 29
(i) Gender: M [masculine], F [feminine]
(ii) Number: Sg [single], Du [dual], Pl [plural]
(iii) Person: 1 [first], 2 [second], 3 [third]
(iv) Mood: I [indicative], S [subjunctive], N [neuter], J [jussive]
The two most notable verbal attributes that are fundamental to Arabic but do not normally appear in Indo-European tagsets are the dual number and the jussive mood.
The subcategories of particle are:
3.1 Pr [prepositions]
3.2 A [adverbial]
3.3 C [conjunctions]
3.4 I [interjections]
3.5 E [exceptions]
3.6 N [negatives]
3.7 A [answers]
3.8 X [explanations]
3.9 S [subordinates]
*3.10 dt [doubtive]
*3.11 cr [certain]
*3.12 Str [stressive]
*LM [lm]
*LN [ln]
Examples of particles include:
• Prepositions: fy (في) “in”
• Adverbial particles: swf (سوف) “shall”
• Conjunctions: w (و) “and”
• Interjections: ya (يا) “O”
• Exceptions: swY (سوى) “except”
• Negatives: la (لا) “not”
• Answers: nEm (نعم) “yes”
• Explanations: >y (أي) “that is”
• Subordinates: lw (لو) “if”
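The way a detailed tag composes the attributes above can be illustrated with a small decoder. This is our own sketch, not code from the Khoja tagset or the Brill system; the value tables simply restate the attribute lists given above, restricted to common nouns.

```python
# Illustrative decoder for a detailed common-noun tag such as "NCSgMGD"
# (common noun, singular, masculine, genitive, definite). The value tables
# restate the attribute lists above; the function and dict names are our
# own illustration.

GENDER   = {"M": "masculine", "F": "feminine", "N": "neuter"}
NUMBER   = {"Sg": "singular", "Du": "dual",
            "Plm": "masculine sound plural",
            "Plf": "feminine sound plural",
            "Plb": "broken plural"}
CASE     = {"N": "nominative", "A": "accusative", "G": "genitive"}
DEFINITE = {"D": "definite", "I": "indefinite"}

def describe_common_noun(tag):
    """Decode NC + number + gender + case + definiteness."""
    if not tag.startswith("NC"):
        raise ValueError("not a common-noun tag: " + tag)
    rest = tag[2:]
    # try longer number codes ("Plm") before shorter ones ("Sg")
    for num in sorted(NUMBER, key=len, reverse=True):
        if rest.startswith(num):
            gender, case, definite = rest[len(num):]
            return (NUMBER[num], GENDER[gender], CASE[case], DEFINITE[definite])
    raise ValueError("unknown number code in: " + tag)
```

For example, decoding "NCSgMGD" recovers the reading used in the mistagging discussion of Section 2.2: a singular, masculine, genitive, definite common noun.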
3.2 Corpora used for this work

Early in our work we were faced with the unavailability of corpora of MSA text. Even the ones we read about in some previous works were not easily available, besides not being well suited to our needs. We contacted some of the researchers, but only a few responded to our requests and questions. One of these responses provided a raw corpus of excerpts from two Jordanian magazines, containing about 160,000 words. To save time, we preferred working on this corpus rather than creating our own, even though the corpus needed some processing before it could be used in our experiments. These excerpts were provided as a Microsoft document in Arabic characters, which had to undergo a series of preparatory steps to be ready for use in our tagging task, as explained in detail in Section 4.1.

3.3 The Brill system

The Brill system is divided into two separate parts: the learner and the tagger. In the following subsections we explain the way each of these two programs works.

3.3.1 Learner

Before the process of learning starts, the truth corpus undergoes a series of preliminary operations to prepare a set of files that are necessary for learning. These operations are sketched in Figure 3-4 and explained in more detail in Section 4.3. Transformation-based error-driven learning, as shown in Figures 3-5 and 3-6, works as follows. First, unannotated text is passed through an initial-state annotator. Various initial-state annotators, representing different levels of complexity, have been used, including: the output of a stochastic n-gram tagger; labeling all words with their most likely tag as indicated in the training corpus; and simply labeling all words as nouns. For example, Brill gave two simple algorithms for this; one assigns to all unknown words the tag “NN” for common noun in the Penn Treebank tagset, and
the other assigns to every word in the corpus one of two tags: “NNP” for proper noun if the word starts with a capital letter, or “NN” otherwise. This strategy is based on the observation that common nouns constitute a high percentage of an English text. In this research we used a more detailed strategy, in which the pattern of the letters of a word is compared with a predefined set of patterns to determine which word class the word belongs to, making use of the rules of Arabic morphology (Srf). A tag is then assigned to the word accordingly. If the word does not match any of the standard patterns, it is assigned the tag “NCSgFGI”, which stands for “singular, feminine, genitive, indefinite common noun”, since this is the most probable tag for unknown words, as noticed while the manually tagged corpus was being prepared. The patterns used to tag unknown words are shown in Appendix C.

Figure 3-4: Preliminary steps for tagging. (The original corpus is reviewed for errors and typing mistakes [manual], converted to Brill format [manual], and transliterated [C program]; the result is tagged semi-automatically, divided in two [Brill Perl program], untagged copies are produced [Brill Perl program], and the final lexicon is prepared.)

Once text has been passed through the initial-state annotator, it is compared to the truth. A manually annotated corpus is used as our reference for truth. An ordered list of transformations is learned that can be applied to the output of the initial-state
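The pattern-based initial-state annotation described above can be sketched as follows. The two patterns shown are simplified stand-ins we invented for illustration; the actual pattern list is the one given in Appendix C.

```python
# A minimal sketch of our initial-state annotator: compare the word's letter
# pattern against a predefined pattern list and fall back to the default tag
# NCSgFGI. The two patterns shown are simplified stand-ins for illustration;
# the real list is given in Appendix C (transliterated, unvocalized text).

import re

PATTERNS = [
    (re.compile(r"^al"), "NCSgMND"),      # definite-article prefix "al"
    (re.compile(r"^m.{3}$"), "NCSgMGI"),  # an m--- pattern (noun of place)
]

DEFAULT_TAG = "NCSgFGI"  # most probable tag for unknown words

def initial_tag(word):
    """Return the tag of the first matching pattern, else the default."""
    for pattern, tag in PATTERNS:
        if pattern.search(word):
            return tag
    return DEFAULT_TAG
```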
annotator to make it better resemble the truth. Each transformation has two components: a rewrite rule and a triggering environment. A rewrite rule can be of the form:
X → Y, meaning “change the tag from X to Y”,
while a triggering environment can be of the form:
“al” hasprefix 2, meaning “if the current word has the 2-letter prefix ‘al’”.
Taken together, the transformation with this rewrite rule and triggering environment would be:
X → Y “al” hasprefix 2, meaning “change the tag of the current word from X to Y if it has the 2-letter prefix ‘al’”.
There are two types of rules: lexical rules and contextual rules. Therefore, there are two learners that have to be run consecutively. First, lexical rules are learned; then contextual rules are learned to refine the tags and make up for some divergences that may occur in applying the lexical rules. In both cases the learning procedure works in passes through the truth corpus, each pass learning the rule that, when applied, minimizes the errors in tagging the corpus as compared to the truth corpus. These rules are stored in a file in the order they are learned, giving two rule files: a lexical rule file and a contextual rule file. The tagger applies these rules in the same order to get similar results. Examples of both types of rules, obtained from the Arabic tagged corpus, are given in Section 4.4, with explanatory comments giving the meaning of each rule. The ideal goal of the lexical module is to find rules that can produce the most likely tag for any word in the given language, i.e. the most frequent tag for the word in question considering all texts in that language. The problem is to determine the most likely tags for unknown words, given the most likely tag for each word in a comparatively small set of words.
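Applying one such transformation can be sketched as follows. The rule representation (plain function arguments) is our own simplification; Brill's rule files use a comparable textual encoding.

```python
# Sketch of applying one lexical transformation of the form
#   X -> Y  if the current word has the 2-letter prefix "al".
# The representation is our own simplification of Brill's rule encoding.

def apply_hasprefix_rule(tagged_words, from_tag, to_tag, prefix):
    """Rewrite from_tag as to_tag for every word carrying the given prefix."""
    out = []
    for word, tag in tagged_words:
        if tag == from_tag and word.startswith(prefix):
            tag = to_tag
        out.append((word, tag))
    return out
```

The learner's job is to pick, on each pass, the (from_tag, to_tag, trigger) combination that most reduces the error against the truth corpus when applied this way.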
This is done by transformation-based learning (TBL) using three different lists: a list of word–tag–frequency triples derived from the first half of the training corpus, a list of all available words sorted by decreasing frequency, and a list of all word pairs, i.e. bigrams. Thus, the lexical learner module does not use running text. Once the tagger has learned the most likely tag for each word found in the annotated training corpus, and the rules for predicting the most likely tag for unknown words, contextual rules are learned for disambiguation. The learner discovers rules on the basis of the particular environments (or contexts) of word tokens. The contextual learning
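The first of those lists, the word–tag–frequency triples, directly yields the most likely tag per word. A sketch of that derivation (our own helper, not Brill's code):

```python
# Derive the most likely tag per word from word-tag-frequency triples,
# the first of the three lists used by the lexical learner. This helper
# is our own illustration, not code from the Brill system.

from collections import defaultdict

def most_likely_tags(triples):
    """triples: iterable of (word, tag, frequency). Return {word: best tag}."""
    counts = defaultdict(dict)
    for word, tag, freq in triples:
        counts[word][tag] = counts[word].get(tag, 0) + freq
    return {word: max(tag_counts, key=tag_counts.get)
            for word, tag_counts in counts.items()}
```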
process needs an initially annotated text. The input to the initial state annotator is an untagged corpus, a running text, which is the other half of the annotated corpus with the tagging information removed. The initial state annotator also uses a list of words, with a number of tags attached to each word, found in the first half of the annotated corpus. The first tag is the most likely tag for the word in question, and the rest follow in no particular order.

Figure 3-5: Lexical rule learning

With the help of this list, the list of bigrams (the same one used in the lexical learning module, see above) and the lexical rules, the initial state annotator assigns to every word in the untagged corpus its most likely tag. In other words, it tags the known words with the most frequent tag for the word in question. The tags for the unknown words are computed using the lexical
rules: each unknown word is first tagged with a default tag and then the lexical rules are applied in order. There is one difference compared to the lexical learning module, namely that the application of the rules is restricted in the following way: if the current word occurs in the lexicon but the new tag given by the rule is not one of the tags associated with the word in the lexicon, then the rule does not change the tag of this word.

Figure 3-6: Context rule learning

When tagging new text, an initial state annotator first applies the predefined default tags to the unknown words (i.e. words not in the lexicon). Then the ordered lexical rules are applied to these words. The known words are tagged with the most likely tag. Finally the ordered contextual rules are applied to all words.
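The lexicon restriction described above can be sketched as follows. The function name `initial_tag`, the rule-tuple shape, and the toy tags are illustrative assumptions, not the Brill distribution's API; the point is only the restriction: for a known word, a rule may assign only a tag already listed for that word.

```python
# Initial-state annotation with the restriction described above.
# `lexicon` maps word -> list of tags, the first being the most likely;
# `lexical_rules` are (from_tag, to_tag, prefix) triples in learned order.
def initial_tag(word, lexicon, default_tag, lexical_rules):
    known = word in lexicon
    tag = lexicon[word][0] if known else default_tag
    for from_tag, to_tag, prefix in lexical_rules:
        if tag == from_tag and word.startswith(prefix):
            if known and to_tag not in lexicon[word]:
                continue  # restriction: keep a lexicon-sanctioned tag
            tag = to_tag
    return tag

lexicon = {"ktab": ["NCSgMGI", "NCSgMAI"]}
rules = [("NCSgMGI", "VPSg3M", "k"),    # blocked for "ktab": not in its lexicon entry
         ("NN", "PC_NCSgMGD", "wal")]   # free to apply to unknown words
print(initial_tag("ktab", lexicon, "NN", rules))     # → NCSgMGI (rule blocked)
print(initial_tag("walktab", lexicon, "NN", rules))  # → PC_NCSgMGD (unknown word)
```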
Figure 3-7: Tagging

3.3.2 Tagger:
The tagger follows the same path as the learner. Starting from any raw text corpus given to it, it first applies the same initial state annotator as the one used in learning, so that the transformation rules work correctly. Then it uses the rule files produced by the learner to change initial tags to new tags. The rules are applied in the same order they were collected: first the lexical rules, and then the contextual rules.
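The three-stage order described here (initial-state annotation, then ordered lexical rules, then ordered contextual rules) can be sketched as a pipeline. All names below are illustrative stand-ins for the thesis tools, and the toy rules are hypothetical:

```python
# Sketch of the tagging pipeline: (1) initial-state annotation,
# (2) lexical rules in learned order, (3) contextual rules in learned order.
def tag_text(words, initial_annotator, lexical_rules, contextual_rules):
    tags = [initial_annotator(w) for w in words]   # stage 1
    for rule in lexical_rules:                     # stage 2
        tags = rule(words, tags)
    for rule in contextual_rules:                  # stage 3
        tags = rule(words, tags)
    return tags

# Toy instantiation: default-tag everything NN, one lexical rule tagging
# the preposition "fy", one contextual rule retagging NN after a PPr.
init = lambda w: "NN"
lex = [lambda ws, ts: [("PPr" if w == "fy" else t) for w, t in zip(ws, ts)]]
ctx = [lambda ws, ts: [("NCSgMGI" if t == "NN" and i > 0 and ts[i - 1] == "PPr" else t)
                       for i, (w, t) in enumerate(zip(ws, ts))]]
print(tag_text(["fy", "albyt"], init, lex, ctx))  # → ['PPr', 'NCSgMGI']
```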
3.4 Testing strategies
Testing was done using the method of cross validation. Taking into consideration that we do not have a large standard truth corpus, we had to manage with the corpus we tagged. This corpus is divided into three portions, each containing about 13,000 words, and the test is repeated three times, each time with a different third for testing and the other two thirds for learning; the average of the three tests is then taken as an overall measure of the performance of the system. This whole experiment is repeated using three versions of the tagset, and therefore three versions of the corpus:
1. Tagset1: the original detailed Khoja tagset [16], containing 177 tags.
2. Tagset2: the complete modified tagset of 319 tags (Appendix B).
3. Tagset3: a subset of Tagset2 from which grammatical information is excluded for nouns and imperfect verbs, reducing the number of tags to 185.
All three tagsets are drastically enlarged by the fact that the system we used does not apply stemming prior to the learning and tagging phases. Rather, it uses composite tags to tag composite words, a fact that introduces a new set of tags. As an example, consider the word balmdrsp (بالمدرسة). If stemming were applied, this word would be divided into two separate words, b and almdrsp, and would be tagged as b/PC almdrsp/NCSgFGD. But since we work without stemming, the word is treated as one unit and is tagged as balmdrsp/PC_NCSgFGD, introducing the new tag PC_NCSgFGD. Stemming would probably enhance the accuracy of the system, but it would divert our attention in other directions and put extra burdens on the users of the system.
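The three-fold rotation described above can be sketched as follows. `learn` and `evaluate` are placeholders for the Brill learner and an accuracy measurement, not real interfaces; the dummy versions at the end only demonstrate that each portion is tested exactly once.

```python
# Sketch of the three-fold cross validation used in this chapter: the
# corpus is split into three near-equal portions; each portion serves
# once as the test set while the other two are used for learning, and
# the three accuracies are averaged.
def cross_validate(sentences, learn, evaluate):
    k = 3
    folds = [sentences[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = learn(train)
        scores.append(evaluate(model, test))
    return sum(scores) / k

# Dummy learn/evaluate just to show the rotation: 6 training items and
# 3 test items per fold, so each fold scores 6/9.
acc = cross_validate(list(range(9)),
                     learn=lambda train: len(train),
                     evaluate=lambda model, test: model / (model + len(test)))
print(acc)
```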
Chapter Four
Implementation and Testing
4.1 Corpus
The corpus used for this study is part of an approximately 160,000-word corpus drawn from two Jordanian newspapers (Aldustor and Aldustor Aleqtsady). Any MSA corpus would have served the purpose, but this corpus was obtained at an early stage of the work and was used henceforth. A lot of preprocessing was needed before using the corpus. The corpus was originally a Microsoft Word document, so it had to undergo the following corrections and revision tasks to be ready for our work:
1. There were many typing, spelling, and grammatical mistakes, frequent enough in the text to hinder the process of tagging and to add to the problem of ambiguity, which is already an inherent problem of Arabic texts. These problems had to be fixed beforehand. Examples of such mistakes include:
a. Missing hamza, as in: اشار, اوضح, اقصي, احبار
b. Misplaced hamza;
c. هـ written instead of ة;
d. ي written instead of ى, or vice versa;
e. Typing mistakes;
f. Grammar mistakes, such as agreement and case errors.
2. Getting rid of passage numbers, titles, and end marks, to concentrate on complete sentences of text.
3. The text is then converted to an ASCII MS-DOS format.
4. Because of technical considerations, such as the different code pages used for representing Arabic characters and the use of software that does not support Arabization (especially the Lex analyzer and the Linux environment), it was decided to follow most of the previous lines of research in Arabic [e.g. 1,14,21] and use transliteration. For this purpose the Buckwalter transliteration scheme [36] is used, and a small C program was written to do this task.
5. The corpus is then edited to match the Brill format and copied to the Linux system for the rest of the processing.
6. It is then tagged using a program written with the help of the lexical analyzer LEX [2]. The resulting corpus, calculated to be about 43% accurate, is then revised manually. The result, which is taken to represent the truth, is then given to the learner of the Brill tagger to learn lexical and contextual rules, a step that also requires some other preparations, as explained in Section 3.3.
7. The above steps were performed initially on a corpus of about 1,000 words. After the rules were learned, a larger corpus was presented to the tagger, tagged, manually revised, and given to the learner to enhance the rule set. This process is repeated continuously, enlarging the truth corpus and enhancing the performance of the tagger simultaneously, until satisfactory
results are obtained and/or enough time has been spent on this point. At present a truth corpus of over 38,000 words has been reached. Figures 4-1 and 4-2 show sample sentences at different stages of the tagging cycle.

(a) A sentence from the corpus (Arabic script)

ElY ham$ >Emal almntdY almtwsTy lltnmyp wal*y Eqd fy alqahrp xlal |*ar aljary nZm almrkz almSry lldrasat alaqtSadyp wr$p Eml Hwl DEf almward alb$ryp waltdryb wtfDyl aldwl alErbyp llmntj al>jnby w>hm mEwqat altnafsyp ll$rkat fy almnTqp . wqd naq$t h*h alHlqp altTwrat almtlaHqp fy alaqtSad alEalmy walty >SbHt tfrD tHdyat ElY al$rkat .

(b) A transliteration of the sentence in (a) in the Brill format

Figure 4-1
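The character-for-character transliteration shown in Figure 4-1(b) can be sketched as a simple table lookup. The table below is a partial Buckwalter mapping written to follow the corpus sample above, which renders the letter alef as lowercase "a" (standard Buckwalter uses uppercase "A"); the original tool was a small C program, so this Python version is an illustration only.

```python
# Partial Buckwalter-style transliteration table (Arabic letter -> ASCII),
# following the variant visible in the Figure 4-1 sample (alef -> "a").
BUCKWALTER = {
    "\u0627": "a", "\u0628": "b", "\u062A": "t", "\u062B": "v",
    "\u062C": "j", "\u062D": "H", "\u062E": "x", "\u062F": "d",
    "\u0630": "*", "\u0631": "r", "\u0632": "z", "\u0633": "s",
    "\u0634": "$", "\u0635": "S", "\u0636": "D", "\u0637": "T",
    "\u0638": "Z", "\u0639": "E", "\u063A": "g", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u064A": "y",
    "\u0629": "p", "\u0649": "Y", "\u0621": "'", "\u0623": ">",
    "\u0625": "<", "\u0622": "|", "\u0624": "&", "\u0626": "}",
}

def transliterate(text):
    # Characters outside the table (spaces, punctuation) pass through.
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

# "ElY ham$" is the opening of the Figure 4-1 sample (Arabic: على هامش).
print(transliterate("\u0639\u0644\u0649 \u0647\u0627\u0645\u0634"))  # → ElY ham$
```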
4.2 Tagset
The tagset used in this work is a modified version of the tagset designed by Khoja, fully described in [16] and summarized in Section 3.1.1. The work of Khoja is highly esteemed, being the first comprehensive work on designing a tagset for Arabic that encompasses the richness and complexity of the language. Nevertheless, it has some limitations and mistakes, some of which are treated in this work, while others remain a task for future work. Modifications considered here include nouns, verbs, and particles.

Figure 4-2: Part of the sentence in Figure 4-1 after tagging and detransliteration

4.2.1 Nouns:
For nouns the following was done:
a- Avoiding the distinction between foreign names and Arabic names. Instead, all names, whether Arabic or foreign, are given the same tag NP (for proper noun). The tag RF (residual foreign) is kept to refer only to words of foreign languages written in Arabic characters. In the original tagset, the tag RF is given to all foreign names and words (see Figure 1-1, where the tags given to foreign names and foreign words can be compared).
b- Using different tags for the different plural forms: plural nouns are given the subtags PlbM, PlbF, Plm, and Plf for broken masculine plural, broken feminine plural, sound masculine plural, and sound feminine plural respectively, instead of just PlM and PlF for plural masculine and plural feminine. The table below (Figure 4-3) gives examples. Notice that in our set the gender is not repeated with sound plurals, since it is implicit in the plural form.

word | Original tag | New tag
ال ظي ن | NCPlMND | NCPlmND
العامكن | NCPlMGD | NCPlmGD
الشةيات | NCPlFND | NCPlfND
ال ارس | NCPlFND | NCPlbFND
الةن ك | NCPlMGD | NCPlbMGD

Figure 4-3: Tags of plurals

The last two characters of each tag are irrelevant here and are given only for completeness. Including this information is useful when the resulting tagged corpus is used for morphological studies.
c- Introducing some new tags.
d- Introducing another general category in addition to common nouns (NC) and adjectives (NA), namely title nouns (NT), such as وزير، أمين، السفير، الرئيس. This increases the tagset drastically, since each of these nouns can be singular or plural, masculine or feminine, definite or indefinite, and can take any of the three cases. But it helps in many cases to discover unknown proper nouns, which usually follow these titles.

4.2.2 Verbs:
For verbs the modifications include: using distinct tags for defective verbs (الأفعال الناقصة) to capture their effect on the case of the following nouns. Each such verb tag is therefore marked by a small d following the first two characters, as in Figure 4-4.

word | Original tag | New tag
ذهب | VPSg3M | VPSg3M
يذهب | VISg3MI | VISg3MI
كانت | VPSg3F | VPdSg3F
يصبحون | VIPl3MI | VIdPl3MI

Figure 4-4: Tags of defective verbs

4.2.3 Particles:
For particles, the modifications include: introducing a few tags to refine the tagging of some particles, and to make room for some particles not considered in the original tagset, namely Pcr and Pdt (for قد التحقيقية and قد التشكيكية respectively), Pst, PQ (for the interrogative particles, أدوات الاستفهام), LM (for لم), and LN (for لن). All these tags are added to help pick up information about the following words. Although these tags do contribute to refining the tagset, there is still a lot to be done with particles, since the available tags do not cover the wide range of meanings of particles in Arabic. For example, the prefix particle ف is now given the tag PC (for conjunctional particle), whereas it does not always function as one and sometimes has different meanings, especially when affixed to verbs (فاء السببية); the same applies to the prefix و. All particles that do not belong clearly to any of the available tags are given the general tag PA (for adverbial particle), regardless of the fact that some of them are not really adverbial, so the meaning of this tag should not be taken literally.
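Since the corpus is not stemmed, a cliticized word carries one composite tag built from the tags of its pieces joined with an underscore, as in the balmdrsp example of Section 3.4. The helper below is a hypothetical illustration of that convention, not a function from the thesis code:

```python
# Composite tags for unstemmed words: the tags of the clitic and the
# stem are joined with "_" into a single tag (illustrative helper).
def composite_tag(segment_tags):
    return "_".join(segment_tags)

# balmdrsp = b (particle) + almdrsp (definite fem. sing. common noun)
print(composite_tag(["PC", "NCSgFGD"]))    # → PC_NCSgFGD
# A preposition plus a definite masc. sing. common noun
print(composite_tag(["PPr", "NCSgMGD"]))   # → PPr_NCSgMGD
```

Each such combination is, in effect, a brand-new tag from the learner's point of view, which is why the tagset grows so sharply without stemming.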
Making more distinctions is left for future work, after studying more deeply the need for such refinement. It should also be kept in mind that the corpus we dealt with is not stemmed, so the tagging is done with composite tags, which introduces a new set of tags for composite words. For example, the word بالعرض is tagged as PPr_NCSgMGD, a completely different tag from either PPr or NCSgMGD, leading to a drastic theoretical increase in the tagset. Contrary to what was expected, this did not cause many problems with tagging accuracy, because the Brill tagger is powerful in dealing with prefixes and suffixes, and because composite words comprise only a small portion of an Arabic text (estimated at less than 6% in the data we worked on).

4.3 The program
The same Lex-based program, which was used for the initial tagging of the very first corpus, is now used as the start-state tagger for both the learner and the tagger of the Brill system. In the original system, initial tagging is done by a very simple routine, which assigns to every word either the tag NN for common nouns, or NP for proper nouns if the word starts with a capital letter. This start state suffices for English and similarly simple languages, but for Arabic we preferred to use another type of start-state tagger, in which each unknown word is checked for its syntactic structure and assigned an initial tag accordingly. For this purpose the Lex-based routine was used, after facing a lot of trouble getting it to work, especially since it has to be interfaced to both the lexical learner (written in Perl) and the tagger (written in C). The start-state routine is an important factor in getting accurate results, especially for unknown words, and the better it is designed to take care of word structures, the better the achieved results.
At present, the routine takes care of many morphological structures and relies on the statistical information gathered during manual tagging to assign the most probable tag to words that do not match any of the captured patterns.

4.4 Rules
In this section we give a list of the resulting rules and explain how they are interpreted, and the actual lexical and contextual information derived from them. It is worth mentioning that the obtained rules are based on majority behaviour, not on absolute truth. In other words, it is not necessarily the case that each rule applies to every situation in any MSA text; rather, it applies to most similar situations. As an example, consider rule number 9 in Table 4-3, which states:

NP NCSgMGI PREV1OR2TAG PPr

meaning that if a word tagged NP has a preposition as one of its two preceding words, then it should be retagged as a common noun rather than a proper noun. This rule was derived because, in the training corpus, applying it enhanced the accuracy of the tagger by minimizing the discrepancy between the starting corpus and the truth corpus. That does not mean the rule has no exceptions: it is easy to think of many exceptions to this rule, or to any other rule, but what counts is the overall effect of applying it.

4.4.1 Lexical Rules
Table 4-1 shows a list of lexical rules, together with the meaning of each rule and its interpretation in the context of Arabic morphology. Table 4-2 lists a group of rules that may be considered misleading: although they may enhance the tagging of the training corpus, they will surely have negative effects on the testing and real-life corpora.

4.4.2 Contextual Rules
Table 4-3 shows a list of contextual rules, together with the meaning of each rule and its interpretation in the context of Arabic morphology and syntax.

4.5 Testing
Many tests were performed to check the efficiency of the system:
• In the first group of tests, the truth corpus is divided into three portions of similar sizes, and then the cross-validation method is applied three times for each tagset, as explained below.
In each of the three tests, two portions of the corpus (about 25,000 words) are used in
learning and the third (about 13,000 words) for evaluation, and the average accuracy over the three tests is taken as the overall measure of the system's accuracy. This is performed on three types of corpora: one tagged with the original tagset (Tagset1) as introduced by Khoja, the second tagged with a modified version of it (Tagset2), as explained in Section 4.2, and the third tagged with the modified set excluding the grammar features (Tagset3). These three tagsets are defined in Section 3.4. The results of these tests are summarized in Tables 5-1, 5-2, and 5-4 respectively.
• To test the effect of enlarging the corpus size on accuracy, another group of corpora was prepared. Since we do not have a large reference corpus to work with, we had to reduce the size of the testing corpora in order to enlarge the training corpora. We therefore chose the size of the learning corpora to be about 31,000 words each, i.e. about five sixths of the complete corpus, with the remainder, over 6,000 words, as the test corpus. Three tests were performed this way, changing the test corpus each time and taking the average. The results of these tests are summarized in the tables of Section 5.1.
No. Rule — Meaning (Comment)
1. al haspref 2 NASgFGD — If a word has the 2-letter prefix "al", tag it as NASgFGD. ("al" is a sign of definiteness.)
2. at hassuf 2 NCPlfGD — If a word has the 2-letter suffix "at", tag it as NCPlfGD. ("at" is an ending of the feminine plural.)
3. NCSgMGI p fchar NCSgFGI — If a word tagged NCSgMGI contains the character "p", tag it as NCSgFGI. ("p" (التاء المربوطة) is a sign of the feminine.)
4. y haspref 1 VISg3MI — If a word has the 1-letter prefix "y", tag it as VISg3MI. ("y" (الياء) is a prefix of the imperfect verb.)
5. NCSgMGI l fhaspref 1 PPr_NCSgMGI — If a word tagged NCSgMGI has the 1-letter prefix "l", tag it as PPr_NCSgMGI. ("l" (اللام) at the beginning of a word is a preposition.)
6. NCSgMGI a fhassuf 1 NCSgMAI — If a word tagged NCSgMGI has the 1-letter suffix "a", tag it as NCSgMAI. (An "a" ending is a sign of the accusative case.)
7. NASgFGD p faddsuf 1 NASgMGD — If "p" can be added to a word tagged NASgFGD, tag it as NASgMGD. (A word cannot have two "p" (تاء مربوطة).)
8. NCSgMGI w fhaspref 1 PC_NCSgMGI — If a word tagged NCSgMGI starts with "w", tag it as PC_NCSgMGI. ("w" is a conjunctional particle.)

Table 4-1: A list of lexical rules
9. wal haspref 3 PC_NCSgMGD — Any word starting with "wal" is tagged PC_NCSgMGD. ("wal" (والـ) is a conjunctional particle followed by "al" for definiteness.)
10. ll haspref 2 PPr_NCSgMGD — Any word starting with "ll" is tagged PPr_NCSgMGD. ("ll" (للـ) is a preposition followed by "al" for definiteness.)
11. NCSgMGI t fhassuf 1 VPSg3F — If a word tagged NCSgMGI ends with "t", tag it as VPSg3F. ("t" (ت) is a suffix of the past tense verb, third person singular feminine.)
12. b deletepref 1 PPr_NCSgMGI — If removing the letter "b" from a word gives a word in the lexicon, tag the original word as PPr_NCSgMGI. (An attached "b" is a preposition.)
13. 0 char Rnu — A word containing the character "0" is tagged as a number. (Numeric.)
14. NCPlfGD al faddpref 2 NCPlfGI — If a word is tagged NCPlfGD and accepts the added prefix "al", tag it as NCPlfGI. (One cannot add "al" to an already definite word.)
15. PC_NCSgMGI S-T-A-R-T fgoodright PC_VPSg3M — If a word at the beginning of a sentence is tagged PC_NCSgMGI, tag it as PC_VPSg3M. (A sentence cannot start with the genitive case.)

Table 4-1: A list of lexical rules (continued)
No. Rule — Meaning (Comment)
1. NASgFGD d fhassuf 1 NCSgMGD — If a word tagged NASgFGD ends with "d", tag it as NCSgMGD.
2. NCSgMGD al> fhaspref 3 NCPlbMGD — If a word tagged NCSgMGD starts with "al>", tag it as NCPlbMGD.
3. NCSgMGI n fhassuf 1 NP — If a word tagged NCSgMGI ends with "n", tag it as NP.
4. NCSgMGI_NPrPSg3F <lY fgoodleft PPr_NPrPSg3F — If a word tagged NCSgMGI_NPrPSg3F is followed by إلى, tag it as PPr_NPrPSg3F. (Only a chance correlation.)

Table 4-2: Examples of misleading lexical rules
No. Rule — Meaning (Comment)
1. NCSgFAI NCSgFGI PREV1OR2TAG PPr — Change NCSgFAI to NCSgFGI if one of the two previous words is tagged PPr. (What follows a preposition is genitive.)
2. NCSgFGD NCSgFND PREV1OR2TAG VPSg3F — Change NCSgFGD to NCSgFND if one of the two previous words is tagged VPSg3F. (The subject of a verb is nominative.)
3. Pst PA NEXTTAG VISg3MI — Change Pst to PA if the next word is tagged VISg3MI.
4. VISg3MI VISg3MS PREVWD >n — Change VISg3MI to VISg3MS if the previous word is >n. (>n puts the imperfect verb in the subjunctive mood.)
5. NCSgMGD NCSgMND PREV1OR2TAG STAART — Change NCSgMGD to NCSgMND if the word is one of the first two words of the sentence. (A definite word beginning a sentence is nominative, not genitive.)
6. NCSgMGD NASgMGD PREVTAG NCSgMGD — Change NCSgMGD to NASgMGD if the previous word is tagged NCSgMGD. (What follows a definite noun is a definite adjective.)
7. NCSgMGI NCSgMNI PREV1OR2TAG STAART — Change NCSgMGI to NCSgMNI if the word is one of the first two words of the sentence. (An indefinite word beginning a sentence is nominative.)
8. NASgFGD NCSgFGD PREVTAG PC_NCSgMGI — Change NASgFGD to NCSgFGD if the previous word is tagged PC_NCSgMGI. (Distinguishing the second term of a genitive construct from an adjective.)
9. NP NCSgMGI PREV1OR2TAG PPr — Change NP to NCSgMGI if one of the two previous words is tagged PPr. (A preposition usually precedes an ordinary noun, not a proper name.)
10. NCSgFGI NCSgFNI PREVTAG VPSg3F — Change NCSgFGI to NCSgFNI if the previous word is tagged VPSg3F. (The subject of a verb is nominative.)

Table 4-3: A list of contextual rules
11. NASgFGD NCSgFGD PREVTAG NCSgMGI — Change NASgFGD to NCSgFGD if the previous word is tagged NCSgMGI. (Distinguishing the second term of a genitive construct from an adjective.)
12. NASgFGI NCSgFGI PREVTAG PPr — Change NASgFGI to NCSgFGI if the previous word is tagged PPr. (What follows a preposition is a noun, not an adjective.)
13. PA_VISg3FI NNuCaSgFAI CURWD stp — If the current word is stp, change its tag from PA_VISg3FI to NNuCaSgFAI. (A lexical exception: here st is the numeral ست rather than the future particle s attached to an imperfect verb.)

Table 4-3: A list of contextual rules (continued)
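Contextual transformations of the kind listed in Table 4-3 can be sketched as follows, using rule 9 (NP NCSgMGI PREV1OR2TAG PPr) as the example. The function name `prev1or2tag` is an illustrative assumption; as in the Brill tagger, the rule is applied left to right, so earlier changes in the sentence are visible to later positions.

```python
# Apply one PREV1OR2TAG contextual transformation: retag a word from
# from_tag to to_tag when either of the two preceding words carries
# trigger_tag. Applied left to right over the tag sequence.
def prev1or2tag(tags, from_tag, to_tag, trigger_tag):
    tags = list(tags)
    for i, t in enumerate(tags):
        if t == from_tag and trigger_tag in tags[max(0, i - 2):i]:
            tags[i] = to_tag
    return tags

# Rule 9: NP -> NCSgMGI if one of the two previous tags is PPr.
print(prev1or2tag(["PPr", "NP", "NP"], "NP", "NCSgMGI", "PPr"))
# → ['PPr', 'NCSgMGI', 'NCSgMGI']
```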
Chapter Five
Results and Discussion
5.1 Results
Below are the results of the performed tests. Each table illustrates a group of related tests using the method of cross validation. Table 5-1 gives the results for the original tagset, Table 5-2 for the modified tagset, Table 5-3 for the modified tagset with enlarged versions of the training corpora, and Table 5-4 for the modified tagset with the case (grammar) information removed.

Test | Training size (words) | Test size (words) | No. lexical rules | No. context rules | Tagging accuracy (%)
Test1 | 23834 | 13662 | 153 | 134 | 73.60
Test2 | 25372 | 12124 | 149 | 137 | 72.07
Test3 | 25786 | 11710 | 150 | 161 | 75.05
Average | - | - | - | - | 73.57

Table 5-1: Accuracy for the original tagset

Test | Training size (words) | Test size (words) | No. lexical rules | No. context rules | Tagging accuracy (%)
Test4 | 23834 | 13662 | 120 | 151 | 74.34
Test5 | 25372 | 12124 | 143 | 158 | 72.13
Test6 | 25786 | 11710 | 150 | 135 | 75.69
Average | - | - | - | - | 74.05

Table 5-2: Accuracy for the complete modified tagset
Test | Training size (words) | Test size (words) | No. lexical rules | No. context rules | Tagging accuracy (%)
Test7 | 31422 | 6261 | 174 | 190 | 75.72
Test8 | 31467 | 6216 | 176 | 162 | 75.39
Test9 | 31634 | 6049 | 167 | 148 | 77.16
Average | - | - | - | - | 76.09

Table 5-3: Accuracy for the complete modified tagset with enlarged training corpora

Test | Training size (words) | Test size (words) | No. lexical rules | No. context rules | Tagging accuracy (%)
Test10 | 23834 | 13662 | 151 | 83 | 83.89
Test11 | 25372 | 12124 | 148 | 116 | 82.64
Test12 | 25786 | 11710 | 145 | 106 | 85.10
Average | - | - | - | - | 83.87

Table 5-4: Accuracy for the ungrammatized modified tagset

5.2 Examples of errors in tagging
A sample of errors was taken from the error report file, consisting of 38 randomly chosen consecutive lines of the original text for the grammatized tagset. This sample contains 1079 words, 280 of which are tagged erroneously. The errors are categorized into fifteen types, as in Table 5-5 below, and the occurrences of each type are counted in the sample to get an idea of the percentage of each error type. Table 5-6 lists the erroneously tagged words of this sample and, for each word, its truth and erroneous tags and the type of error. Table 5-7 then summarizes the errors, their counts in the sample, and their percentages in descending order.
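The "Average" rows in the tables above are plain means of the three per-test accuracies, which can be checked directly (a trivial sketch; the figures are taken from Tables 5-1 and 5-3):

```python
# Reproduce the averages reported in Tables 5-1 and 5-3.
def mean(xs):
    return sum(xs) / len(xs)

table_5_1 = [73.60, 72.07, 75.05]   # Tests 1-3, original tagset
table_5_3 = [75.72, 75.39, 77.16]   # Tests 7-9, enlarged training corpora
print(round(mean(table_5_1), 2))    # → 73.57
print(round(mean(table_5_3), 2))    # → 76.09
```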
Error type | Meaning
1 | Interpreting a title as a common noun
2 | Mistagging a broken plural
3 | Interchanging an adjective and a common noun
4 | Interchanging definite and indefinite
5 | Interchanging sound plural and singular
6 | Interchanging verb with noun
7 | Grammatical error
8 | Error in composite tag
9 | Lexicon entry missing
10 | Interchanging dual with sound masculine plural
11 | Typing mistake
12 | Interchanging adverbial particle with stress particle
13 | Error in gender
14 | Taking a common noun for a proper noun
15 | Interchanging doubt particle and certainty particle

Table 5-5: Types of errors

Word | Truth tag | System tag | Type | Comments
almdyryn | NTPlmGD | NCPlmGD | 1 |
w>SHab | PC_NCPlbMGI | PC_NCSgMGI | 2 |
r&yp | NCSgFGI | NASgFGI | 3 |
alm&vrat | NCPlfGI | NCPlfGD | 4 |
m$rwEathm | NCPlfGI_NPrPPl3M | NCSgMGI_NPrPPl3M | 5 |
wDE | NCSgMGI | VPSg3M | 6 |
almdyryn | NTPlmGD | NCPlmGD | 1 |
alastratyjyp | NASgFGD | NCSgFGD | 3 |
$rwT | NCPlbMGI | NCPlbMAI | 7 |
wtnmyp | PC_NCSgFGI | PC_NCSgFAI | 7 |
mharat | NCPlfGI | NCPlfAI | 7 |
>salyb | NCPlbMGI | NCPlbMNI | 7 |
wastratyjyat | PC_NCPlfGD | PC_NCPlfGI | 4 |
w>hmyth | PC_NCSgFGI_NPrPSg3M | NCSgFGI_NPrPSg3M | 8 |
almdaxl | NCSgMGD | NCPlbMND | (7,2) |
al>rbEa' | RD | NCPlbMGD | 9 | lexicon
lmqablat | PPr_NCplfGI | PPr_NCPlfGI | 11 | mistype
astxdamha | NCSgMNI_NPrPSgF | NCSgMNI_NPrPSg3F | 11 | mistype
dafws | RP | NCSgMGI | 9 | lexicon
bswysra | PPr_RP | NCSgMAI | 9 | lexicon
wsykwn | PC_PA_VIdSg3MI | PC_NCSgMGI | 8 |
almtHdvyn | NADuMAD | NCPlmGD | 10 | Du-Plm
alr}ysyyn | NCDuAD | NCPlmGD | 10 |
ryma | NP | NCSgMAI | 9 | lexicon
Na}b | NTSgMGI | NTSgMNI | 7 |

Table 5-6: A sample of errors in the grammatized tests
Word | Truth tag | System tag | Type | Comments
wzyr | NTSgMGI | NTSgMNI | 7 |
wEql | PC_NP | PC_NCSgMGI | 9 | lexicon
mtHdvyn | NAPlmAI | NCSgMGI | (7,2) |
w>hm | PC_NASgMGI | PC_NCSgMAI | (3,7) |
myna | RF | NCSgMAI | 9 | lexicon
>n | PA | Pst | 12 |
w>n | PC_PA | PC_Pst | 12 |
mst$ar | NTSgMHNI | NTSgMGI | 11 | mistype
waDHp | NASgFGI | NCSgFNI | (7,3) |
tktml | VISg3MI | VISg3FI | 13 | gender
aldktwrp | NTSgFNI | NASgFGD | (1,7) |
>Hyana | NAPlbMAI | NCSgMAI | (2,3) |
>bwabha | NCSgMGI_NPrPSg3F | NP | 8 | starts with >bw

Table 5-6: A sample of errors in the grammatized tests (continued)

Error type | Count | %
7 | 119 | 42.50
2 | 30 | 10.71
3 | 29 | 10.36
9 | 27 | 9.64
6 | 24 | 8.57
8 | 15 | 5.36
12 | 15 | 5.36
4 | 9 | 3.21
1 | 7 | 2.50
10 | 2 | 0.71
5 | 1 | 0.36
11 | 1 | 0.36
13 | 1 | 0.36
Total | 280 | 100

Table 5-7: Percentage of each error type in the grammatized tests

Word | Truth tag | System tag | Type
wbmwazap | PC_PPr_NCSgFI | PC_NCSgFI | 8
qryp | NCSgFI | NASgFD | 3,4
mTar | NCSgMI | NCPlbMI | 2
53 | NCSgMI | Rnu | 9

Table 5-8: A sample of errors in the ungrammatized tests
Word | Truth tag | System tag | Type
kmHwr | PPr_NCSgMI | NCSgMI | 8
bSnaEp | PPr_NCSgFI | PPr_NCSgMD | 4,13
stqwm | PA_VISg3F | PA_VISg3M | 13
Tyran | NCSgMI | NP | 14
kEaml | PPr_NCSgMI | NASgMI | 8
rqmyn | NCDuMI | NCSgMI | 10
wtzyd | PC_VISg3F | PC_PA | 8
>rbaHha | NCPlbMI_NPrPSg3F | NCPlbFI_NPrPSg3F | 13
bmtxSSyn | PPr_NCPlMI | PPr_NCSgMI | 5
mkantha | NCSgFI_NPrPSg3F | NCPlfI_NPrPSg3F | 5
kmrkz | PPr_NCSgMI | NCSgMI | 8
wttkaml | PC_VISg3F | PC_NCSgMI | 6
bd> | VPSg3M | PPr_NCSgMI | 9
wtm | PC_VPSg3M | PC_NCSgMI | 6
tkml | VISg3F | NCSgMI | 6
wtEzz | PC_VISg3F | PC_NCSgMI | 6
qryp | NCSgFI | NCSgFD | 4
ykml | VISg3F | VISg3M | 13
sahm | VPSg3M | NCSgMI | 6
t$ark | VISg3F | VPSg3M | 13
alxarTp | NCSgFD | NCSgFI | 4
vany | NNuORSgMI | NP | 9
mltqY | NCSgMI | NASgMI | 3
alTa}rat | NCPlfI | NCPlfD | 11
wqTE | PC_NCPlbMI | PC_NCSgMI | 2
bal>mm | PPr_NCPlbFD | PPr_NCPlbMD | 13
mEZm | NASgMI | NCSgMI | 3
qd | Pdt | Pcr | 15
tDr | VISg3F | NASgMI | 6
wtDEf | PC_VISg3F | PC_NCSgMI | 6
<mdadat | NCPlfI | NCPlfD | 4

Table 5-8: A sample of errors in the ungrammatized tests (continued)

5.3 Discussion
Most of the errors can be categorized as follows:
1. Errors in the case of the word are the most frequent.
2. Unknown proper nouns (of people and places) cannot be guessed; only a few rules may lead to recognizing a proper noun.
3. The distinction between sound masculine plural and dual nouns is not easy for unknown nouns in the genitive and accusative cases.
4. Some forms of the broken plural are intermixed with other noun forms, and are not always easily distinguished since the processed text is not vocalized.
These notes can be drawn from Table 5-7, where it is easily noticed that grammar (case) contributes the highest portion of the errors (almost half of them). Then comes the broken plural problem, which accounts for about 10% of the errors, then the distinction between adjectives and nouns, also close to 10%. After that comes the problem of proper names (names of people, cities, countries, etc.), at almost 10%, and the problem of past tense verbs, at about 9%. Composite tags and adverbial particles contribute about 5% each, and the remaining error types contribute insignificantly to the overall error percentage.
Each of the error types contributing heavily to the overall percentage is justified and expected, although their order and exact rates were not anticipated before the test. We think the following factors were the leading ones:
1. The grammatical errors are partially due to the fact that some of the tags do not reflect the case of the word, which makes it hard for the learner to infer why the following word is given its tag; examples are proper nouns, relative pronouns (أسماء الموصول), and demonstrative pronouns (أسماء الإشارة). Adding case information to these tags is expected to help solve this problem, but would drastically increase the already large tagset, a step we preferred to avoid for now, though it is a proper consideration for future work. It is worth mentioning that most of the words erroneously tagged for this reason are otherwise correctly tagged (i.e. the information about category, number, gender, and definiteness is correct).
2.
The size of the corpus affected the accuracy of the results; in fact the error rate was inflated by two further factors: first, the corpus had to be split into three portions to perform cross validation, and second, the Brill tagger splits the training corpus again into two halves, one for deriving lexical rules and the other for deriving contextual rules. So, starting with a corpus of about 38,000 words, each test is done with about 25,000 words for training and 13,000 words for evaluation, and the training part is divided into two parts, each of about 12,500
words, for lexical and contextual learning respectively. Had we had a ready-made corpus to work with, matters would have been different and we are confident better results would have been obtained. This is supported by three separate cross-validation experiments in which the training corpus was slightly enlarged, by about 6,000 words, leading to about a 2% increase in the accuracy of the system, as shown in Table 5-3; this value does not look very large, but it at least indicates the trend.

3. Lack of vocalization also makes it hard to distinguish between some forms of the past tense verbs, and between them and some nouns. In this case the accuracy of tagging relies primarily on the statistical information captured in the lexicon for known words, and on context for unknown words. It should be remembered, however, that lack of vocalization is not in itself a disadvantage of the corpus; rather it is an advantage, for the following reasons:
a. The input text to the tagger is rarely expected to be vocalized, since vocalization is not common in most MSA writing.
b. Vocalization puts an extra burden on the user of the system.
c. Getting good results in spite of the lack of vocalization is a credit to the system, and a sign of overcoming the problem of ambiguity without relying on the user to disambiguate words by vocalization.

5.4 Evaluation

Compared with other reported results, the results we obtained may look low; for example, Diab et al. [10] reported an accuracy of 95.4%, and Khoja [17] reported 90% disambiguation accuracy. But on studying those works we notice that the first dealt with a very small tagset (24 tags) based on an English tagset, while the second did not specify precisely the size of the tagset; rather, she described, somewhat confusingly, three different levels of tagging with tagsets of 5, 35, and 131 tags, and said she used the smaller tagset for initial tagging. This means that the tagset she used contains a maximum of 35 tags.
Consulting her website [37], however, one concludes that the tagging is done using the 5-tag set. The other problem with her results is that she reports that “the statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous words” [17], but on checking the statistics she offers, we find that
ambiguous words comprise a maximum of 3% of the test corpora, and we do not know the performance accuracy on the rest of the corpora. So, taking into consideration the large and rich tagset we worked with, and the unavailability of a standard truth corpus tagged with the same tagset, we think the results obtained here are very promising, and are the best obtained for such a tagset.

5.5 Accomplishments

In this work we achieved the following:

• Revised the Khoja tagset to satisfy our needs and remove some of its limitations. It was expected that this revision would cause some drop in the accuracy of the tagger, and we were willing to accept that, but happily the accuracy of the new system turned out to be slightly higher.

• Prepared a manually tagged corpus of moderate size, tagged with a rich and comprehensive tagset that we consider the best available for Arabic, and which we recommend as the basis for a standard Arabic morphosyntactic tagset. The corpus we tagged contains about 38,000 words, far exceeding in size the only POS-tagged Arabic corpus we know of, a 1,700-word corpus prepared by Ms. Khoja. In fact we prepared several versions of this corpus:
  o One tagged with the original tagset.
  o A second tagged with a modified version of that tagset.
  o A third tagged with the modified tagset but excluding syntactic (grammar) features.
  o All of the above corpora are available in both Arabic characters and transliterated form.

• Adapted the Brill transformation-rule tagger to work with the above corpora, yielding the first complete tagger for Arabic, which gave, we believe, a very promising accuracy of 75–84% depending on the tagset used.

• Prepared, in parallel with the corpus, a tagged lexicon for Arabic, which should help researchers in NLP tasks for Arabic.
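The corpus partitioning used in the cross-validation experiments of this chapter (a corpus of about 38,000 words, one fold of three held out for evaluation, and the training portion halved again by the Brill tagger for lexical versus contextual rule learning) can be sketched as follows. The function name and exact rounding are our own illustrative assumptions, not the thesis implementation:

```python
# Illustrative sketch of the per-fold split sizes in three-fold cross
# validation, with the training portion halved again as the Brill tagger
# does (one half for lexical rules, the other for contextual rules).

def split_sizes(corpus_size, folds=3):
    """Return (train, test, lexical, contextual) word counts per fold."""
    test = corpus_size // folds        # one fold held out for evaluation
    train = corpus_size - test         # remaining folds used for training
    lexical = train // 2               # half of training: lexical rules
    contextual = train - lexical       # other half: contextual rules
    return train, test, lexical, contextual

train, test, lex, ctx = split_sizes(38_000)
print(train, test, lex, ctx)  # 25334 12666 12667 12667
```

These integer splits round to roughly the 25,000 / 13,000 / 12,500 figures quoted above.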

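As a minimal sketch of the transformation-rule mechanism at the heart of the adapted Brill tagger: a contextual rule rewrites one tag to another when a triggering context holds (here, the tag of the previous word). The tag names and the example sentence are hypothetical illustrations; the thesis uses a much richer Arabic tagset.

```python
# Minimal sketch of applying one Brill-style contextual transformation
# rule: change tag `from_tag` to `to_tag` when the previous word carries
# `prev_tag`. Rules are applied left to right, so later positions see
# the already-updated tags, as in the Brill tagger.

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

# Hypothetical example: retag a NOUN that follows a NOUN as an ADJ.
sent = [("fi", "PREP"), ("kitab", "NOUN"), ("jamil", "NOUN")]
print(apply_rule(sent, "NOUN", "ADJ", "NOUN"))
# [('fi', 'PREP'), ('kitab', 'NOUN'), ('jamil', 'ADJ')]
```

Training consists of repeatedly choosing whichever candidate rule of this form most reduces errors against the truth corpus, which is why the size and quality of the tagged corpus matter so much to the results reported above.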