Outline:
• Introduction
• Methods
  • Constructing an Automatic Lexicon for Arabic Language
  • APT: Arabic Part-of-speech Tagger
  • The HMM-Based POS Tagger
    • The Stemmer
    • The POS Tagger
• Results
  • Constructing an Automatic Lexicon for Arabic Language
  • APT: Arabic Part-of-speech Tagger
  • The HMM-Based POS Tagger
• Conclusion
Introduction: Arabic language
• Arabic is the language of millions of people all over the world, so interest in the Arabic language is growing fast.
• Language processing tools for Arabic have yet to achieve the needed quality and robustness.
• The field has not been covered enough so far and remains fertile ground for research.
In the study of languages
• Corpus linguistics refers to a methodology that derives a set of theoretical and abstract rules governing a natural language.
• Corpus analyses, originally done by hand, are now performed automatically by algorithms in software applications.
Part-of-Speech Tagging (POS tagging or POST)
• POS tagging is part of the annotation method in corpus linguistics. It is the process of assigning a part of speech to each word in a sentence, based on the word itself as well as its context, that is, its relationship with adjacent and related words in a phrase, sentence, or paragraph.
• A simplified form of this is commonly associated with the identification of words as nouns, verbs, adjectives, adverbs, etc.
Arabic verbal structures are composed of three classes
• Noun: either a name or a word that describes a person, thing, or idea.
• Verb: a word that denotes an action and may be combined with some particles.
• Particle: this class includes everything that is neither a verb nor a noun, such as prepositions, coordination particles, and conjunctions.
APT: Arabic Part-of-speech Tagger
Methodology (previously):
• Word → search in lexicon → found?
• Yes: assign all possible tags.
• No: do not assign any tag.
APT: Arabic Part-of-speech Tagger (cont.)
Methodology (now):
• Word → search root in lexicon → more than one tag found, or no tag found?
• Yes: stemming, then assign a tag by affixes.
• No: tagging.
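The lookup-then-affix flow can be sketched in Python as follows. This is only one reading of the flowchart, not the authors' implementation: the lexicon entries, the affix cues, and the names `LEXICON`, `PREFIX_TAGS`, `SUFFIX_TAGS`, and `tag_word` are all hypothetical toy values chosen for illustration, and real Arabic affix rules are far richer.

```python
# Hypothetical toy lexicon: word or root -> set of possible tags.
LEXICON = {
    "كتب": {"NOUN", "VERB"},   # ambiguous entry
    "في": {"PARTICLE"},        # unambiguous entry
}

# Hypothetical affix cues (oversimplified on purpose).
PREFIX_TAGS = [("ال", "NOUN"), ("ي", "VERB")]
SUFFIX_TAGS = [("ون", "NOUN"), ("ات", "NOUN")]


def tag_word(word):
    """Return a single tag, or None when the word stays unresolved."""
    tags = LEXICON.get(word, set())
    if len(tags) == 1:
        # Exactly one tag in the lexicon: tag directly.
        return next(iter(tags))
    # Ambiguous or unknown word: fall back to affix cues.
    for prefix, tag in PREFIX_TAGS:
        if word.startswith(prefix) and len(word) > len(prefix):
            if not tags or tag in tags:
                return tag
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix) and len(word) > len(suffix):
            if not tags or tag in tags:
                return tag
    return None  # left for a later disambiguation step


if __name__ == "__main__":
    for w in ["في", "الكتاب", "يكتب", "كتب"]:
        print(w, tag_word(w))
```

Words that remain unresolved (here returned as `None`) would still need a separate disambiguation step, such as a statistical tagger.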
APT: Arabic Part-of-speech Tagger (cont.)
• The statistical tagger achieved an accuracy of around 90% when disambiguating ambiguous words with this tagset.
Constructing An Automatic Lexicon for Arabic Language
Methodology:
Constructing An Automatic Lexicon for Arabic Language (cont.)
• Errors of the stemming process were ignored when calculating the efficiency.
• The algorithm extracts only triliteral (three-letter) roots.
Results:
# words   # correct words   # incorrect words   % correct words   % incorrect words   % total
313       302               11                  96.50%            3.50%               96.50%
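The reported percentages follow from the word counts. A minimal check in Python, where the function name `percentages` is an assumption made for this example:

```python
def percentages(n_correct, n_incorrect):
    """Return (% correct, % incorrect) out of all evaluated words."""
    total = n_correct + n_incorrect
    return 100 * n_correct / total, 100 * n_incorrect / total


if __name__ == "__main__":
    # Counts taken from the results above: 302 correct, 11 incorrect.
    correct_pct, incorrect_pct = percentages(302, 11)
    print(f"{correct_pct:.2f}% correct, {incorrect_pct:.2f}% incorrect")
```

With 302 correct and 11 incorrect words out of 313, this gives about 96.49% and 3.51%, which matches the reported 96.50% and 3.50% up to rounding.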
The Tokenizer
• The purpose of the tokenization phase is to go through some pre-processing steps in order to prepare the input text for the remaining modules.
• The HMM POS tagger architecture includes a tokenizer that separates the punctuation marks from the words.
• Since punctuation marks also need to be tagged, the tokenizer tags them as PUNC and passes them to the POS tagger.
• The tokenizer then converts the input text into a list of words, using the space as a delimiter. The resulting list is passed to the stemmer.
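The tokenization steps can be sketched as follows. This is a minimal illustration, not the tagger's actual code: the punctuation set `PUNCT` and the function name `tokenize` are assumptions, and non-punctuation tokens are simply left untagged (`None`) for the later stages.

```python
import re

# Assumed punctuation set: common Latin marks plus the Arabic comma,
# semicolon, and question mark.
PUNCT = set(".,;:!?()\"،؛؟")
_PUNCT_RE = re.compile("([" + re.escape("".join(sorted(PUNCT))) + "])")


def tokenize(text):
    """Split text into (token, tag) pairs; punctuation is tagged PUNC."""
    # Step 1: separate punctuation marks from the words around them.
    spaced = _PUNCT_RE.sub(r" \1 ", text)
    # Step 2: split into a list of tokens using whitespace as delimiter.
    tokens = spaced.split()
    # Step 3: tag punctuation now; words are tagged by later modules.
    return [(tok, "PUNC" if tok in PUNCT else None) for tok in tokens]


if __name__ == "__main__":
    for token, tag in tokenize("ذهب الولد، ثم عاد."):
        print(token, tag)
```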
The Stemmer
• Stemming is the process of segmenting and separating affixes from a stem to produce prefix, stem, and suffix parts.
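As a rough illustration of splitting a word into prefix, stem, and suffix, here is a small Python sketch. The affix lists `PREFIXES` and `SUFFIXES`, the `min_stem` threshold, and the function name `segment` are hypothetical; the stemmer described here may use different rules.

```python
# Hypothetical, deliberately short affix lists for illustration only.
PREFIXES = ["وال", "بال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ها", "ة"]


def segment(word, min_stem=3):
    """Return (prefix, stem, suffix) using longest-match affix stripping."""
    prefix = ""
    for p in sorted(PREFIXES, key=len, reverse=True):
        # Strip a prefix only if enough letters remain for a stem.
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix = p
            break
    rest = word[len(prefix):]
    suffix = ""
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if rest.endswith(s) and len(rest) - len(s) >= min_stem:
            suffix = s
            break
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix


if __name__ == "__main__":
    print(segment("المعلمون"))  # a word with both a prefix and a suffix
    print(segment("كتب"))       # a bare three-letter word
```

The `min_stem` guard keeps short words from being stripped down to nothing, which is a common precaution in light stemmers.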
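The HMM-Based POS Tagger
• F-measure: $F = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  where $\text{Precision} = N_{\text{correct}} / N_{\text{response}}$ and $\text{Recall} = N_{\text{correct}} / N_{\text{key}}$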
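The F-measure definition translates directly into code. In this sketch the function name `f_measure` and the counts in the usage example are made up for illustration and are not results from the tagger:

```python
def f_measure(n_correct, n_response, n_key):
    """F-measure from the counts of correct, returned, and reference tags."""
    if n_response == 0 or n_key == 0:
        return 0.0
    precision = n_correct / n_response
    recall = n_correct / n_key
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: 8 correct tags out of 10 returned, 16 in the key.
    print(f_measure(n_correct=8, n_response=10, n_key=16))
```

When the tagger returns exactly one tag for every word in the key, the numbers of responses and key items are equal, so precision, recall, and F-measure all coincide with plain accuracy.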
The HMM-Based POS Tagger (cont.)
• The performance of the POS tagger decreased to 55% when it was used to tag non-stemmed text.
• Measured by F-measure, the HMM tagger achieved 97%.
Conclusion
• Part-of-speech (POS) tagging is a very important and basic application of natural language processing.
• In this paper we highlighted the importance of part-of-speech tagging in a wide range of NLP applications.
• We have presented the most important techniques used so far for part-of-speech tagging of Arabic text, drawn from several papers.