Your SlideShare is downloading. ×
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 2013DOI : 10.5121/ijmit.2013.5201 1COMPA...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20132retrieval systems. Hence, a advance...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20133Aljlay [3] has performed experiment...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20134A. Stemming Advantages and Stemming...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20135A. Lovins StemmerThe Lovins Stemmer...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201361. The about-face of a plural to it...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201376. CHARACTERISTICS OF ARABIC LANGUA...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20138language that are different form th...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20139• Past tense do not get conflated w...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 2013108. CONCLUSIONS AND RECOMMENDATIONS...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201311[11] Farag A., Andreas N., "N-Gram...
International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201312- Appendix B -Versions of light st...
Upcoming SlideShare
Loading in...5
×

COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS

535

Published on

In the context of Information Retrieval, Arabic stemming algorithms have become a most research area of
information retrieval. Many researchers have developed algorithms to solve the problem of stemming.
Each researcher proposed his own methodology and measurements to test the performance and compute
the accuracy of his algorithm. Thus, nobody can make accurate comparisons between these algorithms.
Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper.
Then, the main Arabic language characteristics that are necessary to be mentioned before discussing
Arabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabic
stemming algorithm is still one of the most information retrieval challenges. This paper aims to compare
the most of the commonly used light stemmers in terms of affixes lists, algorithms, main ideas, and
information retrieval performance. The results show that the light10 stemmer outperformed the other
stemmers. Finally, recommendations for future research regarding the development of a standard Arabic
stemmer were presented.

Published in: Technology, Education
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
535
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Transcript of "COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS"

  1. 1. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 2013DOI : 10.5121/ijmit.2013.5201 1COMPARATIVE ANALYSIS OF ARABICSTEMMING ALGORITHMSDr. Mohammed A. OtairDepartment of Computer Information Systems, Amman Arab University, JordanOtair@aau.edu.joABSTRACTIn the context of Information Retrieval, Arabic stemming algorithms have become a most research area ofinformation retrieval. Many researchers have developed algorithms to solve the problem of stemming.Each researcher proposed his own methodology and measurements to test the performance and computethe accuracy of his algorithm. Thus, nobody can make accurate comparisons between these algorithms.Many generic conflation techniques and stemming algorithms are theoretically analyzed in this paper.Then, the main Arabic language characteristics that are necessary to be mentioned before discussingArabic stemmers are summarized. The evaluation of the algorithms in this paper shows that Arabicstemming algorithm is still one of the most information retrieval challenges. This paper aims to comparethe most of the commonly used light stemmers in terms of affixes lists, algorithms, main ideas, andinformation retrieval performance. The results show that the light10 stemmer outperformed the otherstemmers. Finally, recommendations for future research regarding the development of a standard Arabicstemmer were presented.KEYWORDSInformation Retrieval, Arabic stemmers, Morphological analysis, and Computational Linguistics.1. INTRODUCTIONInformation Retrieval is ultimately an issue of determining which documents in a corpus shouldbe retrieved to satisfy a users information need which is represented by a query, and containssearch term(s), in addition to some information such as the relatively importance. Thus, theretrieval decision is possessed by finding the similarity between the query terms with the indexterms appearing in the document. The decision of the retrieval process may be taking any shapeof the following result: binary: relevant, non-relevant, or partial. In the last case, it may involverating the degree of important relevancy that the document has to the submitted query. [15].Unfortunately, the words that appear in the documents and in queries at the same time often havemany morphological variations (Morphology means the internal structure of words). For instance,some terms such as "extracting" and "extraction" will not be recognized as similar or equivalentwithout any kind of processing. Conflation techniques are needed to equate word variationshaving similar semantic interpretations [15].Arabic stemming is a technique that aims to find the stem or lexical root for words in Arabicnatural language, by eliminating affixes stuck to its root, because an Arabic word can have a morecomplicated form than any other language with those affixes. Morphological variants of wordsaccept agnate semantic interpretations and can be advised as agnate for the purpose of advice
  2. 2. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20132retrieval systems. Hence, a advanced ambit amount stemming Algorithms or stemmers acceptbeen developed to abate a chat to its axis or root. Many researches were conducted to comparethese algorithms. A study by Sawalha [23] gives a comparison between three stemmingalgorithms: Khoja’s stemmer [17], Buckwalter’s morphological analyzer [7] and the Tri-literalroot extraction algorithm [5]. The Khoja stemmer obtains the accomplished accuratenessfollowed by the tri-literal basis abstraction algorithm, and assuredly the Buckwaltermorphological analyzer. Another study by Darwish [10] found that light stemming is one of themost superior in morphological analysis. Similar study were done by Larkey [20] to comparelight stemming with several different stemming algorithms based on morphological analysis:roots, stems, and some other criteria. Their results showed that the light stemmer passed the otheralgorithms in terms of performance (precision and recall) [12].As a summary, Arabic stemming algorithms can be classified, according to the desirable level ofanalysis: root-based approach [17] and stem-based approach [20]. Root-Based approach usesmorphological analysis to find the root of a given Arabic word. Many algorithms have beenproposed for this approach [2, 6, 14]. The aim of the Stem-Based approach is to eliminate themost frequent prefixes and suffixes [3, 8, 19, 20]. A lot of all those trials in this acreage were a setof rules to abbreviate the set of suffixes and prefixes, as well there is no audible account of thesestrippable affixes [24].This paper is conducted to do a comparative analysis for the most of the existing light stemmers.It compares stemmers in terms of the main ideas behind the development of the stemmers, theprefixes and suffixes that can remove, and as well as the affixes. The stemmers also compared interms of their information retrieval performance; precision and recall. A lot of accepted andacknowledged address acclimated for bearing stems of words is the ablaze stemming techniques.This paper is going to compare the stemmers in terms of:• The main idea behind the stemmer built,• The prefixes and suffixes they remove, and• The basis of choosing the affixes• The algorithm they use to remove the affixes.• Information retrieval performance; precision and recall.• Limitation of the stemmer2. LITERATURE REVIEWLiterature search showed the existence of many Arabic language stemmers. The strengths ofthose stemmers are varied and stemming errors are produced for every stemmer. None of theanalyzed stemmers showed perfect performance, and none of them has been adopted as astandard Arabic stemmer that fulfills the user’s information needs.Arabic stemming approaches have been analyzed, weaknesses, and strengths have been pointedout. Among the studied stemmers, there have been aggressive stemmers and weak stemmers.The Larkey [21] has showed performance effectiveness for a group of stemmers that includedlight 1, 2, 3, 8, and 10 as compared to normalization and raw or surface based. The result showedthat Light 10 is the best. In a previous research [20], Larkey showed the results of a comparisonof basic stemmers that included Light 8,3,2,1, Khoja, Khoja-u, normalized, and raw. The Light 8showed the best performance results.James [4] and Hull [16] both showed separately that Stemming performs better than nostemming. In addition, Darwish [9] has experimented with Alstem, Umass, and Modified Umassand his results showed that Alstem as the most effective.
  3. 3. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20133Aljlay [3] has performed experiments on surface-based, root-based, and light stemmer. He arrivedto the point that stemming via a root-based algorithm creates invalid conflation classes. He alsoshowed that light stemming approach outperforms the root-based stemmer.Mayfield et al. [20] have developed a system that merge surface-based and 6-gram based retrievalwhich performed remarkably well for Arabic.Which stemmer is the best? Which one should we use? And what is the standard Arabic stemmerto adopt? Many questions need to be answered regarding Arabic stemmers. Hence, in thefollowing sections some elaborations on conflation techniques and the stemming theory,performance, strength, errors, and advantages will be noted.3. CONFLATION TECHNIQUESConflation is an accepted appellation for all processes of amalgamation calm none identicalwords which accredit to the aforementioned arch concept [11]. In this context, conflation refers tomorphological conflation and morphological query expansion, where the related word formsresemble each other as character strings. Conflation algorithms can be disconnected into twocapital classes as Leah S. Larkey [21] states:1. Affix removal algorithms (stemming algorithms): which are language dependent; anddesigned to handle morphological word variants.2. Statistical techniques: which are (usually) language independent; and designed tomanipulate all types of variants involving n-grams, string similarity, morphologicalanalysis, and co-occurrence analysis.The domain of morphology can be classified into two subclasses, derivational and inflectional[18]. Derivational morphology could or could not impact meaning of a word. Although Englishlanguage has a comparatively weak morphological, other languages such as Arabic have strongermorphology (for example, a lot number of variants maybe given for a word word). In analyzewith the changes of Inflectional assay which describes accepted changes a chat undergoes as anafter effect of syntax (the plural and control anatomy for nouns, and the accomplished close andaccelerating anatomy for verbs are a lot of accepted in English) [18]. There is no effect on aword’s ‘part-of-speech’ for these changes (a noun still remains a noun after pluralisation) [3].4. STEMMINGStemming is a very essential technique for processing strong morphological languages such asArabic. Thus, the definition of stemming and related issues is required to create the necessarybasic knowledge before proceeding further with Arabic stemmers.Stemming is mainly affected by means of suffix lists which are lists of possible wordterminations, and this way were successfully applied on many different languages. However, it isless applicable to complexly morphological language such as Arabic, which require a furthereffort of morphological analysis, where absolutely morphological techniques are required thateliminate suffixes from words according to their internal structure.Thus, conflation is concerned with attempting to reverse the inflexion process by performing theinverse operation related to the basic inflexion rules [15]. Conflating English words faces severalproblems due to the existence of strong verbs that follow no set pattern for inflexion[15], e.g.throw, threw, thrown, and irregular verbs, e.g. go, went, and gone. The use of a lexicon(dictionary) is a must to avoid errors.
  4. 4. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20134A. Stemming Advantages and Stemming ErrorsStemming simplifies the searchers’ job by making the IR system satisfy their information need. Inincrease in recall is gained. However, conflation – basic word form generation with dictionarylookup – can improve precision. Stemming reduces the size of index terms and thus reduces thesize of the index (inverted file), too. As the size of index terms is reduced, the storage space andprocessing time are reduced as well.By the use of stemmers, Words in the collection must be organized into groups, multiple errorsare produced and may be used to compare and evaluate stemmers.• If the two words accord to the aforementioned semantic category, and are adapted to theaforementioned stem, again the conflation is correct. If they are adapted to altered stems,this is an understemming absurdity [15] (in added words, too abundant of a appellation isremoved).• If the two words accord to altered category, and abide audible afterwards stemming,again the stemmer has proceeded correctly. If they are adapted to the aforementionedstem, this is advised as an over-stemming [15] absurdity in added words, too little of aappellation is removed).B. Stemmer Performance and Strength measurementWhen stemming algorithms are used, the effectiveness of the retrieval system is enhanced if thesize of the retrieval set is taken into account as Hull (1996) noted [16]. There are many measuresto evaluate stemmer effectiveness and performance such as: Direct assessment, Precision andRecall, and counting both Stemming Errors.When the stemmer merges a few of the most highly related words together, it is called a ‘weak’or ‘light’ stemmer. A strong or heavy stemmer combines a much wider variety of forms. The setof metrics that measure stemmer strength as follows [13]:• Number of words per conflation class• Index Compression: the ad-measurements to which a accumulating of altered words isbargain (compressed) by stemming.• The Word Change Factor: This is an artlessly the ad-measurements of the words in asample that accept been afflicted in any way by the stemming process.• Mean Number of Characters Removed• Hamming Distance: The Hamming Distance takes two strings of according breadthand counts the amount of agnate positions area the characters are different. If thestrings are of altered lengths, we can use the Modified Hamming Distance.That was stemming theory and stemmers’ characteristics. In the next section, the genericstemming algorithms for English will be analyzed in order to check if they may fit for Arabic ornot.5. GENERIC STEMMING ALGORITHMSStemming algorithms are numerous; in this section, a review of the generic stemming algorithmswill be summarized as stated in [15] which they were developed mainly to English language andnot to Arabic. However, they are summarized here for the purpose of proving that those stemmersare not fit for Arabic and should not be considered in the final comparative analysis.
  5. 5. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20135A. Lovins StemmerThe Lovins Stemmer [15] is a single pass, context-sensitive, and longest-match Stemmerdeveloped in 1968. The approach is not complex enough to stem many. Lovins aphorism accountwas acquired by processing and belief a chat sample. The capital botheration with this action isthat it has been begin to be awful capricious and frequently fails to anatomy words from thestems, or matches the stems of like acceptation words.The Lovins Stemmer removes a best of one suffix from a word, consistent to its attributes asindividual canyon algorithm. Lovins Stemmer uses a almost abbreviate account of about 250altered suffixes, and eliminates the longest suffix affiliated to the word, ensuring that the axisafterwards the suffix has been removed is consistently at atomic 3 characters long. Then theterminating of the stem may be reformed (e.g., by un-doubling a final consonant if applicable), byindicating to a list of recoding transformations.B. Dawson StemmerThe Dawson developed the Stemmer in 1974; it is strongly based upon the Lovins Stemmer,prolongation the suffix rule list to about 1200 suffixes. It acquire the longest bout and individualcanyon attributes of Lovins, and exchanges the recoding rules, which were begin to be wildcat,application instead a constancy of the fractional analogous action as well authentic aural theLovins Stemmer.The agnate affair amid the Lovins and Dawson stemmers is that every catastrophe independentaural the account is associated with an amount that is acclimated as a basis to seek an account ofexceptions that accomplish assertive altitude aloft the abatement of the associated ending.The above aberration amid the Dawson and Lovins stemmers is the address acclimated to breakthe botheration of spelling exclusions. The Lovins stemmer employs the technique known asrecoding. This action is advised as allotment of the capital algorithm and performs n amount oftransformations based on the belletrist aural the stem. In comparing with the Dawson stemmeremploys fractional analogous which attempts to bout stems that are according aural assertivelimits.C. Paice/Husk StemmerThe Paice/Husk Stemmer was developed in the backward 1980s; the Stemmer has beenimplemented in Pascal, C, PERL and Java. When operating with its accepted rule-set, it is a ratherstrong or heavy stemmer. It is a simple accepted Stemmer; it removes the endings (suffixes)from a chat in a broad amount of steps.D. Porter StemmerThe Porter stemmer was first presented in 1980. The stemmer is an ambience acute suffixabatement algorithm. It is based on the abstraction that the suffixes in the English accent(approximately 1200) are mostly fabricated up of an aggregate of abate and simpler suffixes. It isa lot of broadly acclimated of all the stemmers and implementations in abounding languages areavailable. The stemmer is divided into a number of linear steps, five or six; a linear step Stemmer.Porter himself implemented the algorithm in Java, C and PERL. Porter developed an ImprovedPorter stemmer as well.E. Krovetz StemmerThe Krovetz Stemmer [18] was developed in 1993 as a light stemmer. The Krovetz Stemmerfiner and accurately removes inflectional suffixes in three steps:
  6. 6. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201361. The about-face of a plural to its individual anatomy (e.g. `-ies, `-es, `-s), the about-faceof accomplished to present tense (e.g. ‘-ed’), and the removal of ‘-ing’.2. The about-face action firstly removes the suffix, and again admitting a action ofanalytical in a concordance for any recoding (also getting acquainted of exceptions to theaccustomed recoding rules), allotment the axis to a word. The concordance lookup aswell performs any transformations that are appropriate due to spelling barring and as wellconverts any axis produced into an absolute chat that acceptation can be grasping.3. Due to the high accuracy of the stemmer, but weak strength, it is used as a form of pre-processing performed before the main stemming algorithm (such as the Paice/Husk orPorter Stemmer). This would provide partly stemmed ascribe for the stemmer that dealswith accepted situations accurately and effectively, and accordingly could abatestemming errors.F.Truncate (n) StemmerThis stemmer simply retains the first n letters of the word, where n is a appropriate integer,such as 4, 5 or 6. If the chat has beneath than n belletrist to alpha with, it is alternate unchanged.After truncation, words are compared to each other. If the retained parts are similar, they areconflated to the same group, otherwise they are not. There are some problems with this approach;manual construction of the grouped word collection is time-consuming and conflationgroups depend on topic of original text.G. N-grams (String Similarity)String-similarity approaches to conflation absorb the arrangement artful admeasurements ofaffinity amid an ascribe concern appellation and anniversary of the audible agreement in thedatabase. Those database agreements that accept an animated affinity to a concern appellation areagain displayed to the user for accessible admittance in the query. N-gram matching technique isone of the most common of these approaches [14]. An n-gram is a set of n successive characterstaken away from a word. The main idea of this approach is that, similar words will have anelevated proportion of n-grams in common. Idealistic values for n are 2 or 3, thesecorresponding to the use of diagrams or trigrams, respectively. For example, the word (computer)results in the generation of n-grams as shown in table (1).Table 1. Different Digrams and Trigrams for ‘computer’Where ‘*’ denotes a padding space.We approved to administer anniversary of the antecedent stemmers to Arabic but abominablynone of them seems to be acceptable candidate. Hence, it is required to elaborate further on theArabic language characteristics in order to understand and analyze the Arabic stemmers.
  7. 7. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201376. CHARACTERISTICS OF ARABIC LANGUAGEArabic accent is announced by over 300 actor people; as compared to English language, Arabiccharacteristics are assorted in abounding aspects. In [22] Nizar summarized most of the Arabiclanguage characteristics.A. Arabic script or alphabetsArabic is accounting from appropriate to left, consists of 28 belletrist and can be continued to 90by added shapes, marks, and vowels [3]. Arabic script and alphabets differ greatly whencompared to other language scripts and alphabets in the following areas:(1) Shape, marks, diacritics, Tatweel (Kashida), Style (font), and numerals(2) Distinctive letters ( ‫ب‬ ‫ت‬ ‫ث‬ ‫س‬ ‫ش‬ ), and none distinctive letters ( (‫اإأآىء‬ ‫)ؤ‬B. Arabic Phonology and spelling(1) 28 Consonants, 3 short vowels , 3 long vowels (‫ا‬ ‫و‬ ‫,)ي‬ and 2 diphthongs ( ‫علة‬ ‫حرفا‬ ‫إدغام‬‫معا‬ ‫)متصالن‬(2) Encoding is in CP1256 and UnicodeC. Morphology(1) Consists from bare root verb form that is triliteral, quadriliteral, or pentaliteral.(2) Derivational Morphology (lexeme = Root + Pattern)(3) Inflectional morphology (word = Lexeme + Features); features are(4) Noun specific: (conjunction, preposition, article, possession, plural, noun)Number: collective, plural, dual, singular.Gender: feminine, masculine, Neutral.Definiteness: definite, indefinite.Case: accusative, genitive, nominative.Possessive clitic.(5) Verb specific: (conjunction, tense, verb, subject, object)Aspect: imperative, perfective, imperfective.Voice: active, passive.Tense: past, present, future.Mood: jussive, indicative, subjunctive.Subject: person, number, gender.Object clitic.(6) Others: Single letter conjunctions and single letter prepositionsD. Morphological ambiguityDerivational ( ‫قاعدة‬ ) base or rule?, Inflectional ( ‫ﺗكتب‬ ) you write or she writes?, spellingambiguity due to missing diacritics and miss spelling ( ،‫ا‬ ،‫ة‬ ‫ي‬ ), and Combined ambiguity.For the purpose of advice retrieval this affluence of forms, lexical variability, and orthographic airheadedness after effect in a greater likelihood of conflict amid the anatomy of a chat in a concernand the forms begin in abstracts accordant to the query. Stemming is a tool that is used to combatthis vocabulary mismatch problem.7. ARABIC STEMMERSVariation in morphological properties among worlds languages is high, and stemmers arelanguage dependent as stated earlier. Hence, it is expected to see distinct stemmers for the Arabic
  8. 8. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20138language that are different form the English ones. The factors described in section (4) makeArabic very difficult to stem. Abu-Salem [1] documented that stems or roots are useful indexterms for Arabic. Their benefits were clearer than for English, as the Arabic language is a rootbased language.The affair of whether roots or stems are the adapted akin of assay for IR has been one aggravationthat has accustomed acceleration to added approaches to stemming for Arabic accent besidesAffix abatement and Statistical Stemming approaches as declared in section(3). Addedapproaches include manual dictionary construction, morphological analysis, and new statisticalmethods involving alongside corpora [3].However, this paper concentrates on the analysis of Arabic stemming algorithms only. In thissection, a summary of the major Arabic stemming techniques are analyzed and compared.A. Affix Removal(1) Normalization which functions as follows [20]:• Converting to windows Arabic encoding (CP1256)• Remove punctuations• Remove diacritics• Remove none letters• Replace ‫أ،إ،آ‬ with ‫ا‬• Replace ‫ى‬ with ‫ي‬• Replace ‫ة‬ with ‫ه‬This process is usually conducted as a pre-processing step before stemming. The main issue is toknow whether these steps are necessary or not are debatable. The authors strongly agree onnecessity of some of those steps but not all of them. Diacritics removal produces inflectionalambiguity. The three replacements steps affect the spelling ambiguity. As a result, precision andrecall are lowered.(2) Surface-based stemmers that comprise from at least two morphemes as stated in [3]:• A three consonantal root conveying semantics• A word pattern ‫الوزن‬ carrying syntactic informationConflation is based on the surface words that exist in the user query and the corpora documents.(3) Root-based stemmers the ultimate goal of a root-based stemmer is to extract the root of agiven surface word. The root extraction is performed after the suffixes and prefixes are removed.The remaining stem is then compared against matching patterns of the same length to extract theroot as shown in table (2) by dropping similar associated letters [3]. Weaknesses of root-basedstemmers are:• Increases word ambiguity• All possible patterns are not included• Weak double triliteral verbs conflation• Weak Irregular triliteral verbs conflation(4) Algorithmic Light Stemmers which remove a small set of prefixes and suffixes withoutdealing with infixes or recognize patterns, and find roots as noted in [3] and [20]. Theirdrawbacks are as follows:• No absolute abundant lists of strippable prefixes and/ or suffixes or algorithm had beenpublished.• It fails to conflate broken plurals for nouns• Adjective generally does not provide conflated with its singular form
  9. 9. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 20139• Past tense do not get conflated with their present formsMany versions exit for the light stemmer approach; all of which follow the same steps:Remove ‫,و‬ remove any of the definite article from Appendix A, and remove suffixes fromAppendix B.Among the light versions shown in Appendix A and B, it can be easily concluded thatAlstem is the strongest light stemmer and the Light 1 is weakest light stemmer.(5) Simple Stemmers are Light stemmers and removals of infixed vowels ‫ا،و،ي،ء‬ from variantpatterns as Larkey noted in [20]. Many Simple stemmer versions do exist and are shown in thetable (3).Affix removal algorithms are scaled from weak to strong stemmers. Mean average precision andsome other attributes for the major stemmers currently used are analyzed in Appendix C.Investigation of the Appendices A, B, and C shows that those stemmers are very difficult to becompared unless those measured are conducted on the same document collection.B. Manually Constructed DictionariesThey are manually congenital dictionaries of roots and stems for anniversary chat to be indexed.C. Morphological Analyzers (Stemmers)Which attempt to find roots or any number of possible roots for each word automatically by asoftware program.D. Statistical Methods involving Parallel CorporaStatistical stemmers, which accumulation chat variants application absorption techniques and Co-occurrence assay methods that are activated to automated morphological assay software systems.However, those techniques cannot be expected to perform well on the Arabic language due itsstrong morphology.Table 2. Extracting root from stem by comparing patternsTable 3. Versions of Simple Stemmers
  10. 10. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 2013108. CONCLUSIONS AND RECOMMENDATIONSIn this work, the definitions concerning stemming approaches, analysis of the generic stemmers,the morphological structure of the Arabic language, and the published Arabic stemmers havebeen discussed. The previous work by many authors proved greatly that their work wasautonomous, and no signee towards a standard Arabic stemmer was shown. Details of stemmers’algorithms were not available, suffixes and prefixes lists were not clear and some were notavailable as well. Every author develops his own algorithm and named it after himself or as he/she likes.Every columnist called an accumulation of stemmers for appraisal purposes, and the after-effectsof the appraisal are referred to that accumulation only. Diverse evaluation samples were used andmany stemmers were developed. This can be easily seen from the previous work mentioned insection (2).As a result of this study, the authors suggest the followings:A. A research should be conducted to analyze most of the effective published Arabic stemmers inorder to decide on a standard Arabic stemmer.B. The standard Arabic stemmer must take the Arabic diacritics into considerations, because thisaffects the semantics of the language. Hopefully, this will reduce stemming ambiguities anderrors.C. The future Arabic stemmer has to be very intelligent in order to deal with all kinds of wordvariants. This will probably use what the authors of this work call, a Hybrid Intelligent ArabicStemmer (HIAS).D. An agreed upon test collection has to be chosen and standardized so as all stemmers are testedagainst it.The author proposes that such a collection must be the Holy Quran.REFERENCES[1] Abu-Salem H., Al-Omari M. & Evens M., "Stemming methodologies over individual query words foran Arabic IR system". Journal of the American Society for Information Science, 50: 524 – 529, 1999.[2] Al-Fedaghi S. and Al-Anzi F., “A new algorithm to generate Arabic root-pattern forms”. Inproceedings of the 11th national Computer Conference and Exhibition. PP 391-400. March 1989.[3] Aljlayl M. and Frieder, O., "On Arabic Search: Improving the Retrieval Effectiveness via a LightStemming Approach". Proceedings of the eleventh international conference on Information andknowledge management, 340-347, 2002.[4] Allan J. and Kumaran G., "Details on stemming in the language modelling framework". In UMassAmherst CIIR Tech. Report, IR-289, 2003.[5] Al-Shalabi R., Kanaan G., & Al-Serhan H., "New approach for extracting Arabic roots". Inproceedings of The International Arab Conference on Information Technology, 2003.[6] Al-Shalabi R. and M. Evens. “A computational morphology system for Arabic”. In Workshop onComputational Approaches to Semitic Languages, COLING-ACL98. 1998.[7] Buckwalter T., "Buckwalter Arabic Morphological Analyzer Version 2.0". Linguistic DataConsortium (LDC) catalogue number LDC2004L02, ISBN 1-58563-324-0, 2004.[8] Chen A. and F. Gey. “Building an Arabic Stemmer for Information Retrieval”. In Proceedings of the11th Text Retrieval Conference (TREC 2002), National Institute of Standards and Technology, 2002.[9] Darwish K. , Douglas W. Oard, "CLIR Experiments at Maryland for TREC 2002: EvidenceCombination for Arabic-English Retrieval". In TREC, 2002.[10] Darwish K. and Oard D., “Term Selection for Searching Printed Arabic”, in Proceedings of the 25thACM SIGIR Conference, pp. 261–268, 2002.
  11. 11. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201311[11] Farag A., Andreas N., "N-Grams Conflation Approach for Arabic Text", SIGIR07 iNEWS07workshop, 2007.[12] Fouzi H., Aboubekeur H., and Abdulmalik S., "Comparative study of topic segmentation Algorithmsbased on lexical cohesion: Experimental results on Arabic language", The Arabian Journal forScience and Engineering, Volume 35, Number 2C, 2010.[13] Frakes W., Fox C. “Strength and similarity of affix removal stemming algorithms”. ACM SIGIRForum, Volume 37, No. 1., 26-30, 2003.[14] Freund, G. and Willett P., "Online Identification of word variants and arbitrary truncation searchingusing a string similarity measure". Information Technology: Research and Development 1: 177-187,1982.[15] Hoopr R. & Paice C., "The Lancaster Stemming Algorithm", 2005.http://www.comp.lancs.ac.uk/computing/research/stemming/index.htm.[16] Hull D., "Stemming algorithms: a case study for detailed evaluation", Journal of the AmericanSociety for Information Science, 47(1), 70-84, 1996.[17] Khoja S. and Garside R., "Stemming Arabic text". Technical report, Computing Department,Lancaster University, Lancaster, 1999.[18] Krovetz R., "Viewing Morphology as an Inference Process", ACM Press, p.p. 191202, 1993.[19] Larkey L., and M. E. Connell. “Arabic information retrieval at UMass in TREC-10”. Proceedings ofTREC 2001, Gaithersburg: NIST. 2001.[20] Larkey S., Ballesteros L., Margaret E. Connell, “Improving Stemming for Arabic InformationRetrieval: Light Stemming and Occurrence Analysis”, in Proc. of the 25th ACM InternationalConference on Research and Development in Information Retrieval (SIGIR’02), Tampere, Finland,pp.275–282, 2002.[21] Larkey S., Ballesteros L., Margaret E. Connell, "Light Stemming for Arabic Information Retrieval,Arabic Computational Morphology Text", Speech and Language Technology Volume 38, 2007, pp221-243.[22] Nizar H., "Introduction to Arabic Natural Language Processing", ACL’05 Tutorial University ofMichigan - 2005.[23] Sawalha M. and Atwell E., “Comparative Evaluation of Arabic Language Morphological Analyzersand Stemmers”, in Proceedings of COLING-ACL, 2008.[24] Syiam M., Fayed Z. & Habib M., "An Intelligent System for Arabic Text Categorization", IJICIS,Vol.6, No. 1, 2006.- Appendix A –Versions of light stemmers with prefixes
  12. 12. International Journal of Managing Information Technology (IJMIT) Vol.5, No.2, May 201312- Appendix B -Versions of light stemmers with suffixes- Appendix C -Comparison of Arabic Light StemmersAuthorMohammed A. Otair is an Associate Professor in Computer Information Systems,at Amman Arab University-Jordan. He received his B. Sc. in Computer Sciencefrom IU-Jordan and his M.Sc. and Ph.D in 2000, 2004, respectively, from theDepartment of Computer Information Systems-Arab Academy. His majorinterests are Mobile Computing, Databases, ANN.He has more than 30publications.

×