Imam Mohammad Ibn Saud Islamic University
College of Computing and Information Science
Computer Sciences Department
Arabic Tokenization and Stemming
Prepared by: Al-Moammar, A., Al-Abdullah, H., and Al-Ajlan, N.
Supervised by: Dr. Amal Al-Saif
Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthographic problems.
• Encliticization of a word ending with “ ” or “ ”:
• Ambiguity results from decliticization of “ ” “l”, “ ” “A”, and “ ” “Al” (“the”).
Word | Encliticization of word
Glosses: “their Friday”, “collect them”, “your level”
My Approach
Sample of Arabic tokenized text:
The bigram equation used is:
• P(w_i | s_j) is the probability of the i-th word given the j-th segmentation.
• P(s_j | s_{j-1}) is the probability of the j-th segmentation given the previous segmentation.
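The equation itself did not survive on this slide, so the following is only a minimal sketch of how a bigram model built from the two probabilities defined above is commonly applied: each word has several candidate segmentations, and a Viterbi search picks the sequence that maximizes the product of P(word | segmentation) and P(segmentation | previous segmentation). The function name, the `<s>` start symbol, the probability floor, the transliterated toy tokens, and all probability values are assumptions made for illustration, not the authors' implementation or data.

```python
import math

def best_segmentation(words, candidates, emit, trans, start="<s>"):
    """Viterbi search over per-word segmentation candidates.

    words:      list of surface tokens
    candidates: dict word -> list of candidate segmentations
    emit:       dict (word, seg) -> P(word | seg)
    trans:      dict (prev_seg, seg) -> P(seg | prev_seg)
    """
    floor = 1e-12  # assumed floor for unseen events (hypothetical smoothing)
    # best[seg] = (log-probability of the best path ending in seg, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for seg in candidates[w]:
            e = math.log(emit.get((w, seg), floor))
            new_best[seg] = max(
                (score + math.log(trans.get((prev, seg), floor)) + e, path + [seg])
                for prev, (score, path) in best.items()
            )
        best = new_best
    # Return the path with the highest total log-probability.
    return max(best.values())[1]

# Toy data (invented values, transliterated placeholders).
words = ["Alwld", "ktbhm"]
candidates = {"Alwld": ["Al+wld", "Alwld"], "ktbhm": ["ktb+hm", "ktbhm"]}
emit = {
    ("Alwld", "Al+wld"): 0.9, ("Alwld", "Alwld"): 0.1,
    ("ktbhm", "ktb+hm"): 0.6, ("ktbhm", "ktbhm"): 0.4,
}
trans = {
    ("<s>", "Al+wld"): 0.5, ("<s>", "Alwld"): 0.5,
    ("Al+wld", "ktb+hm"): 0.7, ("Al+wld", "ktbhm"): 0.3,
    ("Alwld", "ktb+hm"): 0.5, ("Alwld", "ktbhm"): 0.5,
}
result = best_segmentation(words, candidates, emit, trans)
print(result)
```

Working in log space avoids numeric underflow on long token sequences; in a real system the two probability tables would be estimated from a segmented training corpus.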
Results
The results of the My Approach algorithm:
• Bigrams were applied to 45 files with a total size of 29,092 tokens.
• The final accuracy was 98.83%.

                                     Recall     Accuracy   Precision  F-measure
Result without statistical support   0.9877462  0.9802977  0.8617793  0.920473
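As a quick consistency check on the table, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the standard balanced F-measure (harmonic mean of precision and recall), which the slide does not state explicitly; the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for the run without statistical support.
precision, recall = 0.8617793, 0.9877462
f1 = f_measure(precision, recall)
print(round(f1, 6))
```

Under that assumption the recomputed value agrees with the reported 0.920473 up to rounding, which supports reading the four columns in the order Recall, Accuracy, Precision, F-measure.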
Results
• The hybrid algorithm was found to outperform the other stemming algorithms.
• The results show that using the hybrid stemmer enhances the performance of some Arabic processing tasks.
• In Arabic text categorization, the average accuracies are: 74.41% for Khoja, 59.71% for light stemming, 48.17% for n-grams, and 82.33% for the hybrid stemmer.