Arabic tokenization and stemming
1. Imam Mohammad Ibn Saud Islamic University
College of Computing and Information Science
Computer Science Department
Arabic Tokenization and Stemming
Prepared by: Al-Moammar.A., Al-Abdullah.H., and Al-Ajlan.N
Supervised by: Dr. Amal Al-Saif.
6. Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthography problems.
• Encliticization of a word ending with “ ” or “ ” :
• Ambiguity results from decliticization of “ ” “l” “ ” “A” and “ ” “Al” “the”.
word Encliticization of word
“their Friday”
“collect them”
“Your level”
8. My Approach
Sample of Arabic tokenized text:
The bigram equation used is built from two probabilities:
P(wi | sj) is the probability of the ith word given the jth segmentation.
P(sj | sj-1) is the probability of the jth segmentation given the previous segmentation.
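The two probabilities defined above can be combined to rank candidate segmentations of a sentence. The following is a minimal sketch, assuming the usual bigram formulation in which a candidate's score is the product of P(w | s) and P(s | previous s) over all words. This may differ from the exact formula on the original slide. The function names, the "<s>" start marker, the smoothing floor, the transliterated tokens, and all probability values are hypothetical and chosen only for illustration.

```python
import itertools
import math


def score(words, segs, emit, trans, start="<s>", floor=1e-9):
    """Log-score of one candidate: sum of log P(w|s) + log P(s|prev s).

    `emit` maps (word, segmentation) to P(word | segmentation).
    `trans` maps (previous segmentation, segmentation) to the bigram probability.
    Unseen pairs fall back to a small floor value (an assumption, not from the source).
    """
    total = 0.0
    prev = start
    for w, s in zip(words, segs):
        total += math.log(emit.get((w, s), floor))
        total += math.log(trans.get((prev, s), floor))
        prev = s
    return total


def best_segmentation(words, options, emit, trans):
    """Pick the highest-scoring combination of per-word segmentation options."""
    candidates = itertools.product(*options)
    return max(candidates, key=lambda segs: score(words, segs, emit, trans))


# Toy example with made-up probabilities (transliterated tokens).
words = ["wktb", "AlktAb"]
options = [["w+ktb", "wktb"], ["Al+ktAb", "AlktAb"]]
emit = {
    ("wktb", "w+ktb"): 0.7, ("wktb", "wktb"): 0.3,
    ("AlktAb", "Al+ktAb"): 0.9, ("AlktAb", "AlktAb"): 0.1,
}
trans = {
    ("<s>", "w+ktb"): 0.6, ("<s>", "wktb"): 0.4,
    ("w+ktb", "Al+ktAb"): 0.8, ("w+ktb", "AlktAb"): 0.2,
    ("wktb", "Al+ktAb"): 0.5, ("wktb", "AlktAb"): 0.5,
}
best = best_segmentation(words, options, emit, trans)
print(best)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. Enumerating every combination is only practical for short inputs; a dynamic-programming search would be the natural replacement for longer sentences.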
10. Results
Results of the approach:
• Bigrams were applied to 45 files with a total size of 29,092 tokens.
• The final accuracy was 98.83%.
                                     Recall     Accuracy   Precision  F-measure
Result without statistical support   0.9877462  0.9802977  0.8617793  0.920473
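The F-measure in the table above is consistent with the standard harmonic mean of precision and recall. The short check below assumes the balanced F1 definition, 2PR / (P + R). The function name is hypothetical; the input values are the precision and recall reported in the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the run without statistical support.
f = f_measure(0.8617793, 0.9877462)
print(round(f, 6))
```

The computed value agrees with the reported F-measure to the precision shown in the table.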
22. Hybrid Method
The hybrid method incorporates three different techniques for Arabic stemming.
The hybrid algorithm starts by constructing a root file
containing more than 9,000 valid Arabic roots.
24. Results
The hybrid algorithm was found to outperform the other
stemmers.
The results show that using the hybrid stemmer
improves the performance of some Arabic language processing tasks.
In Arabic text categorization, the average accuracies are:
74.41% for Khoja, 59.71% for light stemming, 48.17% for
n-grams, and 82.33% for the hybrid stemmer.