Arabic tokenization and stemming

Imam Mohammad Ibn Saud
Islamic University
College of Computing and
Information Science
Computer Science Department
Prepared by:
Al-Moammar.A., Al-Abdullah.H., and Al-Ajlan.N
Arabic Tokenization and
Stemming
Supervised by:
Dr. Amal Al-Saif.

Outline
 Introduction
 Tokenization:
• Arabic Characteristics.
• Methodology.
• Result.
 Stemming:
• Methodology.
• Results.
 Conclusion.

Introduction
 Arabic language.
 Tokenization.
 Stemming.

Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthographic problems.
• Encliticization of a word ending with “ ” or “ ” :
• Ambiguity results from decliticization of "l", "A", and "Al" ("the").
word Encliticization of word
“their Friday”
“collect them”
“Your level”

My Approach
 Sample of Arabic tokenized text:
 The bigram equation used combines two probabilities:
P(w_i | s_j) is the probability of the i-th word given the j-th segmentation.
P(s_j | s_{j-1}) is the probability of the j-th segmentation given the previous segmentation.
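
As a minimal sketch, assuming the two probabilities are multiplied together across the sentence (a common bigram formulation, not necessarily the exact equation used in this approach), a candidate segmentation sequence can be scored as follows. The function name `best_segmentation`, the probability tables, and the transliterated toy values are all hypothetical.

```python
import math
from itertools import product


def best_segmentation(words, candidates, emit, trans, start="<s>"):
    """Pick the segmentation sequence with the highest bigram score.

    Score = sum over words of log(P(w_i | s_j) * P(s_j | s_{j-1})).
    words:      list of surface words
    candidates: dict word -> list of candidate segmentations
    emit:       dict (word, seg) -> P(word | seg)
    trans:      dict (prev_seg, seg) -> P(seg | prev_seg)
    All tables are hypothetical toy inputs, not data from the slides.
    """
    best_seq, best_score = None, -math.inf
    # Brute-force enumeration keeps the sketch short; Viterbi would scale better.
    for seq in product(*(candidates[w] for w in words)):
        score, prev = 0.0, start
        for w, s in zip(words, seq):
            p = emit.get((w, s), 0.0) * trans.get((prev, s), 0.0)
            if p == 0.0:
                score = -math.inf
                break
            score += math.log(p)
            prev = s
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq


# Toy example in transliteration: "wktb" read as "w+ktb" or as a single token.
words = ["wktb"]
candidates = {"wktb": ["w+ktb", "wktb"]}
emit = {("wktb", "w+ktb"): 1.0, ("wktb", "wktb"): 1.0}
trans = {("<s>", "w+ktb"): 0.7, ("<s>", "wktb"): 0.3}
print(best_segmentation(words, candidates, emit, trans))  # ('w+ktb',)
```

Log probabilities are summed instead of multiplying raw probabilities so that long sentences do not underflow.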

Results
Results of the My Approach algorithm:
• They used bigrams on 45 files totaling 29,092 tokens.
• The final accuracy was 98.83%.

                                    Recall     Accuracy   Precision  F-measure
Result without statistical support  0.9877462  0.9802977  0.8617793  0.920473
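
The F-measure in this row can be checked from the reported precision and recall. The sketch below assumes the balanced (F1) definition, the harmonic mean of precision and recall; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for "Result without statistical support".
precision = 0.8617793
recall = 0.9877462
print(round(f_measure(precision, recall), 4))  # 0.9205
```

Under that assumption the computed value agrees with the reported F-measure of 0.920473.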

Arabic Language Characteristics

Methodology
 Root-based.
 Light Stemmer.
 N-Gram.
 Hybrid Method.

Root-based
 Example of root-based stemmer

Light Stemmer
 Morphemes removed by light stemmers

Light Stemmer
 Classification of Light8 stemmer

N-gram
 Statistical stemmer based on calculating a measure of
similarity between a pair of words.
 N-gram techniques:
• Digram.
• Trigram.

N-gram
 N-gram techniques:
• Digram (N=2)
• Trigram (N=3)
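
To illustrate digrams and trigrams, here is a minimal sketch that splits a word into overlapping character n-grams. The helper name `char_ngrams` and the example word are assumptions chosen for illustration, not taken from the slides.

```python
def char_ngrams(word, n):
    """Return the contiguous character n-grams of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


word = "مدرسة"  # "school"; illustrative example word (an assumption)
print(char_ngrams(word, 2))  # digrams: 4 items for a 5-letter word
print(char_ngrams(word, 3))  # trigrams: 3 items for a 5-letter word
```

A word of length L yields L - N + 1 n-grams, so words shorter than N produce none.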

N-gram
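 String similarity is calculated using Dice's coefficient:
S = 2 C_wq / (A_w + B_q)
where A_w and B_q are the numbers of unique n-grams in the words w and q, and C_wq is the number of unique n-grams the two words share.
Example: for a pair of words with 10 and 5 unique n-grams, 4 of which are shared, the similarity would be:
2 * 4 / (10 + 5) = 0.533.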
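
The calculation can be sketched in a few lines, assuming similarity is computed over sets of unique digrams. The function names are illustrative, and the two Latin strings are hypothetical inputs built only to reproduce the counts 10, 5, and 4 from the example.

```python
def unique_digrams(word):
    """Set of unique character digrams in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(w, q):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(w), unique_digrams(q)
    if not a and not b:
        return 0.0  # no digrams in either word; treat as dissimilar
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical strings: 10 and 5 unique digrams, 4 of them shared.
w, q = "abcdefghijk", "abcdez"
print(round(dice_similarity(w, q), 3))  # 0.533
```

In an n-gram stemmer, a word would typically be compared against candidate roots or stems this way and matched to the most similar one.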

Hybrid Method
 Incorporates three different techniques for Arabic Stemming.
 The Hybrid algorithm starts with constructing the root file
containing more than 9,000 valid Arabic roots.

Results
 The hybrid algorithm was found to outperform the other
stemming algorithms.
 The results obtained show that using the hybrid stemmer
improves the performance of some Arabic processing tasks.
 In Arabic text categorization, the average accuracies are:
74.41% for Khoja, 59.71% for light stemming, 48.17% for
n-grams, and 82.33% for the hybrid stemmer.
