Imam Mohammad Ibn Saud
Islamic University
College of Computing and
Information Science
Computer sciences Department
Prepared by:
Al-Moammar.A., Al-Abdullah.H., and Al-Ajlan.N
Arabic Tokenization and
Stemming
Supervised by:
Dr. Amal Al-Saif.
Outline
 Introduction
 Tokenization:
• Arabic Characteristics.
• Methodology.
• Result.
 Stemming:
• Arabic Characteristics.
• Methodology.
• Results.
 Conclusion.
Introduction
 Arabic language.
 Tokenization.
 Stemming.
Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthographic problems.
• Encliticization of a word ending with “ ” or “ ”:
• Ambiguity results from decliticization of “ ” (“l”), “ ” (“A”), and “ ” (“Al”, “the”).
word Encliticization of word
“their Friday”
“collect them”
“Your level”
My Approach
 Sample of Arabic tokenized text:
 The bigram equation used is:
P(w_i | s_j) is the probability of the i-th word given the j-th segmentation.
P(s_j | s_{j-1}) is the probability of the j-th segmentation given the previous segmentation.
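A minimal sketch of how the two probabilities above could be combined to rank candidate segmentations. It assumes the usual bigram product form (each word's score is P(w_i | s_j) × P(s_j | s_{j-1}), accumulated in log space); the equation itself is not reproduced in the slide text, so this form is an assumption. The function names, the transliterated tokens, and every probability value below are hypothetical and serve only as an illustration.

```python
import math

def sequence_log_score(words, segs, emission, transition, start="<s>", floor=1e-9):
    """Sum log P(w_i | s_j) + log P(s_j | s_{j-1}) over one candidate sequence.

    `emission` maps (word, segmentation) to a probability and `transition`
    maps (previous segmentation, segmentation) to a probability.
    Unseen pairs fall back to a small floor value (an assumption).
    """
    score = 0.0
    prev = start
    for word, seg in zip(words, segs):
        score += math.log(emission.get((word, seg), floor))
        score += math.log(transition.get((prev, seg), floor))
        prev = seg
    return score

def best_segmentation(words, candidates, emission, transition):
    """Return the candidate segmentation sequence with the highest score."""
    return max(
        candidates,
        key=lambda segs: sequence_log_score(words, segs, emission, transition),
    )

# Hypothetical toy data (transliterated tokens, made-up probabilities).
words = ["wktAb", "jdyd"]
emission = {
    ("wktAb", "w+ktAb"): 0.6,
    ("wktAb", "wktAb"): 0.1,
    ("jdyd", "jdyd"): 0.9,
}
transition = {
    ("<s>", "w+ktAb"): 0.5,
    ("<s>", "wktAb"): 0.5,
    ("w+ktAb", "jdyd"): 0.4,
    ("wktAb", "jdyd"): 0.4,
}
candidates = [("w+ktAb", "jdyd"), ("wktAb", "jdyd")]
best = best_segmentation(words, candidates, emission, transition)
print(best)
```

With these toy values the sequence that splits off the conjunction scores higher, because its word-given-segmentation probability is larger while the transition terms are equal.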
Results
The results of the My Approach algorithm:
• Bigrams were applied to 45 files totaling 29,092 tokens.
• The final accuracy was 98.83%.

                                     Recall     Accuracy   Precision  F-measure
Result without statistical support   0.9877462  0.9802977  0.8617793  0.920473
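As a consistency check on the table, the F-measure can be recomputed from the reported precision and recall. This sketch assumes the balanced F-measure (the harmonic mean of precision and recall), which the slides do not state explicitly; the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for the result without statistical support.
f = f_measure(precision=0.8617793, recall=0.9877462)
print(round(f, 4))  # 0.9205
```

The recomputed value agrees with the reported F-measure of 0.920473.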
Arabic Language Characteristics
Methodology
 Root-based.
 Light Stemmer.
 N-Gram.
 Hybrid Method.
Root-based
 Example of root-based stemmer
Light Stemmer
 Removed morphemes by Light stemmers
Light Stemmer
 Classification of Light8 stemmer
N-gram
 Statistical stemmer based on calculating a measure of
similarity between a pair of words.
 N-gram techniques:
• Digram.
• Trigram.
N-gram
 N-gram techniques:
• Digram (N=2)
• Trigram (N=3)
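The digram and trigram techniques can be illustrated by sliding a window of N characters over a word. This is a small sketch; the helper name and the sample word are chosen here for illustration and are not taken from the slides.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word`, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "مدرسة"  # illustrative five-letter word ("school")
digrams = char_ngrams(word, 2)   # N = 2
trigrams = char_ngrams(word, 3)  # N = 3
print(len(digrams), len(trigrams))  # 4 3
```

A word of length L yields L - N + 1 n-grams, so a five-letter word gives four digrams and three trigrams.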
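N-gram
 String similarity is calculated using Dice’s coefficient:
S = 2 C_wq / (A_w + B_q)
where A_w and B_q are the numbers of unique n-grams in words w and q, and C_wq is the number of unique n-grams the two words share.
Example: for a pair of words with 10 and 5 unique n-grams, 4 of them shared, the similarity
would be:
2 × 4 / (10 + 5) = 0.533.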
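Dice’s coefficient over character digrams can be sketched as follows. The function names and the sample word pair are illustrative choices, not taken from the slides; the sets of unique n-grams play the roles of A, B, and C in the formula.

```python
def unique_ngrams(word, n=2):
    """Return the set of unique character n-grams in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(w, q, n=2):
    """Dice's coefficient: 2 * shared / (unique in w + unique in q)."""
    a = unique_ngrams(w, n)
    b = unique_ngrams(q, n)
    if not a and not b:
        return 0.0  # both words are shorter than n; treated as dissimilar
    return 2 * len(a & b) / (len(a) + len(b))

# Illustrative pair: 3 and 4 unique digrams, 3 of them shared.
score = dice_similarity("كتاب", "كتابة")
print(round(score, 3))  # 0.857
```

Words whose similarity exceeds a chosen threshold can then be grouped together as variants of the same stem; the threshold itself is a tuning choice.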
Hybrid Method
 Incorporates three different techniques for Arabic stemming.
 The hybrid algorithm starts by constructing a root file
containing more than 9,000 valid Arabic roots.
Results
 The hybrid algorithm was found to outperform the other
stemmers.
 The results show that using the hybrid stemmer
improves performance on some Arabic processing tasks.
 In Arabic text categorization, the average accuracies are:
74.41% for Khoja, 59.71% for light stemming, 48.17% for
n-grams, and 82.33% for the hybrid stemmer.
Conclusion
Thanks
