
Arabic tokenization and stemming

Published in: Technology, Business


Transcript

  • 1. Imam Mohammad Ibn Saud Islamic University, College of Computing and Information Science, Computer Sciences Department. Arabic Tokenization and Stemming. Prepared by: Al-Moammar, A., Al-Abdullah, H., and Al-Ajlan, N. Supervised by: Dr. Amal Al-Saif.
  • 2. Arabic Tokenization and Stemming
  • 3. Outline: Introduction. Tokenization: • Arabic Characteristics. • Methodology. • Result. Stemming: • Arabic Characteristics. • Methodology. • Results. Conclusion.
  • 4. Introduction: Arabic language. Tokenization. Stemming.
  • 6. Arabic Language Characteristics • Writing a letter in an ambiguous form causes orthographic problems. • Encliticization of a word ending with “ ” or “ ”. • Ambiguity results from decliticization of “ ” (“l”), “ ” (“A”), and “ ” (“Al”, “the”). Examples (word, encliticization of word): “their Friday”, “collect them”, “your level”.
  • 8. My Approach. Sample of Arabic tokenized text. The bigram equation used has two terms: P(wi | sj) is the probability of the ith word given the jth segmentation. P(sj | si-1) is the probability of the jth segmentation given the previous segmentation.
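The two probabilities named on this slide are the ingredients of a first-order (bigram) sequence model. The slide's own equation did not survive, so the sketch below is an assumption: it scores a sequence of segmentation labels as the product of a word-given-segmentation term and a segmentation-given-previous-segmentation term, and finds the best sequence with the Viterbi algorithm. The label names, the transliterated toy words, and every probability value are hypothetical.

```python
import math

def viterbi(words, states, emit, trans, start):
    """Return the most probable state sequence under a bigram model.

    emit[s][w]  ~ P(w | s), probability of a word given a segmentation label
    trans[p][s] ~ P(s | p), probability of a label given the previous label
    start[s]    ~ probability of a label at the first position
    """
    unseen = 1e-9  # hypothetical floor for words missing from the toy table
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (math.log(start[s]) + math.log(emit[s].get(words[0], unseen)), [s])
        for s in states
    }
    for w in words[1:]:
        new_best = {}
        for s in states:
            e = math.log(emit[s].get(w, unseen))
            score, path = max(
                ((best[p][0] + math.log(trans[p][s]) + e, best[p][1]) for p in states),
                key=lambda t: t[0],
            )
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Toy model (all values hypothetical): "split" = a clitic is separated,
# "whole" = the word is kept as a single token.
states = ["split", "whole"]
start = {"split": 0.5, "whole": 0.5}
trans = {"split": {"split": 0.3, "whole": 0.7},
         "whole": {"split": 0.4, "whole": 0.6}}
emit = {"split": {"wktb": 0.6, "ktab": 0.1},
        "whole": {"wktb": 0.2, "ktab": 0.7}}

labels = viterbi(["wktb", "ktab"], states, emit, trans, start)
```

Working in log space avoids numeric underflow on long sentences; a real system would estimate the tables from an annotated corpus rather than hard-code them.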
  • 10. Results. The results of the My Approach algorithm: • Bigrams were used on 45 files totaling 29,092 tokens. • The final accuracy was 98.83%. Result without statistical support: Recall 0.9877462, Accuracy 0.9802977, Precision 0.8617793, F-measure 0.920473.
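The reported F-measure can be checked against the reported precision and recall. The sketch below assumes the standard balanced F-measure (the harmonic mean of precision and recall); the slides do not state which variant was used, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values taken from the results row on the slide.
reported_precision = 0.8617793
reported_recall = 0.9877462
reported_f = 0.920473

computed_f = f_measure(reported_precision, reported_recall)
# The computed value agrees with the reported one to about five decimal places.
difference = abs(computed_f - reported_f)
```

The agreement also confirms the column order of the flattened table: the third number is precision and the first is recall.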
  • 12. Arabic Language Characteristics
  • 14. Methodology Root-based. Light Stemmer. N-Gram. Hybrid Method.
  • 15. Root-based Example of root-based stemmer
  • 16. Light Stemmer Removed morphemes by Light stemmers
  • 17. Light Stemmer Classification of Light8 stemmer
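A light stemmer strips a small set of frequent prefixes and suffixes without trying to recover the root. The affix tables shown on these slides did not survive, so the lists below are an assumption modeled on commonly cited light-stemmer affix lists, and the minimum-length rule is a simplification. This is a minimal sketch, not the Light8 procedure itself.

```python
# Hypothetical affix lists (assumption: not taken from the slides).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, longest match first.

    An affix is removed only if at least `min_len` characters remain,
    which guards against destroying short words.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

stem_with_article = light_stem("والكتاب")   # conjunction + article + noun
stem_with_plural = light_stem("المعلمون")   # article + noun + plural suffix
stem_short_word = light_stem("ولد")         # too short to strip the leading letter
```

Trying the longest affix first keeps a compound prefix such as the conjunction plus the definite article from being only partially removed.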
  • 18. N-gram: a statistical stemmer based on calculating a measure of similarity between a pair of words. N-gram techniques: • Digram. • Trigram.
  • 19. N-gram. N-gram techniques: • ( ) • Digram (N=2). • Trigram (N=3).
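  • 20. N-gram. The string similarity measure is calculated using Dice’s coefficient: S = 2C_wq / (A_w + B_q), where A_w and B_q are the numbers of unique n-grams in words w and q, and C_wq is the number of unique n-grams the two words share. Example: for “ ”, the similarity would be (2 * 4) / (10 + 5) = 0.533.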
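Dice's coefficient over digrams can be sketched as follows. The Arabic example words from the slide did not survive, so the word pair used here is an illustrative substitute, and the function names are assumptions rather than anything from the original deck.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Return the set of unique character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(w: str, q: str, n: int = 2) -> float:
    """Dice's coefficient: 2 * shared n-grams / (n-grams in w + n-grams in q)."""
    a, b = ngrams(w, n), ngrams(q, n)
    if not a and not b:
        return 0.0  # both words shorter than n: no evidence of similarity
    return 2 * len(a & b) / (len(a) + len(b))

# Illustrative pair: digrams {ni, ig, gh, ht} and {na, ac, ch, ht} share only "ht".
similarity = dice("night", "nacht")

# The arithmetic from the slide: 4 shared digrams, 10 and 5 unique digrams.
slide_example = round(2 * 4 / (10 + 5), 3)
```

An n-gram stemmer would then group words whose similarity exceeds a chosen threshold, treating each group as sharing a stem.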
  • 22. Hybrid Method. Incorporates three different techniques for Arabic stemming. The hybrid algorithm starts by constructing a root file containing more than 9,000 valid Arabic roots.
  • 23. Results
  • 24. Results. The hybrid algorithm was found to outperform the other stemming algorithms. The results show that using the hybrid stemmer improves the performance of some Arabic processing tasks. In Arabic text categorization, the average accuracies are: 74.41% for Khoja, 59.71% for light stemming, 48.17% for n-grams, and 82.33% for the hybrid stemmer.
  • 26. Conclusion
  • 27. Thanks