Slides of the paper Arabic-SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural Heritage by Emad Mohamed & Zeeshan Sayyed at the 3rd Edition of the DATeCH2019 International Conference
1. Arabic-SOS
Segmenter, Stemmer, and Orthography Standardizer for the Arabic
Cultural Heritage
Emad Mohamed & Zeeshan Sayyed
May 2019
Emad Mohamed & Zeeshan Sayyed SOS: Segmenter, Stemmer, and Orthography Standardizer for the Arabic Cultural HeritageMay 9, 2019 1 / 47
2. Background
Emad Mohamed
Senior Lecturer, Research Group in Computational Linguistics,
University of Wolverhampton
Morphological Analysis, Syntactic Analysis, Computational Corpus
Linguistics, Language Resources
Zeeshan Ali Sayyed
PhD Candidate in Computer Science, Indiana University
Machine Learning, NLP, Parsing Morphologically-rich Languages
Research associate in the Arabic Cultural Analytics project at Doha
Institute for Graduate Studies, Qatar.
3. Roadmap
Segmentation
What is it, and why do we need it?
Data & Methods
Experiments & Results
Substandard Orthography
The Problem
Data & Methods
Experiments & Results
Effect of Substandard Orthography on Segmentation
Stemming as a by-product of segmentation
9. Why is segmentation important
Word2vec learns word vectors whose nearest neighbours are related words.
word2vec(book): book, books, novels, novel, manuscript, author,
fiction, essay, poem, poems
word2vec(ktAb): ktAb, AlktAb, wAlktAb, ktb, Alktb, llktAb, ktAby,
bAlktAb, fAlktAb, ktAbnA
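The contrast above can be made concrete with a small sketch. The segmentations below are written by hand for illustration (they are not output of any system), and the convention that prefixes end in "+" and suffixes start with "+" is an assumption of this example. The point is that the ten surface forms returned for ktAb reduce to two stems once clitics are split off.

```python
# Hand-written segmentations of the ten neighbours listed for "ktAb".
# Assumed convention: prefixes end with "+", suffixes start with "+".
SEGMENTED = {
    "ktAb": "ktAb",
    "AlktAb": "Al+ ktAb",
    "wAlktAb": "w+ Al+ ktAb",
    "ktb": "ktb",
    "Alktb": "Al+ ktb",
    "llktAb": "l+ l+ ktAb",
    "ktAby": "ktAb +y",
    "bAlktAb": "b+ Al+ ktAb",
    "fAlktAb": "f+ Al+ ktAb",
    "ktAbnA": "ktAb +nA",
}

# Vocabulary as seen by a model trained on unsegmented text.
surface_vocab = set(SEGMENTED)

# Stems that remain once prefixes and suffixes are split off.
stems = {
    seg
    for analysis in SEGMENTED.values()
    for seg in analysis.split()
    if not seg.endswith("+") and not seg.startswith("+")
}

print(len(surface_vocab), "surface forms ->", sorted(stems))
```

Without segmentation, each of these forms gets its own vector, so the nearest neighbours of a word are mostly its own inflected and cliticized variants.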
10. Why is segmentation important
Segmentation is a required step for, or significantly improves:
POS tagging
Parsing
Named Entity Recognition
Machine Translation
Lexical Analysis
...
11. Previous Work
Segmentation is a hot topic in Arabic NLP.
Many systems exist: MADA, AMIRA, MADAMIRA, FARASA
These systems handle Modern Standard Arabic or Colloquial Arabic
These systems fail on the Arabic cultural heritage. This is really
worrying given that Arabic is a continuum.
We focus on pre-MSA Arabic in this talk, but we are working
on a universal model for the Arabic language.
12. Previous Work
System      MSA Accuracy   CA Accuracy
MADAMIRA    98.3           94.3
FARASA      98.5           86
Table: Performance on Modern Standard Arabic (MSA) and on Classical Arabic (CA)
13. Our approach to segmentation
Data & Annotation
Experiments
Results
Problems and Solutions
14. Annotation
Randomly select a corpus from the Qur’an, Hadith, Islamic Law,
Islamic Philosophy, and the Al-Manar Magazine (1898-1935)
Segment it using a model built on the ATB (Arabic Treebank)
Select the most frequent n-grams and pass these to an annotator.
Repeat this iteratively to build a test set, training set, and dev set.
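The n-gram selection step could look roughly like the sketch below. It assumes word-level n-grams over automatically segmented tokens; the slides do not say whether the n-grams are over words or characters, and the function name and the toy sentence are hypothetical.

```python
from collections import Counter


def most_frequent_ngrams(tokens, n, k):
    """Return the k most frequent n-grams (as tuples) with their counts."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(k)


# Toy, already-segmented input in Buckwalter transliteration (illustrative).
tokens = "w+ qAl Al+ rjl w+ qAl Al+ wld".split()
top = most_frequent_ngrams(tokens, n=2, k=2)
print(top)
```

The selected n-grams would then go to the annotator for correction, and the corrected data would feed the next round.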
15.
Set       Source                  #Words
train 1   Al Manar                85 312
train 2   Al Manar + Classical    141 766
dev       Al Manar                23 786
test 1    Al Manar                24 005
test 2    Classical               5 299
Table: Statistics of the datasets used for the experiments
16. Gradient Boosting Machines
Sequential Ensemble Method.
Uses Regression Decision Trees.
Multiple Iterations
Each subsequent iteration focuses on those parts of the problem that
the previous iterations got wrong.
"There are only two Machine Learning algorithms: Gradient Boosting
& Neural Networks"
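The idea of a sequential ensemble in which each iteration corrects the previous ones can be sketched in a few lines. This is a toy illustration, not the CatBoost setup used in the experiments: it assumes a one-dimensional regression problem, squared-error loss, and depth-one regression trees (stumps), and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left)
        sse += sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def gradient_boost(xs, ys, rounds=50, lr=0.1):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Segmentation is a classification task, so a real system would use a classification loss; the residual-fitting loop is the same in spirit.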
18. Features for Segmentation
19. Experiments: Algorithms
Compared SVMs, CRFs, and GBMs
Gradient Boosting Machines produce the best results
XGBoost
CatBoost (Yandex)
LightGBM (Microsoft)
Settled on CatBoost: best results
20.
System                     Al Manar   Classical
CRF-Baseline               92.7       94.96
SOS (Manar)                97.18      97.17
SOS (Manar + Classical)    97.45      98.47
Table: Baseline segmentation accuracy (%) using CRF, compared with SOS
21.
System                   Accuracy
SOS-Manar                97.17
SOS-Manar + Classical    98.47
Mohamed (2018)           96.8
MADAMIRA                 94.7
SAPA                     86.47
Table: Comparison with other segmenters on the classical test set
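Accuracy gaps near the top of the scale are easier to read as relative error reduction. The sketch below is derived arithmetic on the figures in the table above, not a number reported in the slides; the function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Share of the baseline's errors removed by the system (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    system_err = 100.0 - system_acc
    return (baseline_err - system_err) / baseline_err


# Accuracies on the classical test set, copied from the table above.
classical = {
    "SOS-Manar + Classical": 98.47,
    "Mohamed (2018)": 96.8,
    "MADAMIRA": 94.7,
    "SAPA": 86.47,
}
best = classical["SOS-Manar + Classical"]
for name, acc in classical.items():
    if name != "SOS-Manar + Classical":
        print(f"{name}: {relative_error_reduction(acc, best):.1%} fewer errors")
```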
22. Feature Ranking
Feature              Value      Feature                  Value
focus                15.6501    prev word suffix         4.2443
next2letters         11.857     chr position             3.2133
prev2letters         8.8664     minus2                   3.1651
focus word prefix    7.8821     minus3                   2.7478
plus1                7.3599     plus4                    2.5857
focus word suffix    6.9752     plus5                    2.566
plus3                6.7646     following word prefix    2.5203
plus2                5.5329     minus4                   2.1905
minus1               4.7142     minus5                   1.1644
Table: Feature importances ranked by the model
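A per-character feature extractor using the feature names from the table might look like the sketch below. Only the names come from the slides; their exact definitions are assumptions here: offsets are taken within the focus word and padded with "_", and prefixes and suffixes are two characters long. The function name and parameters are hypothetical.

```python
PAD = "_"


def char_features(words, w_idx, c_idx, window=5, affix_len=2):
    """Features for the character at position c_idx of words[w_idx]."""
    word = words[w_idx]
    prev_word = words[w_idx - 1] if w_idx > 0 else ""
    next_word = words[w_idx + 1] if w_idx + 1 < len(words) else ""

    def at(offset):
        # Character at a relative offset inside the focus word, or padding.
        i = c_idx + offset
        return word[i] if 0 <= i < len(word) else PAD

    feats = {
        "focus": word[c_idx],
        "chr_position": c_idx,
        "prev2letters": at(-2) + at(-1),
        "next2letters": at(1) + at(2),
        "focus_word_prefix": word[:affix_len],
        "focus_word_suffix": word[-affix_len:],
        "prev_word_suffix": prev_word[-affix_len:] or PAD,
        "following_word_prefix": next_word[:affix_len] or PAD,
    }
    for k in range(1, window + 1):
        feats[f"minus{k}"] = at(-k)
        feats[f"plus{k}"] = at(k)
    return feats


words = ["wAlktAb", "jdyd"]
feats = char_features(words, w_idx=0, c_idx=1)  # the "A" in "wAlktAb"
print(feats)
```

Each character then becomes one training instance whose label says whether a segment boundary follows it.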
23. Takeaways
We have a good segmenter that achieves almost the same accuracy as
the MSA segmenters with less than one quarter of the training data,
even though that data contains some noise.
The context does not seem to help much as the most important
features are local.
Error analysis shows that ambiguity is the main culprit: most of the
ill-segmented words are ambiguous.
When we tried our segmenter on data available online, the
results were noticeably worse. The reason: Substandard
Orthography
24. Substandard Orthography
25. The different forms of hamza
26. The different forms of hamza
Part of the stem
Question word: Do, Does, Is, Has, etc.
27. The different forms of hamza
Part of the stem. Cannot be segmented.
28. The different forms of hamza
Part of the stem. Cannot be segmented.
Assimilated question word. Must be segmented.
29. The different forms of hamza
Part of the stem. Cannot be segmented.
Accusative marker. May be segmented.
30. Figure: Three forms of hamza forming minimal pairs
31. t vs. h
32. t vs. h
Part of the stem: Cannot be segmented
3rd Person singular pronoun. Must be segmented.
3rd Person singular possessive pronoun. Must be segmented.
33. t vs. h
Almost always a singular feminine marker. May be segmented
35. y vs. a
36. y vs. a
Part of the stem. Cannot be segmented
First person possessive pronoun. Must be segmented.
First person pronoun. Must be segmented.
Imperfective prefix. May be segmented
17 different functions in the Arabic Treebank.
37. y vs. a
Part of the stem. Cannot be segmented.
38. Standardizing the Orthography
Data
Methods
Evaluation
Effect of Orthography Standardization on Segmentation
39. Standardizing the Orthography: Data
Most available data are sub-standard
Formal publications: serious newspapers, magazines and books are
usually standard
The IslamWeb Library published over 1000 books, all of which are
rigorously checked and proofread.
We substandardize this data
We select a sub-corpus of 35,666,914 words
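One way to substandardize clean text and obtain training labels for free is sketched below, in Buckwalter transliteration. The mapping is an assumption made for illustration: it collapses the hamza-bearing alif forms onto bare alif, taa marbuta onto haa, and alif maqsura onto yaa, following the three confusion sets named in the earlier slides. The slides do not give the exact rules used, and the function names are hypothetical.

```python
# Assumed substandard spellings (Buckwalter): standard char -> substandard char.
SUBSTANDARD = {">": "A", "<": "A", "|": "A", "p": "h", "Y": "y"}
AMBIGUOUS = set(SUBSTANDARD.values())  # characters a standardizer must decide on


def substandardize(text):
    """Replace every standard form with its assumed substandard spelling."""
    return "".join(SUBSTANDARD.get(ch, ch) for ch in text)


def training_pairs(standard_text):
    """Return the noisy text and (position, noisy_char, gold_char) triples
    for every ambiguous character; the gold label is the original character."""
    noisy = substandardize(standard_text)
    pairs = [
        (i, n, g)
        for i, (n, g) in enumerate(zip(noisy, standard_text))
        if n in AMBIGUOUS
    ]
    return noisy, pairs


noisy, pairs = training_pairs(">kl Alwld AltfAHp")
print(noisy)
print(pairs)
```

In this toy sentence most ambiguous characters keep their spelling, which hints at how skewed the label distribution can be.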
40. Representing substandard orthography
41. Handling Substandard Orthography
We use the same set of features as we do in segmentation
42. Orthography Standardization Results
43. Effect of Standardization on Segmentation
44. Stemming
Stemming can be derived from segmentation through a rule-based
system
Remove all the affixes. Whatever remains is the stem
Theoretically, you need POS tagging to resolve some rare cases
of ambiguity
In practice, those cases are so rare that the numbers are not affected
Stemming is at least as accurate as segmentation
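The rule described above fits in a few lines. The sketch assumes a segmenter output in which segments are space-separated, prefixes end with "+", and suffixes start with "+"; that format is hypothetical and may differ from what the released tool emits.

```python
def stem(segmented_word):
    """Remove all affix segments; whatever remains is the stem."""
    segments = segmented_word.split()
    core = [s for s in segments if not s.endswith("+") and not s.startswith("+")]
    return "".join(core)


for analysis in ["w+ Al+ ktAb", "ktAb +nA", "ktb"]:
    print(analysis, "->", stem(analysis))
```

Because the rule only deletes segments, a word that is segmented correctly is also stemmed correctly.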
45. Long-standing Problems & Possible Solutions
Problem: Most of the errors in segmentation are ambiguous words
Solution: Widen the context to include n previous/following words; add more data (hard); synthetic data
Problem: Data imbalance in orthography standardization
Solution: Try several methods of under-sampling for the dominant class(es)
Problem: Substandard sometimes beats standardization
Solution: Still examining the problem
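As one concrete instance of under-sampling the dominant class, the sketch below implements plain random under-sampling, which cuts every class down to the size of the smallest one. It is only the simplest of the methods one might try, and the function name, seed handling, and toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, labels, seed=0):
    """Randomly keep as many items per class as the rarest class has."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    target = min(len(items) for items in by_label.values())
    sampled = []
    for y, items in by_label.items():
        sampled.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(sampled)
    return sampled


# Toy data: eight instances where "A" stays "A", two where it should be ">".
examples = list(range(10))
labels = ["A"] * 8 + [">"] * 2
balanced = undersample(examples, labels)
print(Counter(y for _, y in balanced))
```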
46. Ongoing & Future Research
Treat both segmentation and standardization as sequence-to-sequence
problems
Joint standardization and segmentation
Create artificial data for segmentation (successful first attempts)
Transfer learning with contextual embeddings
47. Thank You
GitHub Repo: https://github.com/zeeshansayyed/ArabicSOS
Email:
Zeeshan: zasayyed@indiana.edu
Emad: e.mohamed2@wlv.ac.uk
Thank you all for coming, and hope you found this useful.
Please feel free to ask, suggest, or criticise.
"There is no such thing as a stupid question, only stupid answers"