Handling Unknown Words in Arabic FST Morphology
Khaled Shaalan and Mohammed Attia
Faculty of Engineering and IT, The British University in Dubai
Presented by Younes Samih, Heinrich-Heine-Universität, Germany

Bird’s Eye View
Problem
• Out-of-vocabulary (OOV) words cause problems for morphological analysers, parsers, MT, etc.
• Manually extending lexical databases is costly and time-consuming.
• With large amounts of data, manually extending lexicons becomes practically impossible.
Solution
• Creating an automatic method for updating a lexical database
• Integrating a machine learning method with a finite-state guesser to lemmatize unknown words
• Weighting new words by relevance and importance

Outline
• Introduction
• Morphological Guesser
• Methodology
• Testing and Evaluation
• Conclusion

Introduction
• Why deal with unknown words?
• Complexity of lemmatization in Arabic
• Data used

Introduction
Why deal with unknown words?
• Language is always changing
  • New words appear
  • Old words disappear
• Unknown words make up 29% of the Gigaword corpus
• Unknown words (OOV) always cause problems for:
  • Morphological analysers
  • Parsers
  • Machine translation and other applications

Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base (canonical) forms
  • played -> play
  • studies -> study
  • went -> go
  • wives -> wife
• New words in English appear in their base form 86% of the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of the time
• Arabic morphology is complex and semi-algorithmic: roots, patterns, inflections, clitics, etc.

Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Verbs (number of forms in parentheses)
• Proclitics, conjunction/question article: conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitics, Comp: ل li ‘to’; س sa ‘will’; ل la ‘then’
• Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
• Lemma: verb
• Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
• Enclitic (object pronoun): first person (2); second person (5); third person (5)
Example: شكر šakara ‘to thank’ generates 2,552 valid forms.

Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Nouns (number of forms in parentheses)
• Proclitics, conjunction/question article: conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitics, preposition: ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
• Proclitics, definite article: ال al ‘the’
• Lemma: noun stem
• Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
• Enclitic (genitive pronoun): first person (2); second person (5); third person (5)
Example: معلم mu῾allim ‘teacher’ generates 519 valid forms.

Introduction
Data used
• A large-scale corpus of 1,089,111,204 words
  • 85% from the Arabic Gigaword Fourth Edition
  • 15% from news articles crawled from the Al-Jazeera web site

Morphological Guesser
We develop a morphological guesser for Arabic unknown words that handles all possible
• Clitics
• Prefixes
• Suffixes
• And all relevant alteration operations that include insertion, assimilation, and deletion

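To make the idea of a guesser concrete, here is a minimal Python sketch of affix stripping. It is a toy stand-in, not the finite-state guesser described on the slide: the function name `guess_lemmas`, the simplified romanization, and the suffix strings are all assumptions for illustration, the clitic inventory is only the noun proclitics listed earlier, and the alteration operations (insertion, assimilation, deletion) are not modelled at all.

```python
from itertools import product

# Toy inventories in a simplified romanization (assumptions, not the authors' lists).
CONJUNCTIONS = ["", "wa", "fa"]          # 'and', 'then'
PREPOSITIONS = ["", "bi", "ka", "li"]    # 'with', 'as', 'to'
ARTICLE = ["", "al"]                     # 'the'
SUFFIXES = ["", "At", "wn", "yn", "An", "p"]  # hypothetical romanized noun endings


def guess_lemmas(word, min_stem=3):
    """Return every (proclitics, stem, suffix) split licensed by the toy inventories.

    Like a guesser, this overgenerates: it proposes all splits and leaves
    disambiguation to a later step.
    """
    analyses = set()
    for conj, prep, art, suf in product(CONJUNCTIONS, PREPOSITIONS, ARTICLE, SUFFIXES):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if suf and not rest.endswith(suf):
            continue
        stem = rest[:len(rest) - len(suf)] if suf else rest
        if len(stem) >= min_stem:
            analyses.add((prefix, stem, suf))
    return sorted(analyses)


if __name__ == "__main__":
    # Hypothetical romanized input: wa + al + muallim
    for analysis in guess_lemmas("waalmuallim"):
        print(analysis)
```

The sketch returns the unsegmented reading alongside the segmented ones, which is why a guesser's output needs a separate disambiguation step.
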
Methodology
We use a pipelined approach
• First: a machine learning (SVM), context-sensitive tool (MADA) is used to predict:
  • POS
  • Morpho-syntactic features of number, gender, person, tense, etc.
• Second: the finite-state morphological guesser is used to produce all possible interpretations of each word and the suggested lemmas.
• Third: the two outputs are matched, and the analysis on which they agree is selected.

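The third step of the pipeline can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the dictionary layout, the feature names, and the function `select_agreed` are hypothetical and do not reflect MADA's or the guesser's actual output formats.

```python
def select_agreed(predicted, guesses):
    """Keep the guesser analyses that agree with the tagger's prediction.

    predicted: {"pos": str, "features": {name: value}} from the context-sensitive tagger.
    guesses:   list of {"lemma": str, "pos": str, "features": {...}} from the guesser.
    An analysis agrees when its POS matches and it carries every predicted feature value.
    """
    agreed = []
    for guess in guesses:
        if guess["pos"] != predicted["pos"]:
            continue
        if all(guess["features"].get(name) == value
               for name, value in predicted["features"].items()):
            agreed.append(guess)
    return agreed


if __name__ == "__main__":
    # Hypothetical outputs for one unknown word (romanized).
    predicted = {"pos": "noun", "features": {"number": "pl", "gender": "m"}}
    guesses = [
        {"lemma": "muallim", "pos": "noun", "features": {"number": "pl", "gender": "m"}},
        {"lemma": "muallimwn", "pos": "noun", "features": {"number": "sg", "gender": "m"}},
        {"lemma": "muallim", "pos": "verb", "features": {"number": "pl", "gender": "m"}},
    ]
    print(select_agreed(predicted, guesses))
```

Here the guesser is allowed to overgenerate, and the tagger's context-sensitive prediction prunes the candidates down to the compatible lemma.
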
Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173 types
• Unknown types in the corpus: 2,116,180 (29%)
• After spell checking, correctly spelt types: 208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types

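The figures above form a filtering funnel, and the arithmetic can be checked with a short Python sketch. The counts are taken from the slide; the helper names `percent` and `funnel` are illustrative. The first ratio comes out at about 28.8%, which the slide rounds to 29%.

```python
TOTAL_TYPES = 7_348_173  # types in the corpus (from the slide)

STAGES = [
    ("unknown types", 2_116_180),
    ("correctly spelt types", 208_188),
    ("types with frequency >= 10", 40_277),
    ("types after lemmatization", 18_399),
]


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


def funnel(total, stages):
    """For each stage, report its count and its share of the previous stage."""
    rows = []
    previous = total
    for label, count in stages:
        rows.append((label, count, percent(count, previous)))
        previous = count
    return rows


if __name__ == "__main__":
    for label, count, pct in funnel(TOTAL_TYPES, STAGES):
        print(f"{label}: {count:,} ({pct}% of previous stage)")
```
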
Testing and Evaluation
We create a gold standard of 1,310 words manually annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a dictionary): yes or no

Gold POS
Type        Count   Ratio
noun_prop   584     45%
noun        264     20%
adj         255     19%
verb        52      4%

Among unknown words,
- Proper nouns are the most common
- Verbs are the least common

Testing and Evaluation
Evaluating POS (accuracy)
• Baseline: assigning the most frequent tag (proper noun) to all unknown words: 45%
• MADA: 60%
• Voted POS tagging: 69%. When a lemma receives a different POS tag with a higher frequency, we take the more frequent tag.

    POS tagging             Accuracy
1   POS tagging baseline    45%
2   MADA POS tagging        60%
3   Voted POS tagging       69%

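The voting step can be read as a per-lemma majority vote over the tags assigned to that lemma's occurrences. The sketch below shows that reading in Python; it is an interpretation of the slide, and the function name `vote_pos` and the input layout are assumptions.

```python
from collections import Counter, defaultdict


def vote_pos(tagged):
    """Assign each lemma the POS tag it receives most often.

    tagged: iterable of (lemma, pos) pairs, one per analysed occurrence.
    Ties go to the tag seen first, which is how Counter.most_common orders equal counts.
    """
    counts = defaultdict(Counter)
    for lemma, pos in tagged:
        counts[lemma][pos] += 1
    return {lemma: tags.most_common(1)[0][0] for lemma, tags in counts.items()}


if __name__ == "__main__":
    # Hypothetical tagger output for two romanized lemmas.
    tagged = [
        ("muallim", "noun"), ("muallim", "noun"), ("muallim", "adj"),
        ("dubay", "noun_prop"),
    ]
    print(vote_pos(tagged))
```
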
Testing and Evaluation
Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form: 45%
• Pipelined, strict definite article ‘al’ matching: 54%
• Pipelined, ignoring definite article ‘al’ matching: 63%

    Lemmatization                                                                          Accuracy
1   Lemma first-order baseline                                                             45%
2   Pipelined lemmatization (first-order decision) with strict definite article matching   54%
3   Pipelined lemmatization (first-order decision) ignoring definite article matching      63%

Testing and Evaluation
Evaluating Lemma Weighting
• The weighting criteria aim to push lexicographically relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the top 100 and low in the bottom 100.

Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

Good words                                 In top 100   In bottom 100
Relying on frequency alone (baseline)      63           50
Relying on number of sister forms * 800    87           28
Relying on POS factor                      58           30
Using combined criteria                    78           15

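The weighting formula can be written out as a small Python function. This is a sketch under two assumptions that the slide does not spell out: "frequencies of sister forms" is taken to mean the summed corpus frequency of a lemma's sister forms, and the POS factor is a number supplied by the caller. The function names and the example values are hypothetical.

```python
def word_weight(n_sister_forms, sister_freq_sum, pos_factor):
    """((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor."""
    return (n_sister_forms * 800 + sister_freq_sum) / 2 + pos_factor


def rank_lemmas(entries):
    """Sort (lemma, n_sister_forms, sister_freq_sum, pos_factor) tuples, heaviest first."""
    return sorted(entries, key=lambda e: word_weight(e[1], e[2], e[3]), reverse=True)


if __name__ == "__main__":
    # Hypothetical entries: many sister forms outweigh raw frequency.
    entries = [
        ("lemma_a", 1, 900, 0),   # frequent, but only one sister form
        ("lemma_b", 4, 200, 10),  # less frequent, four sister forms
    ]
    for lemma, n, freq, pos in rank_lemmas(entries):
        print(lemma, word_weight(n, freq, pos))
```

With these made-up values, `lemma_b` ranks first even though it is less frequent, which matches the stated goal of pushing words with many attested sister forms up the list.
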
Conclusion
• We develop a methodology for automatically extracting and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine learning tool for lemmatization
• We develop a weighting mechanism for predicting the relevance and importance of lemmas
• Out of 2,116,180 unknown word types, we create a lexicon of 18,399 lemmatized, POS-tagged and weighted entries.