Fsmnlp presentation 02

Transcript of "Fsmnlp presentation 02"

  1. Handling Unknown Words in Arabic FST Morphology
     Khaled Shaalan and Mohammed Attia, Faculty of Engineering and IT, The British University in Dubai
     Presented by Younes Samih, Heinrich-Heine-Universität, Germany
  2. Bird's Eye View
     Problem
       • Out-of-vocabulary (OOV) words cause problems for morphological analysers, parsers, MT, etc.
       • Manually extending lexical databases is costly and time-consuming.
       • Given the large amount of data, manual extension of lexicons becomes practically impossible.
     Solution
       • Creating an automatic method for updating a lexical database
       • Integrating a machine learning method with a finite-state guesser to lemmatize unknown words
       • Weighting new words by relevance and importance
  3. Outline
       • Introduction
       • Morphological Guesser
       • Methodology
       • Testing and Evaluation
       • Conclusion
  4. Introduction
       • Why deal with unknown words?
       • Complexity of lemmatization in Arabic
       • Data used
  5. Introduction: Why deal with unknown words?
       • Language is always changing
           • New words appear
           • Old words disappear
       • Unknown words make up 29% of the word types in the corpus
       • Unknown (OOV) words always cause problems for:
           • Morphological analysers
           • Parsers
           • Machine translation and other applications
  6. Introduction: Complexity of lemmatization in Arabic
       • Lemmatization means reducing words to their base (canonical) forms
           • played -> play, studies -> study
           • went -> go, wives -> wife
       • New words in English appear in their base form 86% of the time (Lindén, 2008)
       • New words in Arabic appear in their base form 45% of the time
       • Arabic morphology is complex and semi-algorithmic: roots, patterns, inflections, clitics, etc.
  7. Introduction: Complexity of lemmatization in Arabic
     Possible concatenations in Arabic verbs:
       • Proclitics (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
       • Proclitics (Comp): ل li ‘to’; س sa ‘will’; ل la ‘then’
       • Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
       • Lemma: verb
       • Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
       • Enclitic (object pronoun): first person (2); second person (5); third person (5)
     Example: شكر šakara ‘to thank’ generates 2,552 valid forms.
  8. Introduction: Complexity of lemmatization in Arabic
     Possible concatenations in Arabic nouns:
       • Proclitics (conjunction/question article): conjunctions و wa ‘and’ or ف fa ‘then’; question word أ ᾽a ‘is it true that’
       • Proclitics (preposition): ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
       • Proclitics (definite article): ال al ‘the’
       • Lemma: noun stem
       • Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
       • Enclitic (genitive pronoun): first person (2); second person (5); third person (5)
     Example: معلم mu῾allim ‘teacher’ generates 519 valid forms.
  9. Introduction: Data used
       • A large-scale corpus of 1,089,111,204 words
           • 85% from the Arabic Gigaword Fourth Edition
           • 15% from news articles crawled from the Al-Jazeera web site
  10. Morphological Guesser
     We develop a morphological guesser for Arabic unknown words that handles all possible
       • Clitics
       • Prefixes
       • Suffixes
       • And all relevant alteration operations, which include insertion, assimilation, and deletion
  11. Guesser
     (1) LEXC
         LEXICON Conjunctions
         وـ+conj:وـ      Prepositions;
         فـ+conj:فـ      Prepositions;
                         Prepositions;
         LEXICON Prepositions
         لـ+prep:لـ      Article;
         كـ+prep:كـ      Article;
         بـ+prep:بـ      Article;
                         Article;
         LEXICON Article
         الـ+defArt      Nouns;
         الـ+defArt      Adjectives;
                         Nouns;
                         Adjectives;
         LEXICON Nouns
         +noun           GuessWords;
         ^ss^خادم^se^    FemMascduMascpl;
         ....
         LEXICON Adjectives
         +adj+fem        GuessWords;
         +adj+masc       GuessWords;
         ^ss^سعيد^se^+adj+masc    FemMascduFemduMascplFempl;
         ....
         LEXICON GuessWords
         ^ss^^GUESSNOUNSTEM^^se^  FemMascduFemduMascplFempl;
         ^ss^^GUESSNOUNSTEM^^se^  FemMascduFemduFempl;
         ^ss^^GUESSNOUNSTEM^^se^  FemMascduFemdu;
         ....
     (2) ALTERATION RULES
         a -> b || L _ R
     (3) XFST
         read regex < arb-Alphabet.txt
         define Alphabet
         define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
         substitute defined PossNounStem for "^GUESSNOUNSTEM^"
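The proclitic part of the guesser can be approximated in Python. This is only a rough sketch, not the authors' FST: it tries the optional conjunction, preposition and article prefixes in the order of the LEXC lexicons above and accepts any remainder of 2 to 24 Arabic letters as a candidate stem, mirroring `[Alphabet]^{2,24}`. The function name `guess_splits` and the letter range standing in for `arb-Alphabet.txt` are assumptions, and suffixes and alteration rules are left out.

```python
import itertools
import re

# Optional proclitics, in the order used by the LEXC continuation classes.
CONJUNCTIONS = ["", "و", "ف"]
PREPOSITIONS = ["", "ل", "ك", "ب"]
ARTICLES = ["", "ال"]

# Assumption: the basic Arabic letter block U+0621..U+064A stands in for
# the contents of arb-Alphabet.txt, which the slides do not show.
STEM = re.compile(r"[\u0621-\u064A]{2,24}")


def guess_splits(word):
    """Return every (conjunction, preposition, article, stem) split of word."""
    results = []
    for conj, prep, art in itertools.product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        if word.startswith(prefix):
            stem = word[len(prefix):]
            # Any string of 2 to 24 letters counts as a possible stem.
            if STEM.fullmatch(stem):
                results.append((conj, prep, art, stem))
    return results


if __name__ == "__main__":
    for split in guess_splits("والمسوقون"):
        print(split)
```

For the word والمسوقون this yields three segmentations (no clitics, conjunction only, conjunction plus article), which is why the guesser overgenerates and needs a disambiguation step.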
  12. Methodology
     We use a pipelined approach:
       • First: a context-sensitive machine learning (SVM) tool, MADA, is used to predict:
           • POS
           • Morpho-syntactic features of number, gender, person, tense, etc.
       • Second: the finite-state morphological guesser is used to produce all possible interpretations of each word and the suggested lemmas.
       • Third: the two outputs are matched against each other and the analysis on which they agree is selected.
  13. Methodology: Example
     والمسوِّقون wa-Al-musaw~iquwna “and-the-marketers”
     MADA output:
         form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d
     Finite-state guesser output:
         والمسوقون    +adjوالمسوق+Guess+masc+pl+nom@
         والمسوقون    +adjوالمسوقون+Guess+sg@
         والمسوقون    +nounوالمسوق+Guess+masc+pl+nom@
         والمسوقون    +nounوالمسوقون+Guess+sg@
         والمسوقون    و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@
         والمسوقون    و+conj@ال+defArt@+adjمسوقون+Guess+sg@
         والمسوقون    و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@    (correct analysis)
         والمسوقون    و+conj@ال+defArt@+nounمسوقون+Guess+sg@
         والمسوقون    و+conj@+adjالمسوق+Guess+masc+pl+nom@
         والمسوقون    و+conj@+adjالمسوقون+Guess+sg@
         والمسوقون    و+conj@+nounالمسوق+Guess+masc+pl+nom@
         والمسوقون    و+conj@+nounالمسوقون+Guess+sg@
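The matching step for this example can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the twelve guesser analyses are represented as plain dictionaries rather than parsed from FST output, and the agreement rules (same POS, conjunction present when `prc2` is not 0, article present when `prc0` is not 0, and the `NUMBER_MAP` from MADA number codes to guesser tags) are hypothetical.

```python
MADA_LINE = (
    "form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na "
    "pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d"
)

# Assumed mapping from MADA number values to the guesser's number tags.
NUMBER_MAP = {"s": "sg", "d": "dual", "p": "pl"}

# The three segmentations from the example: (has conjunction, has article,
# lemma when the ending is read as a plural suffix, lemma when it is not).
SEGMENTATIONS = [
    (False, False, "والمسوق", "والمسوقون"),
    (True, False, "المسوق", "المسوقون"),
    (True, True, "مسوق", "مسوقون"),
]

# Twelve candidates: 3 segmentations x 2 POS tags x 2 number readings.
CANDIDATES = [
    {"conj": conj, "article": art, "pos": pos, "lemma": lemma, "number": num}
    for conj, art, short, full in SEGMENTATIONS
    for pos in ("adj", "noun")
    for lemma, num in ((short, "pl"), (full, "sg"))
]


def parse_mada(line):
    """Turn a 'key:value key:value ...' line into a dictionary."""
    return dict(field.split(":", 1) for field in line.split())


def agrees(mada, cand):
    """True when a guesser candidate is compatible with the MADA features."""
    return (
        cand["pos"] == mada["pos"]
        and cand["conj"] == (mada["prc2"] != "0")
        and cand["article"] == (mada["prc0"] != "0")
        and cand["number"] == NUMBER_MAP.get(mada["num"])
    )


def select(mada_line, candidates):
    """Keep only the candidates on which both tools agree."""
    mada = parse_mada(mada_line)
    return [cand for cand in candidates if agrees(mada, cand)]


if __name__ == "__main__":
    print(select(MADA_LINE, CANDIDATES))
```

Under these assumed rules only one of the twelve candidates survives: the plural noun with lemma مسوق preceded by a conjunction and the definite article, which is the analysis marked as correct in the example.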
  14. Methodology: Results
       • Corpus size: 1,089,111,204 tokens, 7,348,173 types
       • Unknown types in the corpus: 2,116,180 (29%)
       • Correctly spelt types after spell checking: 208,188
       • Types with a frequency of 10 or more: 40,277
       • After lemmatization: 18,399 types
  15. Testing and Evaluation
     We create a gold standard of 1,310 words manually annotated for:
       • Gold lemma
       • Gold POS
       • Lexical relevance (include in a dictionary): yes or no

     Gold POS      Type count   Ratio
     noun_prop     584          45%
     noun          264          20%
     adj           255          19%
     verb          52           4%

     Among unknown words:
       • Proper nouns are the most common
       • Verbs are the least common
  16. Testing and Evaluation: Evaluating POS (accuracy)
       • POS tagging baseline (the most frequent tag, proper name, assigned to all unknown words): 45%
       • MADA POS tagging: 60%
       • Voted POS tagging: 69%. When a lemma receives a different POS tag with a higher frequency, we take the more frequent tag.
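One way to read the voting step is as a per-lemma majority vote over the tags assigned to its occurrences. The sketch below assumes that reading; the function name `vote_pos` and the input format (a list of lemma and tag pairs) are illustrative choices, not taken from the slides.

```python
from collections import Counter, defaultdict


def vote_pos(tagged):
    """Assign each lemma the POS tag it received most often.

    tagged: iterable of (lemma, pos) pairs, one per analysed word form.
    """
    counts = defaultdict(Counter)
    for lemma, pos in tagged:
        counts[lemma][pos] += 1
    # most_common(1) returns the single highest-count (tag, count) pair.
    return {lemma: tags.most_common(1)[0][0] for lemma, tags in counts.items()}


if __name__ == "__main__":
    sample = [("مسوق", "noun"), ("مسوق", "adj"), ("مسوق", "noun")]
    print(vote_pos(sample))
```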
  17. Testing and Evaluation: Evaluating Lemmatization (accuracy)
       • Lemma first-order baseline (new words appear in their base form): 45%
       • Pipelined lemmatization (first-order decision) with strict definite article ‘al’ matching: 54%
       • Pipelined lemmatization (first-order decision) ignoring definite article ‘al’ matching: 63%
  18. Testing and Evaluation: Evaluating Lemma Weighting
       • The weighting criteria aim to push lexicographically relevant words up the list and less interesting words down.
       • We aim to make the number of important words high in the top 100 and low in the bottom 100.

     Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor

     Good words                                  In top 100   In bottom 100
     relying on frequency alone (baseline)       63           50
     relying on number of sister forms * 800     87           28
     relying on POS factor                       58           30
     using combined criteria                     78           15
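The weighting formula translates directly into a small Python function. The parameter names are illustrative, and because the slides do not say how the POS factor is computed, it is passed in as a plain number; the sample values below are hypothetical.

```python
def word_weight(sister_forms, sister_frequency, pos_factor, form_bonus=800):
    """Word Weight = ((sister forms * 800) + sister frequencies) / 2 + POS factor.

    sister_forms: number of distinct inflected forms sharing the lemma
    sister_frequency: summed corpus frequency of those forms
    pos_factor: POS-dependent term (its values are not given in the slides)
    """
    return (sister_forms * form_bonus + sister_frequency) / 2 + pos_factor


if __name__ == "__main__":
    # Hypothetical lemmas: (number of sister forms, summed frequency, POS factor).
    lemmas = {"a": (3, 120, 50), "b": (1, 900, 0)}
    ranked = sorted(lemmas, key=lambda k: word_weight(*lemmas[k]), reverse=True)
    print(ranked)
```

With these made-up numbers, the lemma with three sister forms outranks the one with a single, more frequent form, which is the effect the factor of 800 is meant to have.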
  19. Conclusion
       • We develop a methodology for automatically extracting and lemmatizing unknown words in Arabic
       • We pipeline a finite-state guesser with a machine learning tool for lemmatization
       • We develop a weighting mechanism for predicting the relevance and importance of lemmas
       • Out of 2,116,180 unknown words, we create a lexicon of 18,399 lemmatized, POS-tagged and weighted entries.