Floating dict presentation_04



  1. 1. The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words
      Mohammed Attia, Younes Samih, Khaled Shaalan and Josef van Genabith
      Faculty of Engineering and IT, The British University in Dubai
      Heinrich-Heine-Universität, Germany
      School of Computing, Dublin City University, Ireland
  2. 2. Outline
      • Introduction
      • Morphological Guesser
      • Methodology
      • Testing and Evaluation
      • Conclusion
  3. 3. Introduction
      • Why deal with unknown words?
      • Complexity of lemmatization in Arabic
      • Data used
  4. 4. Introduction
      A living language is just… living… dynamic… constantly changing… new words appear… old words die out… some words are seasonal… some are core
  5. 5. Introduction
  6. 6. Introduction
      Why deal with unknown words?
      • Language is always changing
        • New words appear
        • Old words disappear
        • Unknown words make up 29% of the word types in the Gigaword-based corpus
      • Unknown, or out-of-vocabulary (OOV), words always cause problems for:
        • Morphological analysers
        • Parsers
        • Machine translation and other applications
  7. 7. Review of Arabic lexicographic work
      Kitab al-Ain by al-Khalil bin Ahmed al-Farahidi (died 789)
      ▼ (refinement, expansion, organizational improvement)
      • Tahzib al-Lughah by Abu Mansour al-Azhari (died 980)
      • al-Muheet by al-Sahib bin Abbad (died 995)
      • Lisan al-Arab by ibn Manzour (died 1311)
      • al-Qamous al-Muheet by al-Fairouzabadi (died 1414)
      • Taj al-Arous by Muhammad Murtada al-Zabidi (died 1791)
      • Muheet al-Muheet (1869) by Butrus al-Bustani
      • al-Mujam al-Waseet (1960)
      • Buckwalter Arabic Morphological Analyzer (2002)
        • Size: 40,222 lemmas (including 2,034 proper nouns)
        • Includes many obsolete lexical items
        • Misses many modern words
  8. 8. Review of Arabic lexicographic work
      Buckwalter obsolete words: 8,400 obsolete words
      • رمل (sand): هيلان وعس ميعاس عثير
      • صحراء (desert): فيفاء فدفد قواء موماة متلف سبسب
      • سرج (saddle): حداجة مخلوفة
      • حمل (load): ظعينة حدج ظعون وقر
      • لجام (bridle): فدام كعم كعام أرنبة شكم غمامة
      • راكب (rider): حداء
      • جمل (camel): هجينة
      • رداء (gown): دفية بشتة
      • حذاء (shoes): مبذل بشمق زربول زربون صرمة قبقاب
  9. 9. Review of Arabic lexicographic work
      Not in dictionaries: about 10,000 need to be added
      سياسة (politics): أمننة شرعنة أفروعربية إثني إقصائي تسييس محاصصة جبهوي جمهوعسكرية العصبوية شخصنة أمركة عصرنة
      تكنولوجيا (technology):
      • رقمنة، أتمتة، مكننة
      • فيس بوك، تويتر، تغريدة
      • هاتف جوال تليفون محمول
      • لاب توب
      • الهواتف الذكية
      • حوسبة
      • بريد إلكتروني
      • دي في دي، سي دي
      • سبام، فيروس
      • ملتي ميديا
      • كمبيوتر لوحي، شاشة لمسية
      • شيفرة
      اقتصاد (economy): خصخصة ريعي يورو بورصة تعويم داو_جونز تضخم أسهم قيمة_دفترية مليار ترليون تجارة_إلكترونية
  10. 10. Review of Arabic lexicographic work
      Not in dictionaries: about 10,000 need to be added
      Politics: legalizing, Afro-Arab, ethnic, ostracizing, Americanize, modernize
      Technology:
      • Digitalizing, automating, mechanizing
      • Facebook, Twitter, tweet
      • Mobile phone
      • Laptop
      • Smartphone
      • Computerizing
      • Email
      • DVD, CD
      • Spam, virus
      • Multimedia
      • Tablet PC, touch screen
      Economy: privatization, Euro, inflation, billion, trillion, e-commerce
  11. 11. Introduction: Floating Dictionary
  12. 12. Introduction
      Complexity of lemmatization in Arabic
      • Lemmatization means reducing words to their base (canonical) forms
        • played -> play, studies -> study
        • went -> go, wives -> wife
      • New words in English appear in their base form 86% of the time (Lindén, 2008)
      • New words in Arabic appear in their base form 45% of the time
      • Arabic morphology is complex and semi-algorithmic: roots, patterns, inflections, clitics, etc.
  13. 13. Introduction
  14. 14. Introduction
      Complexity of lemmatization in Arabic
      Example: وسيشكرونه wasayashkurunahu = wa@sa@yashkuruna@hu 'and@will@thank[they]@him'
      Possible concatenations in Arabic verbs:
      • Proclitic, conjunction or question article: conjunctions و wa 'and' or ف fa 'then'; question word أ ᾽a 'is it true that'
      • Proclitic, Comp: ل li 'to'; س sa 'will'; ل la 'then'
      • Prefix (tense/mood, number/gender): imperfective tense (5); perfective tense (1); imperative (2)
      • Lemma: verb
      • Suffix (tense/mood, number/gender): imperfective tense (10); perfective tense (12); imperative (5)
      • Enclitic, object pronoun: first person (2); second person (5); third person (5)
      The verb شكر šakara 'to thank' generates 2,552 valid forms
  15. 15. Introduction
      Complexity of lemmatization in Arabic
      Example: وللمدرسين walilmudarrisiyna = wa@li@al@mudarrisiyna 'and@to@the@teachers'
      Possible concatenations in Arabic nouns:
      • Proclitic, conjunction or question article: conjunctions و wa 'and' or ف fa 'then'; question word أ ᾽a 'is it true that'
      • Proclitic, preposition: ب bi 'with', ك ka 'as' or ل li 'to'
      • Proclitic, definite article: ال al 'the'
      • Lemma: noun stem
      • Suffix (gender/number): masculine dual (4); feminine dual (4); masculine regular plural (4); feminine regular plural (1); feminine mark (1)
      • Enclitic, genitive pronoun: first person (2); second person (5); third person (5)
      The noun مدرس mudarris 'teacher' generates 519 valid forms
  16. 16. Introduction
      Difference between stemming and lemmatizing
      Example: وسيقولونها wa-sa-ya+quwl+uwna-ha 'and they will say it'
      • Stemming: quwl قول
      • Lemmatizing: qAla قال (via alteration rules)
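The contrast on this slide can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the segmentation string is copied from the slide rather than computed, and `STEM_TO_LEMMA` is a hypothetical one-entry lookup standing in for the alteration rules.

```python
# Toy sketch: segmentation is given by hand, exactly as written on the slide.
# '-' separates clitics, '+' separates inflectional affixes from the stem.
SEGMENTED = "wa-sa-ya+quwl+uwna-ha"

# Hypothetical stand-in for the alteration rules (one hard-coded entry).
STEM_TO_LEMMA = {"quwl": "qAla"}


def stem(segmented: str) -> str:
    """Strip clitics and affixes, leaving the bare stem."""
    inflected = next(part for part in segmented.split("-") if "+" in part)
    _prefix, stem_, _suffix = inflected.split("+")
    return stem_


def lemmatize(segmented: str) -> str:
    """Map the stem to its citation form; fall back to the stem itself."""
    bare = stem(segmented)
    return STEM_TO_LEMMA.get(bare, bare)


print(stem(SEGMENTED), lemmatize(SEGMENTED))
```

Stemming stops at the surface stem, while lemmatizing needs an extra mapping step to reach the dictionary form.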
  17. 17. Introduction
      Data used
      • A large-scale corpus of 1,089,111,204 words
        • 85% from the Arabic Gigaword Fourth Edition
        • 15% from news articles crawled from the Al-Jazeera web site
      If printed on paper, the corpus would stand more than twice the height of the Eiffel Tower: about 16,000 large books, or 640 meters of bookshelves.
      An average reader reads 200 wpm with 60% comprehension. You would need 11 years of reading 24/7 to get through the Gigaword corpus.
      Technical issues:
      • 20-30 days to analyze with MADA using 10 parallel sessions.
      • You need a machine with 256 GB of RAM to load a 3-, 4- or 5-gram language model of the Arabic Gigaword.
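The reading-time figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the full 1,089,111,204-word corpus, a constant 200 words per minute, and 365-day years; under those assumptions the result is a little over ten years, the same order of magnitude as the slide's figure.

```python
CORPUS_WORDS = 1_089_111_204   # corpus size from the slide
WORDS_PER_MINUTE = 200         # average reading speed from the slide
MINUTES_PER_YEAR = 60 * 24 * 365  # assumption: non-stop reading, 365-day year


def reading_years(words: int, wpm: int) -> float:
    """Years of uninterrupted reading needed to get through `words` words."""
    return words / wpm / MINUTES_PER_YEAR


years = reading_years(CORPUS_WORDS, WORDS_PER_MINUTE)
print(f"{years:.1f} years of non-stop reading")
```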
  18. 18. Morphological Guesser
      We develop a morphological guesser for Arabic unknown words that handles all possible
      • Clitics
      • Prefixes
      • Suffixes
      • And all relevant alteration operations, including insertion, assimilation, and deletion
  19. 19. Guesser
      (1) LEXC
      LEXICON Conjunctions
      وـ+conj:وـ Prepositions;
      فـ+conj:فـ Prepositions;
      Prepositions;
      LEXICON Prepositions
      لـ+prep:لـ Article;
      كـ+prep:كـ Article;
      بـ+prep:بـ Article;
      Article;
      LEXICON Article
      الـ+defArt Nouns;
      الـ+defArt Adjectives;
      Nouns;
      Adjectives;
      LEXICON Nouns
      +noun GuessWords;
      ^ss^خادم^se^ FemMascduMascpl;
      ....
      LEXICON Adjectives
      +adj+fem GuessWords;
      +adj+masc GuessWords;
      ^ss^سعيد^se^+adj+masc FemMascduFemduMascplFempl;
      ....
      LEXICON GuessWords
      ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduMascplFempl;
      ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemduFempl;
      ^ss^^GUESSNOUNSTEM^^se^ FemMascduFemdu;
      ….
      (2) Alteration rules
      a -> b || L _ R
      (3) XFST
      read regex < arb-Alphabet.txt
      define Alphabet
      define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
      substitute defined PossNounStem for "^GUESSNOUNSTEM^"
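The idea behind the guesser can be approximated in plain Python: try every chain of optional proclitics, then accept any remaining string of 2 to 24 Arabic letters as a possible stem, as `PossNounStem` does. This is a simplified sketch rather than the finite-state grammar itself. The two-entry `SUFFIXES` table, the Unicode letter range used for the alphabet, and the tuple output format are assumptions made for illustration, and alteration rules are ignored.

```python
import re
from itertools import product

# Proclitic slots, following the LEXC continuation classes; "" means "absent".
CONJUNCTIONS = ["", "و", "ف"]
PREPOSITIONS = ["", "ل", "ك", "ب"]
ARTICLES = ["", "ال"]

# Assumption: a toy suffix table (bare singular, masculine sound plural).
SUFFIXES = {"": "+sg", "ون": "+masc+pl+nom"}

# Mirrors [Alphabet]^{2,24}; the letter range is an assumption about the alphabet file.
STEM = re.compile(r"[\u0621-\u064A]{2,24}")


def guess(word: str) -> list[tuple[str, str, str, str, str, str]]:
    """Return every (conj, prep, article, pos, stem, tags) reading of `word`."""
    analyses = []
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, tags in SUFFIXES.items():
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if STEM.fullmatch(stem):
                for pos in ("noun", "adj"):
                    analyses.append((conj, prep, art, pos, stem, tags))
    return analyses


for analysis in guess("والمسوقون"):
    print(analysis)
```

Because nothing constrains the stem beyond its length and alphabet, the guesser deliberately overgenerates. Choosing among the readings is left to a later step.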
  20. 20. Methodology
      We use a pipeline-based approach
      • First: MADA, a context-sensitive machine learning (SVM) tool, is used to predict:
        • POS
        • Morpho-syntactic features of number, gender, person, tense, etc.
      • Second: the finite-state morphological guesser is used to produce all possible interpretations of each word, with suggested lemmas.
      • Third: the two outputs are matched against each other and the analysis they agree on is selected.
  21. 21. Methodology
  22. 22. Methodology
      Example
      والمسوقون wa-Al-musaw~iquwna "and-the-marketers"
      MADA output:
      form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d
      Finite-state guesser output for والمسوقون:
      +adjوالمسوق+Guess+masc+pl+nom@
      +adjوالمسوقون+Guess+sg@
      +nounوالمسوق+Guess+masc+pl+nom@
      +nounوالمسوقون+Guess+sg@
      و+conj@ال+defArt@+adjمسوق+Guess+masc+pl+nom@
      و+conj@ال+defArt@+adjمسوقون+Guess+sg@
      و+conj@ال+defArt@+nounمسوق+Guess+masc+pl+nom@ (correct analysis)
      و+conj@ال+defArt@+nounمسوقون+Guess+sg@
      و+conj@+adjالمسوق+Guess+masc+pl+nom@
      و+conj@+adjالمسوقون+Guess+sg@
      و+conj@+nounالمسوق+Guess+masc+pl+nom@
      و+conj@+nounالمسوقون+Guess+sg@
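The matching step for this example can be sketched as a filter: keep only the guesser readings whose POS, proclitics and number agree with what MADA predicted. The `Guess` record, the `agrees` function and the mapping from MADA's `num` values to `sg`/`pl` are hypothetical names and assumptions introduced here. The slide does not say how the two outputs are actually compared.

```python
from typing import NamedTuple

MADA_LINE = (
    "form:wAlmswqwn num:p gen:m per:na case:n asp:na mod:na vox:na "
    "pos:noun prc0:Al_det prc1:0 prc2:wa_conj prc3:0 enc0:0 stt:d"
)


class Guess(NamedTuple):
    """Hypothetical structured form of one guesser analysis."""
    conj: bool      # reading contains the conjunction proclitic
    article: bool   # reading contains the definite article proclitic
    pos: str        # "noun" or "adj"
    lemma: str      # suggested lemma
    number: str     # "sg" or "pl"


def parse_mada(line: str) -> dict[str, str]:
    """Turn 'key:value key:value ...' into a dictionary."""
    return dict(field.split(":", 1) for field in line.split())


# The twelve readings from the slide, rebuilt as records.
CANDIDATES = [
    Guess(conj, art, pos, stem, num)
    for conj, art, stems in [
        (False, False, {"pl": "والمسوق", "sg": "والمسوقون"}),
        (True, True, {"pl": "مسوق", "sg": "مسوقون"}),
        (True, False, {"pl": "المسوق", "sg": "المسوقون"}),
    ]
    for pos in ("adj", "noun")
    for num, stem in stems.items()
]


def agrees(guess: Guess, mada: dict[str, str]) -> bool:
    """True when a guesser reading is consistent with the MADA features."""
    number = {"s": "sg", "p": "pl"}.get(mada["num"])  # assumed value mapping
    return (
        guess.pos == mada["pos"]
        and guess.conj == (mada["prc2"] == "wa_conj")
        and guess.article == (mada["prc0"] == "Al_det")
        and guess.number == number
    )


mada = parse_mada(MADA_LINE)
selected = [g for g in CANDIDATES if agrees(g, mada)]
print(selected)
```

For this word the filter leaves a single reading, the plural noun with conjunction and article, which is the one the slide marks as correct.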
  23. 23. Methodology
      Results
      • Corpus size is 1,089,111,204 tokens, 7,348,173 types
      • Unknown types in the corpus: 2,116,180 (29%)
      • After spell checking, correctly spelt types: 208,188
      • Types with a frequency of 10 or more: 40,277
      • After lemmatization: 18,399 types
  24. 24. Testing and Evaluation
      We create a gold standard of 1,310 words manually annotated for:
      • Gold lemma
      • Gold POS
      • Lexical relevance (include in a dictionary): yes or no
      Among unknown words,
      - Proper nouns are the most common
      - Verbs are the least common
      Gold POS                                Type count   Ratio
      noun_prop                               584          45%
      noun                                    264          20%
      adj                                     255          19%
      verb                                    52           4%
      noun_fem_plural (pluralia tantum)       28           2%
      noun_broken_plural                      28           2%
      others                                  8            0.6%
        noun_masc_plural (pluralia tantum) (4), part (3), pron_dem (1)
      Excluded
      misspelling                             55           4%
      not_known                               15           1%
      colloquial                              19           1.5%
      Lexicographic relevance
      Include in a dictionary                 671          51%
      Don't include in a dictionary           639          49%
  25. 25. Testing and Evaluation
      Evaluating POS (accuracy)
      • Baseline: the most frequent tag (proper name) assigned to all unknown words: 45%
      • MADA: 60%
      • Voted POS tagging: 69%. When a lemma receives a different POS tag with a higher frequency, we take the higher-frequency tag.
      POS tagging                     Accuracy
      1  POS tagging baseline         45%
      2  MADA POS tagging             60%
      3  Voted POS tagging            69%
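Voted POS tagging can be read as a per-lemma majority vote. The sketch below assumes that each analysed occurrence contributes one vote and that the input is a list of `(lemma, pos)` pairs. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def vote_pos(tagged: list[tuple[str, str]]) -> dict[str, str]:
    """Assign each lemma the POS tag it received most often."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for lemma, pos in tagged:
        counts[lemma][pos] += 1
    return {lemma: tags.most_common(1)[0][0] for lemma, tags in counts.items()}


# Hypothetical per-occurrence predictions for two lemmas.
predictions = [
    ("musaw~iq", "noun"),
    ("musaw~iq", "noun"),
    ("musaw~iq", "adj"),
    ("tagriyda", "noun"),
]
voted = vote_pos(predictions)
print(voted)
```

Pooling the evidence across all occurrences of a lemma lets a few mis-tagged tokens be outvoted, which is consistent with the gain over MADA alone reported on the slide.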
  26. 26. Testing and Evaluation
      Evaluating Lemmatization (accuracy)
      • Baseline: new words appear in their base form: 45%
      • Pipelined, strict definite article 'al': 54%
      • Pipelined, ignoring definite article 'al': 63%
      Lemmatization                                                                               Accuracy
      1  Lemma first-order baseline                                                               45%
      2  Pipelined lemmatization (first-order decision) with strict definite article matching    54%
      3  Pipelined lemmatization (first-order decision) ignoring definite article matching       63%
  27. 27. Testing and Evaluation
      Evaluating Lemma Weighting
      • The weighting criteria aim to push lexicographically relevant words up the list and less interesting words down.
      • We aim to make the number of important words high in the top 100 and low in the bottom 100.
      Word Weight = ((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor
      Good words                                   In top 100   In bottom 100
      relying on frequency alone (baseline)        63           50
      relying on number of sister forms * 800      87           28
      relying on POS factor                        58           30
      using combined criteria                      78           15
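The weight formula translates directly into a small function. In this sketch "frequencies of sister forms" is taken to mean the summed corpus frequency of all sister forms, and the POS factor values are made up, because the slide does not list them.

```python
def word_weight(sister_forms: int, sister_freq_total: int, pos_factor: float) -> float:
    """Word Weight = ((sister forms * 800) + summed sister frequencies) / 2 + POS factor."""
    return (sister_forms * 800 + sister_freq_total) / 2 + pos_factor


# Hypothetical candidates: (lemma, number of sister forms, summed frequency, POS factor).
candidates = [
    ("lemma_a", 3, 120, 0.0),
    ("lemma_b", 1, 900, 0.0),
    ("lemma_c", 5, 40, 0.0),
]

# Highest weight first, so lexicographically relevant lemmas rise to the top.
ranked = sorted(candidates, key=lambda c: word_weight(*c[1:]), reverse=True)
for lemma, *features in ranked:
    print(lemma, word_weight(*features))
```

The multiplier of 800 makes the number of distinct sister forms count for far more than raw frequency, so a lemma attested in many inflected forms outranks a frequent lemma seen in only one form.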
  28. 28. Testing and Evaluation
      Oxford new words list: June 2012
      • BitTorrent: a protocol that underpins the practice of peer-to-peer file sharing
      • command line: a user interface that is navigated by typing commands
      • cybercast: a news or entertainment program transmitted over the Internet
      • subcommunity: a distinct grouping within a community
      • subjectivization: to make subjective
      • subpersonality: a personality mode that kicks in (appears on a temporary basis) to allow a person to cope with certain types of psychosocial situations
      • superglue v: to stick with superglue
  29. 29. Testing and Evaluation
      Words expected in the next Arabic dictionary/morphological analyser
  30. 30. Testing and Evaluation
  31. 31. Testing and Evaluation
  32. 32. Bird's Eye View
      Problem
      • Out-of-vocabulary (OOV) words cause problems for morphological analysers, parsers, MT, etc.
      • The manual extension of lexical databases is costly and time-consuming.
      • With the large amount of data, manual extension of lexicons becomes practically impossible.
      Solution
      • Creating an automatic method for updating a lexical database
      • Integrating a machine learning method with a finite-state guesser to lemmatize unknown words
      • Weighting new words by relevance and importance
  33. 33. Conclusion
      • We develop a methodology for automatically extracting and lemmatizing unknown words in Arabic
      • We pipeline a finite-state guesser with a machine learning tool for lemmatization
      • We develop a weighting mechanism for predicting the relevance and importance of lemmas
      • Out of 2,116,180 unknown words, we create a lexicon of 18,399 lemmatized, POS-tagged and weighted entries.