  • JavataggerJavatagger –eJavatagger –v –saramorphclientjaframe
  • Arabic morphology and POS-tagging

    1. Arabic morphology and POS-tagging
       A short intro with a couple of demonstrations
       11/19/2008, UW CLMA
    2. Outline
       Arabic morphology: overview of the problem
       Prior art, with a demonstration of Buckwalter’s AraMorph
       Sketch of enhancements to AraMorph
       Demonstration
       Future directions
    3. Arabic morphology: overview of the problem
       Short vowels are not represented in the orthography.
       The contrast between diphthongs and long vowels is not represented.
       Most closed-class morphemes are written as affixes on the content-word categories: nouns, adjectives, verbs, and prepositions.
    4. Arabic morphology: overview (cont.)
       Some examples (glossing over a lot of detail):
       شاهد الرجل الفيلم فرجع إلى البيت
       $Ahd Alrjl Alfylm frjE {lA Albyt
       $aaHada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i
       Saw-3sg.m the-man-nom the-film-acc and-so-returned-3sg.m to the-house-gen
       The man watched the film and then went home.
       This example is not so bad.
    5. Regular expressions for orthographic words
       (conj)? (enclitic_preposition)? noun_stem (plural)? (possessive_pronoun)?
       (conj)? (definiteness_marker)? noun_stem (plural)?
       (conj)? full_word_preposition (genitive_pronoun)?
       (conj)? complementizer (object_pronoun)?
       (conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)?
       (conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?
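To make the second pattern on this slide concrete, here is a minimal sketch in Python. The affix inventories are hypothetical and deliberately tiny (a couple of conjunctions, the definite article, three plural endings, all in Buckwalter transliteration), and the names `NOUN_PATTERN` and `split_noun` are illustrative only, not part of AraMorph.

```python
import re

# Hypothetical, deliberately tiny affix inventories (Buckwalter transliteration).
# Pattern sketched: (conj)? (definiteness_marker)? noun_stem (plural)?
NOUN_PATTERN = re.compile(
    r"(?P<conj>[wf])?"         # conjunction: w- 'and', f- 'and so'
    r"(?P<det>Al)?"            # definiteness marker
    r"(?P<stem>.+?)"           # noun stem, matched lazily so suffixes can peel off
    r"(?P<plural>wn|yn|At)?"   # a few plural endings
)

def split_noun(word):
    """Return the named pieces of one orthographic word, or None if no match."""
    match = NOUN_PATTERN.fullmatch(word)
    return match.groupdict() if match else None

print(split_noun("wAlktAb"))   # 'and the book'
print(split_noun("AlmElmwn"))  # 'the teachers'
print(split_noun("wld"))       # 'boy': the pattern wrongly strips w- as a conjunction
```

The last call shows why a pattern alone is not enough: without a stem lexicon, any word that happens to begin with w or f is over-segmented.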
    6. Inherent ambiguity
       Some strings have multiple analyses.
       فقد = fqd is either
       the verb fqd = he lost
       OR
       f qd = and so + (verbal modal)
       فقد سمعته = fqd smEth can be analyzed as
       a) f qd smE t h (and so I had heard him)
       b) fqd smEp h (he lost his reputation)
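The fqd ambiguity can be reproduced with a toy prefix/stem/suffix lookup. This is only a sketch in the spirit of a lexicon-driven analyzer: the three dictionaries below are hypothetical two-entry stand-ins, and real compatibility checks between prefixes, stems, and suffixes are left out.

```python
# Hypothetical toy lexicons (Buckwalter transliteration); not AraMorph's data.
PREFIXES = {"": "no prefix", "f": "conjunction 'and so'"}
STEMS = {"fqd": "verb 'he lost'", "qd": "verbal modal"}
SUFFIXES = {"": "no suffix"}

def analyses(word):
    """Enumerate every prefix/stem/suffix split whose parts are all in the lexicons."""
    results = []
    for i in range(len(word)):                  # i = end of the prefix
        for j in range(i + 1, len(word) + 1):   # j = end of the (non-empty) stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                results.append((prefix, stem, suffix))
    return results

for prefix, stem, suffix in analyses("fqd"):
    print(repr(prefix), repr(stem), repr(suffix), "->", STEMS[stem])
```

Both readings from the slide come back: the whole string as a verb stem, and f + qd as conjunction plus modal.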
    7. Other issues beyond the scope of this talk
       Arabic spans 14 centuries and 22 countries.
       It is the liturgical language of over 1 billion Muslims.
       The Standard Language has never been a spoken variety.
       The vernaculars have never been standardized.
       The LDC corpus is the only annotated corpus that is readily available. The last time I looked, the treebank part was less than a million tokens.
    8. Prior Art
       Buckwalter’s AraMorph from LDC (a port from work done @ Xerox)
       Ported to Java on top of Lucene(!) by Pierrick Brihaye circa 2003: http://cvs.savannah.gnu.org/viewvc/aramorph
       Tagset and segmentation description: http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt
       Buckwalter’s transliteration scheme: http://www.qamus.org/transliteration.htm
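For readers who have not seen the transliteration scheme, here is a minimal sketch of the letter mapping in Python. It covers only the base letters (hamza forms through yaa) and skips the short-vowel diacritics; the names `AR2BW` and `to_buckwalter` are illustrative and not taken from AraMorph.

```python
# Buckwalter symbols for the contiguous Unicode ranges U+0621..U+063A and
# U+0641..U+064A (base Arabic letters only; diacritics are omitted here).
_LETTERS_0621 = "'|>&<}AbptvjHxd*rzs$SDTZEg"
_LETTERS_0641 = "fqklmnhwYy"

AR2BW = {chr(0x0621 + i): sym for i, sym in enumerate(_LETTERS_0621)}
AR2BW.update({chr(0x0641 + i): sym for i, sym in enumerate(_LETTERS_0641)})

def to_buckwalter(text):
    """Map each Arabic letter to its Buckwalter symbol; pass other characters through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)

print(to_buckwalter("\u0641\u0642\u062f"))        # the string glossed as fqd
print(to_buckwalter("\u0634\u0627\u0647\u062f"))  # first word of the example sentence
```

Because the mapping is one symbol per letter, it can be inverted to get back to UTF-8 Arabic from tagger output.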
    9. And now a demonstration of AraMorph
       The point here is that most word strings have more than one legal analysis.
       The other point is that the number of types is quite high, unless you do something to reveal the content word behind all the function-morpheme affixes.
       Kitaab (book)
       Al-kitaab (the book)
       These two queries in Arabic return different sets of results on Google.
    10. A few words about AraMorph
        AraMorph will generate all the legal analyses for which it has an entry in its lexicon.
        Pierrick Brihaye ported AraMorph to Java.
        AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US.
        The Java port was done on top of Lucene, which is an open source indexing and IR system.
    11. Enhancements to AraMorph
        I built this POS tagger in stages on top of Pierrick Brihaye’s port of AraMorph.
        The first thing I did was to port in a bigram model of segmented text from the LDC.
        This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter’s analyzer.
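A minimal sketch of that selection step, assuming a bigram model stored as log-probabilities over adjacent segments. The probabilities below are invented for illustration (they are not LDC figures), and `score` and `best_segmentation` are hypothetical names, not the tagger's actual code.

```python
import math
from itertools import product

# Invented bigram probabilities over segments, for illustration only.
BIGRAM_P = {
    ("<s>", "f"): 0.2, ("f", "qd"): 0.5, ("qd", "smE"): 0.3,
    ("smE", "t"): 0.4, ("t", "h"): 0.3,
    ("<s>", "fqd"): 0.05, ("fqd", "smEp"): 0.1, ("smEp", "h"): 0.4,
}
UNSEEN_LOGP = math.log(1e-6)  # crude floor for unseen bigrams (an assumption)

def score(segments):
    """Sum of bigram log-probabilities over a flat segment sequence."""
    history = ["<s>"] + segments[:-1]
    return sum(
        math.log(BIGRAM_P[pair]) if pair in BIGRAM_P else UNSEEN_LOGP
        for pair in zip(history, segments)
    )

def best_segmentation(candidates_per_word):
    """Try every combination of per-word analyses and keep the best-scoring one."""
    best_segments, best_score = None, -math.inf
    for combo in product(*candidates_per_word):
        segments = [seg for analysis in combo for seg in analysis]
        current = score(segments)
        if current > best_score:
            best_segments, best_score = segments, current
    return best_segments

# The two-word example from the ambiguity slide: fqd smEth
candidates = [
    [("fqd",), ("f", "qd")],
    [("smE", "t", "h"), ("smEp", "h")],
]
print(best_segmentation(candidates))
```

With these made-up numbers the "and so I had heard him" reading wins; a different table could just as easily prefer the other reading, which is the point of training on segmented text.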
    12. Architecture (as it evolved)
        With a 5-word sliding window, generate all sequences of segmentations for that window, based on all the analyses returned by AraMorph.
        This scheme produced acceptable results.
        Sometime later a trigram model of the tags was added and given 50% weight with the segmentation scores to decide which tags to keep with the segments.
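The slide does not say how the 50% weight was applied, so the sketch below assumes a simple linear interpolation of the two log scores. The window helper, the function names, and the example scores are all hypothetical.

```python
def windows(words, size=5):
    """Yield successive overlapping windows of up to `size` words."""
    if len(words) <= size:
        yield list(words)
        return
    for start in range(len(words) - size + 1):
        yield list(words[start:start + size])

def combined_score(seg_logp, tag_logp, tag_weight=0.5):
    """Assumed combination: linear interpolation of segment and tag log scores."""
    return (1.0 - tag_weight) * seg_logp + tag_weight * tag_logp

def pick(candidates, tag_weight=0.5):
    """candidates: (label, segment_bigram_logp, tag_trigram_logp) triples."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2], tag_weight))[0]

# Made-up scores: A has the better segmentation score, B the better tag score.
candidates = [("A", -2.0, -6.0), ("B", -3.0, -3.0)]
print(pick(candidates))                  # equal weighting
print(pick(candidates, tag_weight=0.0))  # segmentation score only
print(len(list(windows("w1 w2 w3 w4 w5 w6 w7".split()))))
```

Setting `tag_weight` to 0.0 recovers the earlier segmentation-only behaviour, which makes it easy to compare the two stages on the same input.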
    13. This bears some similarity to other work done in 2005
        Habash, Nizar and Owen Rambow. 2005. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05).
        Their team used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip).
        They also used AraMorph as their starting point to produce all legal morphological sequences.
        http://www.mt-archive.info/ACL-2005-Habash-1.pdf
    14. How well does the POS tagger perform?
        Good question, still TBD.
        I meant to pull out some of the training data and test it against a piece of the LDC corpus.
        I ran out of time.
        Hand analysis puts it at better than 90%.
        At some point I turned on the option to not toss the vowels provided by AraMorph.
        This is observably less accurate.
    15. First: a word from my sponsor
        I’m allowed to talk about this system.
        I was told that I could expose its functionality on a website.
        I am not allowed to distribute it or use it for commercial purposes.
        There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill’s transformation-based (TB) learning and is at http://innerbrat.org/segmentTagDownload
    16. The demos
        Tag to Buckwalter transliteration output
        Tag to enamex-style tags
        Tag to UTF-8 Arabic
        Re-attaching the segments
        Reduced tagset
        Reloading the dictionary every time is annoying
        Tag with a server and thin client
    17. Future directions
        Any further work will require me to rebuild everything from scratch
        Uncouple it from Lucene
        Port it to C++ or C#
        Bring in a statistical language model or two for recovering the short vowels
        Use some state-of-the-art machine learning toolkits to improve performance
        Start annotating some of my corpora
    18. Future directions
        See if I can embed it in some practical applications such as
        language teaching document production
        preprocessing for machine translation systems
        preprocessing ASR
        text to speech
        Bootstrap annotation tools for other Afro-Asiatic languages: Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian hieroglyphs, Babylonian, Punic
        Help with ODIN??
    19. The end
