Arabic morphology and POS-tagging

  • Javatagger / Javatagger –e / Javatagger –v –s / aramorphclient / jaframe
  • Transcript

    • 1. Arabic morphology and POS-tagging
      A short intro with a couple of demonstrations
      11/19/2008
      1
      UW CLMA
    • 2. Outline
      Arabic morphology: overview of the problem
      Prior art, with a demonstration of Buckwalter’s AraMorph
      Sketch of enhancements to AraMorph
      Demonstration
      Future directions
    • 3. Arabic morphology: overview of the problem
      Short vowels are not represented
      The contrast between diphthongs and long vowels is not represented
      Most closed-class morphemes are written as affixes on the content-word categories: nouns, adjectives, verbs, and prepositions
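To make the first point concrete, here is a minimal sketch of what ordinary written Arabic leaves out. It strips the short-vowel diacritics from a fully vowelled word, leaving the form a reader normally sees. The function name and the choice of example word (kitaab, "book") are assumptions for illustration, not part of the talk.

```python
import re

# Arabic diacritics: fathatan through sukun (U+064B-U+0652),
# plus the dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Return text with short-vowel marks removed, as in everyday writing."""
    return DIACRITICS.sub("", text)


# kaf + kasra + ta + fatha + alif + ba: the fully vowelled spelling of "kitaab"
vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628"
written = strip_diacritics(vocalized)  # kaf + ta + alif + ba
```

The reverse direction, recovering the vowels from the written form, is the hard part, because several vowellings usually fit the same consonant skeleton.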
    • 4. Arabic morphology: overview (cont.)
      Some examples (glossing over a lot of detail):
      شاهد الرجل الفيلم فرجع إلى البيت
      $Ahd Alrjl Alfylm frjE {lA Albyt
      $aaHada r-rajul-u l-fiylm-a fa-rajaEa {ilaA l-bayt-i
      saw-3sg.m the-man-nom the-film-acc and-so-returned-3sg.m to the-house-gen
      The man watched the film and then went home.
      This example is not so bad
    • 5. Regular expressions for orthographic words
      (conj)? (enclitic_preposition)? noun_stem (plural)? (possessive_pronoun)?
      (conj)? (definiteness_marker)? noun_stem (plural)?
      (conj)? full_word_preposition (genitive_pronoun)?
      (conj)? complementizer (object_pron)?
      (conj)? (modal)? (ImpVerbSubjAgr) verb_stem (plural_subject_marker)? (object_pronoun)?
      (conj)? (modal)? verb_stem (perfVerbSubjAgr)? (object_pronoun)?
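The second pattern above can be turned into a working regular expression once each slot is filled with actual morphemes. The sketch below does this over Buckwalter-transliterated input with a tiny, hypothetical inventory (two conjunctions, the article `Al`, four noun stems, three plural endings). A real analyzer would draw these from a full lexicon.

```python
import re

# Toy morpheme inventories in Buckwalter transliteration.
# These lists are illustrative assumptions, not AraMorph's lexicon.
CONJ = r"(?P<conj>w|f)?"
DET = r"(?P<det>Al)?"
NOUN_STEM = r"(?P<stem>rjl|byt|fylm|ktAb)"
PLURAL = r"(?P<plural>At|wn|yn)?"

# (conj)? (definiteness_marker)? noun_stem (plural)?
NOUN_WORD = re.compile("^" + CONJ + DET + NOUN_STEM + PLURAL + "$")


def parse_noun(word):
    """Return the slots that matched, or None if the word does not fit."""
    match = NOUN_WORD.match(word)
    if match is None:
        return None
    return {slot: text for slot, text in match.groupdict().items()
            if text is not None}
```

For example, `parse_noun("wAlbyt")` fills the `conj`, `det`, and `stem` slots, while `parse_noun("fylm")` fills only `stem`: the regex engine first tries `f` as a conjunction, fails to find a stem in what is left, and backtracks. That backtracking is a small-scale version of the ambiguity discussed on the next slide.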
    • 6. Inherent ambiguity
      Some strings with multiple analyses
      فقد = fqd: either the verb
      fqd = he lost
      OR
      f qd = and so (verbal modal)
      fqd smEth = فقد سمعته; can be analyzed as
      a) f qd smE t h (and so I had heard him)
      b) fqd smEp h (he lost his reputation)
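The two readings of fqd fall out mechanically if every split of the string into prefix, stem, and suffix is checked against a lexicon. The sketch below shows this with a toy lexicon that contains only the entries needed for this one word. The table contents and function name are assumptions, and a real analyzer would also check that the three pieces are compatible with one another, which is omitted here.

```python
# Toy lexicon in Buckwalter transliteration (illustrative assumption).
PREFIXES = {"": "", "f": "and so"}
STEMS = {"fqd": "he lost", "qd": "verbal modal"}
SUFFIXES = {"": ""}


def segmentations(word):
    """Return every (prefix, stem, suffix) split found in the toy lexicon."""
    results = []
    for i in range(len(word) + 1):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
                results.append((prefix, stem, suffix))
    return results


analyses = segmentations("fqd")
```

Here `analyses` holds both readings from the slide: the bare verb `fqd`, and the conjunction `f` followed by `qd`. Nothing in the string itself says which one is intended, so the choice has to come from context.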
    • 7. Other issues beyond the scope of this talk
      Arabic spans 14 centuries and 22 countries
      Is the liturgical language of over 1 billion Muslims
      The Standard Language has never been a spoken variety.
      The vernaculars have never been standardized.
      The LDC corpus is the only annotated corpus that is readily available. The last time I looked, the treebank part was less than a million tokens.
    • 8. Prior Art
      Buckwalter’s AraMorph from the LDC (a port of work done at Xerox)
      Ported to Java on top of Lucene(!) by Pierrick Brihaye circa 2003 http://cvs.savannah.gnu.org/viewvc/aramorph
      Tagset and segmentation description http://www.ldc.upenn.edu/Catalog/docs/LDC2003T06/POS-info.txt
      Buckwalter’s transliteration scheme http://www.qamus.org/transliteration.htm
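The transliteration scheme maps each Arabic letter to a single ASCII character, so it can be sketched as a lookup table. The table below covers only the letters needed for a few words from the earlier examples, and the function name is an assumption. Characters outside the table are passed through unchanged.

```python
# Partial Arabic-to-Buckwalter table (letters used in the examples only).
BUCKWALTER = {
    "\u0627": "A",  # alif
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u062C": "j",  # jim
    "\u062F": "d",  # dal
    "\u0631": "r",  # ra
    "\u0633": "s",  # sin
    "\u0634": "$",  # shin
    "\u0639": "E",  # ayn
    "\u0641": "f",  # fa
    "\u0642": "q",  # qaf
    "\u0644": "l",  # lam
    "\u0645": "m",  # mim
    "\u0647": "h",  # ha
    "\u064A": "y",  # ya
}


def to_buckwalter(text):
    """Transliterate letter by letter; unknown characters are kept as-is."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


first_word = to_buckwalter("\u0634\u0627\u0647\u062F")  # the verb "watched"
```

Because the mapping is one-to-one, the same table inverted converts analyzer output back to Arabic script.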
    • 9. And now a demonstration of AraMorph
      The point here is that most word strings have more than one legal analysis.
      The other point is that the number of types is quite high, unless you do something to reveal the content word behind all the function morpheme affixes.
      Kitaab (book)
      Al-kitaab (the book)
      These two queries in Arabic return different sets of results on Google
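The effect on type counts can be seen with a handful of tokens. The sketch below counts distinct surface forms, then counts again after a deliberately naive clitic stripper removes a leading conjunction and the article. The token list is made up for illustration, forms are in Buckwalter transliteration (ktAb for kitaab), and the stripper is an assumption that stands in for a real morphological analysis.

```python
def strip_clitics(word):
    """Naively remove a conjunction (w/f) before the article, then the article."""
    for conj in ("w", "f"):
        if word.startswith(conj + "Al"):
            word = word[1:]
            break
    if word.startswith("Al"):
        word = word[2:]
    return word


# Hypothetical tokens: two content words under different function morphemes.
tokens = ["ktAb", "AlktAb", "wAlktAb", "byt", "Albyt", "fAlbyt"]

surface_types = set(tokens)
stem_types = {strip_clitics(t) for t in tokens}
```

Six surface types collapse to two once the affixes are peeled off, which is why an index built on raw strings treats kitaab and al-kitaab as unrelated queries.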
    • 10. A few words WRT AraMorph
      AraMorph will generate all the legal analyses for which it has an entry in its lexicon
      Pierrick Brihaye ported AraMorph to Java
      AraMorph is the first stage in a lot of Arabic text processing done by researchers in the US.
      The Java port was done on top of Lucene, which is an open source indexing and IR system
    • 11. Enhancements to AraMorph
      I built this POS tagger in stages on top of Pierrick Brihaye’s port of AraMorph
      The first thing I did was to port in a bigram model of segmented text from the LDC
      This was used to choose the most likely segmentation sequence out of all of the analyses returned by Buckwalter’s analyzer
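A minimal sketch of that selection step follows. Each word comes with a list of candidate segmentations, every combination is scored with a bigram model over segments, and the highest-scoring combination wins. The probabilities are invented toy values, and the function names and the fixed penalty for unseen bigrams are assumptions. The actual model was estimated from segmented LDC text.

```python
import itertools
import math

# Invented toy bigram probabilities over segments (Buckwalter transliteration).
BIGRAM_LOGPROB = {
    ("<s>", "fqd"): math.log(0.02),
    ("<s>", "f"): math.log(0.10),
    ("f", "qd"): math.log(0.30),
    ("qd", "smE"): math.log(0.20),
    ("smE", "t"): math.log(0.25),
    ("t", "h"): math.log(0.20),
    ("fqd", "smEp"): math.log(0.01),
    ("smEp", "h"): math.log(0.30),
}
UNSEEN = math.log(1e-6)  # assumed penalty for bigrams not in the table


def sequence_logprob(segments):
    """Sum bigram log-probabilities over a flat list of segments."""
    total, previous = 0.0, "<s>"
    for segment in segments:
        total += BIGRAM_LOGPROB.get((previous, segment), UNSEEN)
        previous = segment
    return total


def best_segmentation(candidates_per_word):
    """Try every combination of per-word candidates; keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for combo in itertools.product(*candidates_per_word):
        flat = [segment for word in combo for segment in word]
        score = sequence_logprob(flat)
        if score > best_score:
            best, best_score = combo, score
    return best


candidates = [
    [("fqd",), ("f", "qd")],            # analyses of the first word
    [("smE", "t", "h"), ("smEp", "h")],  # analyses of the second word
]
choice = best_segmentation(candidates)
```

With these toy numbers the model prefers reading (a) from the ambiguity slide. Exhaustive enumeration grows quickly with the number of words, which is one reason to restrict it to a short window.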
    • 12. Architecture (as it evolved)
      With a 5-word sliding window,
      generate all sequences of segmentations for that window,
      based on all the analyses returned by AraMorph.
      This scheme produced acceptable results.
      Some time later, a trigram model of the tags was added and
      given 50% weight alongside the segmentation scores to decide which tags to keep with the segments.
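The two pieces of this architecture, the sliding window and the equal weighting of the two models, can be sketched as follows. The slide does not say exactly how the scores were combined, so the linear interpolation of log scores here is an assumption, as are the function names, the sentence-start padding, and the penalty for unseen trigrams.

```python
import math

UNSEEN = math.log(1e-6)  # assumed penalty for tag trigrams not in the model


def windows(words, size=5):
    """Yield each run of `size` consecutive words, or the whole text if shorter."""
    if len(words) <= size:
        yield list(words)
        return
    for start in range(len(words) - size + 1):
        yield list(words[start:start + size])


def tag_trigram_logprob(tags, trigram_logprob):
    """Score a tag sequence with a trigram model, padding the start with <s>."""
    padded = ["<s>", "<s>"] + list(tags)
    return sum(
        trigram_logprob.get(tuple(padded[i:i + 3]), UNSEEN)
        for i in range(len(tags))
    )


def combined_score(seg_logprob, tag_logprob, tag_weight=0.5):
    """Interpolate the segmentation score and the tag score."""
    return (1.0 - tag_weight) * seg_logprob + tag_weight * tag_logprob
```

Each candidate analysis inside a window would get a `combined_score`, and the analysis with the highest value would supply both the segments and their tags.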
    • 13. This bears some similarity to other work done in 2005
      Habash, Nizar and Owen Rambow. Arabic Tokenization, Morphological Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05).
      They used Ripper (Cohen, 1996) to learn a rule-based classifier (Rip).
      They also used AraMorph as their starting point to produce all legal morphological sequences.
      http://www.mt-archive.info/ACL-2005-Habash-1.pdf
    • 14. How well does the POS tagger perform?
      Good question, still TBD
      I meant to hold out some of the training data and test the tagger against that piece of the LDC corpus.
      I ran out of time.
      Hand analysis puts accuracy at better than 90%.
      At some point I turned on the option to keep the vowels provided by AraMorph rather than toss them.
      This is observably less accurate.
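The held-out test that was planned comes down to a per-token comparison between the tagger's output and the gold annotation. A minimal sketch follows. The function name, the tag labels, and the four-token sample are hypothetical, and the sketch assumes both sides are already segmented identically, which a real evaluation would have to check.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of (segment, tag) pairs on which prediction and gold agree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical held-out sample: (segment, tag) pairs in Buckwalter transliteration.
gold = [("Al", "DET"), ("rjl", "NOUN"), ("Al", "DET"), ("byt", "NOUN")]
predicted = [("Al", "DET"), ("rjl", "NOUN"), ("Al", "DET"), ("byt", "VERB")]

accuracy = tagging_accuracy(gold, predicted)
```

Running the same comparison with and without the vowels kept would turn the "observably less accurate" impression into a number.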
    • 15. First: a word from my sponsor
      I’m allowed to talk about this system
      I was told that I could expose its functionality on a website
      I am not allowed to distribute it or use it for commercial purposes
      There is an earlier tagger that does not incorporate Lucene or AraMorph. It is based on Brill’s transformation-based (TB) learning, at
      http://innerbrat.org/segmentTagDownload
    • 16. The demos
      Tag to Buckwalter transliteration output
      Tag to ENAMEX-style tags
      Tag to UTF-8 Arabic
      Re-attaching the segments
      Reduced tagset
      Reloading the dictionary every time is annoying
      Tag with a server and thin client
    • 17. Future directions
      Any further work will require me to rebuild everything from scratch
      Uncouple it from Lucene
      Port it to C++ or C#
      Bring in a statistical language model or two for recovering the short vowels
      Use some state-of-the-art machine learning toolkits to improve performance
      Start annotating some of my corpora
    • 18. Future directions
      See if I can embed it in some practical applications, such as
        language-teaching document production
        preprocessing for machine translation systems
        preprocessing for ASR
        text to speech
      Bootstrap annotation tools for other Afro-Asiatic languages
        Tigrinya, Somali, Hausa, Hebrew, Arabic vernaculars, Amharic, Amazigh, Coptic, Egyptian hieroglyphs, Babylonian, Punic
      Help with ODIN??
    • 19. The end
