Methods for Amharic Part-of-Speech Tagging

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Methods for Amharic Part-of-Speech Tagging - Presentation Transcript

    1. Methods for Amharic Part of Speech Tagging Björn Gambäck, Fredrik Olsson, Atelach Alemu Argaw, Lars Asker Amharic Amharic Experiments and Results • is used for country-wide communication in Ethiopia. Corpus • is spoken by about 30 million people as a first or second Data Set language. 210k words • is a Semitic language written 1065 Amharic news articles from left to right. • uses a unique script (fidel) Tag Set which has has 33 basic forms 30 tags – Full tagset by the Ethiopian Languages Research and 33 * 7 syllographs Center (ELRC) [Demeke and Getachew, 2006] • has a rich verbal morphology 11 tags – Basic tagset by ELRC based on triconsonantal 10 tags – Alternative tagset by Sisay [Fissaha, 2005] roots. ’ • Subject, gender, number, Tagged Corpus etc., are indicated as bound • Cleaned morphemes. • 200,863 words • nouns (and adjectives) can be inflected for gender, 10 Fold average statistics number, definiteness, etc Words Known Unknown 20,086 17,727 2,359 E.g. sbr : verb forms 88.26% 11.74% form pattern Root sbr CCC Results perfect säbbär CVCCVC Imperfect säbr CVCC gerund säbr CVCC ELRC BASIC SISAY imperative sbär CCVC TnT 85.56 92.55 92.60 causative assäbbär as-CVCCVC STD DEV 0.42 0.31 0.32 passive täsäbbär täs-CVCCVC KNOWN 90.00 93.95 93.99 UNKNOWN 52.13 82.06 82.20 SVMTool 88.30 92.77 92.80 Open Source STD DEV 0.41 0.31 0.37 Taggers KNOWN 89.58 93.37 93.34 UNKNOWN 78.68 88.23 88.74 Own folds 88.69 92.97 92.99 • TnT STD DEV 0.33 0.17 0.26 • Hidden Markov Model • Viterbi algorithm MaxEnt 87.87 92.56 92.60 • Maximize STD DEV 0.49 0.38 0.43 P(wordn|tagn)*P(tagn|tag1...n−1) KNOWN 89.44 93.26 93.27 • SVMTool UNKNOWN 76.05 87.29 87.61 • Support Vector Machines Own folds 90.83 94.64 94.52 • High dimensional vectors STD DEV 1.37 1.11 0.69 • Hyperplane separation BASELINE 35.50 58.26 59.61 algorithm • MALLET • Maximum Entropy Discussion • Linear classifier • Log likelihood maximization • TnT – Best performance for Known words • SVMTool – Best performance for unknown words and overall Reported performance of the • MaxEnt – Best performance when it uses own folds taggers (Wall Street Journal) Tagger Performance Future Work Overall Known Unknown • Morphological analysis TnT 96.7% 97.0% 85.5% • Combining Taggers • Use external knowledge sources (e.g. machine readable dictionaries) SVMTool 96.9 % 97.2 % 83.5 % • Semi supervised / unsupervised learning Mallet 96.6 % NA NA
    SlideShare Zeitgeist 2009

    + Guy De PauwGuy De Pauw Nominate

    custom

    289 views, 0 favs, 1 embeds more stats

    by Björn Gambäck, Fredrik Olsson, Atelach Alemu A more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 289
      • 283 on SlideShare
      • 6 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 1
    Most viewed embeds
    • 6 views on http://www.aflat.org

    more

    All embeds
    • 6 views on http://www.aflat.org

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories