• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
 

SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

on

  • 1,598 views

 

Statistics

Views

Total Views
1,598
Views on SlideShare
1,580
Embed Views
18

Actions

Likes
0
Downloads
3
Comments
0

2 Embeds 18

http://aflat.org 17
http://www.aflat.org 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation Presentation Transcript

    • SYNERGY A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking May 18th, 2010 Language Technologies Institute Carnegie Mellon University Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • NER for Resource-scarce languages: Overview Named Entity Recognition (NER) is the task of finding names in text, and optionally, classifying them into persons, organizations, locations, etc. Example: Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid. Vital first step for many problems such as parsing, co-reference resolution, translation, indexing and semantic analysis. State-of-art NER systems rely on machine learning. Requires large amounts of labeled training data. Costly and time-consuming. Upshot: For many widely spoken languages e.g. Swahili, no NER systems freely available. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Motivation Online Machine Translation (MT) systems such as Google Translate and Microsoft Bing Translator support many resource-scarce languages for which NER systems don’t exist. Google Translate supports Swahili ⇐⇒ English translation. Quality of translation is far from perfect. However, this might still be good enough for NER. Why? Observation: Named Entities (NEs) are preserved during Swahili ⇐⇒ English translation. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Motivation: Named Entity Preservation - Example All Named Entities shown in Red: Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” English version from Google Translate: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” Sample English sentence: “President Obama advised German Chancellor Angela Merkel.” Swahili version from Google Translate: “Rais Obama wanashauriwa Ujerumani Angela Merkel.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • SYNERGY: Big Picture 1. Swahili text is translated to English. 2. The best off-the-shelf NER systems are applied to the resulting English text. 3. English NEs are mapped back to words in the Swahili text using word alignment. 4. Post-Processing using dictionaries to improve performance Observation: SYNERGY addresses the problem of NER for a new language by breaking it into three relatively easier problems. Observation: No longer need labeled training data. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Comparison to existing NER systems No Swahili NER system available, so how do we compare SYNERGY and its MT-based approach? Extend SYNERGY to perform NER for another language for which freely available NER systems do exist. We use Arabic, because it is typically considered a ‘hard’ language for which to do NER. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Resources - Machine Translation Google Translate for Swahili and Arabic text. Microsoft Bing Translator not yet ready for Arabic, doesn’t support Swahili. Would like to use other systems, but for these languages, none freely available. Advantage: As soon as new languages become available on any MT system, SYNERGY can be used with few changes. Can also use MT systems automatically generated from parallel data using freely available toolkits such as Moses. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Resources - Named Entities Swahili labeled data: Helsinki Corpus of Swahili. Arabic labeled data: ANERcorp. NER: Stanford’s CRF based NER system and UIUC’s LBJ NE Tagger. State-of-art in English NER. Use only off-the-shelf NERs to show that a good NER system can be developed for new languages quickly. Post-Processing: Swahili-English dictionary by Kamusi Project and the Linguistic Data Consortium’s Arabic TreeBank. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 1 - Translation to English Perl interface to Google Translate. Translate each sentence individually. Many sentences too long for Google Translate API, need to be split. Named Entities unaffected by this. Produces a translated English document for each Swahili document. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 2 - NER on Translated English Text SYNERGY uses a combination of Stanford and UIUC NER systems called Union. Why? Observation: Stanford and UIUC NER systems don’t always make the same mistakes. Idea: Exploit this to improve performance. System Precision Recall F1 Stanford 0.962 0.963 0.963 UIUC 0.968 0.964 0.966 Union 0.952 0.985 0.968 Table: Performance of NER systems on CoNLL 2003 Test Set Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 3 - Word Alignment Alignment of English NEs with words in original Swahili document to discover Swahili NEs. Most challenging component of SYNERGY. We try two different algorithms: 1. Window Based Alignment. 2. GIZA++ Based Alignment. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 3 - Word Alignment - Window Based Initial Approach: 1. Translate each English NE word back to a Swahili word 2. Search in the original Swahili document for a match within a window of words. Produces very few matches. Modification: 1. Create a new English document by translating each Swahili word one at a time 2. Search in this new document for English NE words within a window of words. 3. Keep pointers from each English word in new document to Swahili word that produces it. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 3 - Word Alignment - GIZA++ Based GIZA++ is state-of-art for word alignment in Machine Translation. 1. Takes an input language corpus and an output language corpus, and automatically generates word classes for both. 2. For each sentence in the input language corpus finds the most probable alignment of the corresponding output language sentence. SYNERGY: The translated English documents serve as the input corpus and the Swahili documents as the output corpus. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Architecture: 4 - Post-Processing Compile POS tagged English, Swahili and Arabic dictionaries. Divide into Named Entity (NE) and Non-Named Entity (NNE) sections. A word may occur in both sections. 1. If a word labeled as a Named Entity occurs in the NNE section of a dictionary but not in its NE section, the label is removed. 2. If a word not labeled as a Named Entity occurs in the NE section of a dictionary but not in its NNE section, the word is labeled as a Named Entity. Yields significant improvement in NER performance. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Results Version Prec Recl F1 Window w/o PP 0.676 0.694 0.685 Window with PP 0.818 0.704 0.757 GIZA++ w/o PP 0.534 0.900 0.670 GIZA++ with PP 0.754 0.886 0.815 Table: Results for Swahili NER System Prec Recl F1 SYNERGY Window w/o PP 0.680 0.502 0.578 SYNERGY Window with PP 0.848 0.600 0.703 SYNERGY GIZA++ w/o PP 0.530 0.702 0.603 SYNERGY GIZA++ with PP 0.761 0.817 0.788 Benajiba and Rosso, 2008 0.869 0.727 0.792 Table: Results for Arabic NER Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Error Analysis Three possible types of errors: Named Entities that are lost during translation from the source language to English. Errors made by the English NER module of SYNERGY A correctly recognized English NE word that gets mapped to the wrong word in the source language document during alignment Analysis of distribution of SYNERGY’s errors requires NE gold standard data for the English translations of our Swahili and Arabic test sets in addition to native gold standard data. Since no other known systems employ an MT based approach to NER, such data is not currently available. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Example - Swahili True NEs in Bold, System output in Red. Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” Translated English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” NE labeled English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” After GIZA++ alignment and post-processing: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Observations Initial motivation validated by SYNERGY’s results: Although there are many inaccuracies in both the Swahili and Arabic translations, a vast majority of NE words are preserved across translation and successfully recognized by an English NER system. Performing English NER helps us avoid difficulties inherent in native Swahili and Arabic NER, e.g. ambiguous function words, recognizing clitics, etc. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Future work - Co-reference Resolution Determining if two entity mentions in a text refer to the same entity in real world or not. Three types of entity mentions: 1. Named mentions (i.e. NEs) (e.g. Barack Obama) 2. Nominal mentions (e.g. President of United States) 3. Pronominal mentions (e.g. he). Poor Results. Why? Ans: Unlike Named Entities, nominal and pronominal mentions not preserved during MT. Possible area of future Research. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Conclusion Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili. SYNERGY’s F1 score for Arabic comes very close to the state-of-the-art. SYNERGY’s F1 score for Swahili is a good first effort in the field of Swahili NER. Freely available at http://www.cs.cmu.edu/~encore/. We hope it will be a valuable tool for researchers wishing to work with Swahili text. Intend to use SYNERGY to perform NER for other resource-scarce languages supported by online translators. Related Talk: “ENCORE: Experiments with a Synthetic Entity Co-reference Resolution Tool” by Bo Lin, LREC Workshop on Resources and Evaluation for Entity Resolution and Entity Management (W16), May 22, 2010. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • And we’re done. Thank You! Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
    • Example - Arabic Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.” NE labeled English sentence: “Refer to this phenomenon because it is the atmosphere with a team of President George W. Bush Rice and Ambassador John Bolton at the United Nations which will recognize a lot of elements that might make some Europeans to deter the persistence of the United States decision to invest in the Israeli aggression on Lebanon.” After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY