SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation
1. SYNERGY
A Named Entity Recognition System for Resource-scarce
Languages such as Swahili using Online Machine Translation
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking
May 18th, 2010
Language Technologies Institute
Carnegie Mellon University
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
2. NER for Resource-scarce languages: Overview
Named Entity Recognition (NER) is the task of finding names
in text, and optionally, classifying them into persons,
organizations, locations, etc.
Example: Wolff, currently a journalist in Argentina, played
with Del Bosque in the final years of the seventies in Real
Madrid.
Vital first step for many problems such as parsing, co-reference
resolution, translation, indexing and semantic analysis.
State-of-the-art NER systems rely on machine learning.
This requires large amounts of labeled training data, which is
costly and time-consuming to produce.
Upshot: For many widely spoken languages, e.g. Swahili, no
NER systems are freely available.
3. Motivation
Online Machine Translation (MT) systems such as Google
Translate and Microsoft Bing Translator support many
resource-scarce languages for which NER systems don’t exist.
Google Translate supports Swahili ⇐⇒ English translation.
Quality of translation is far from perfect.
However, this might still be good enough for NER. Why?
Observation: Named Entities (NEs) are preserved during
Swahili ⇐⇒ English translation.
4. Motivation: Named Entity Preservation - Example
All Named Entities shown in Red:
Sample Swahili sentence: “Lundenga aliwataja mawakala
ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa
Dodoma Mbeya Mwanza na Ruvuma.”
English version from Google Translate: “Lundenga mentioned
shatuma agents who have a prayer from the regions of
Dodoma Iringa Mwanza Mbeya and Ruvuma.”
Sample English sentence: “President Obama advised German
Chancellor Angela Merkel.”
Swahili version from Google Translate: “Rais Obama
wanashauriwa Ujerumani Angela Merkel.”
5. SYNERGY: Big Picture
1. Swahili text is translated to English.
2. The best off-the-shelf NER systems are applied to the
resulting English text.
3. English NEs are mapped back to words in the Swahili text
using word alignment.
4. Post-processing using dictionaries to improve performance.
Observation: SYNERGY addresses the problem of NER for a
new language by breaking it into three relatively easier
problems: translation, English NER and word alignment.
Observation: Labeled training data is no longer needed.
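
The four steps above can be sketched as a composition of interchangeable stages. This is only an illustrative outline, not SYNERGY's actual code: the function name `run_pipeline`, the stage signatures and the toy stand-in stages are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set


def run_pipeline(
    source_tokens: List[str],
    translate: Callable[[List[str]], List[str]],
    english_ner: Callable[[List[str]], Set[int]],
    align: Callable[[List[str], List[str]], Dict[int, int]],
    post_process: Callable[[List[str], Set[int]], Set[int]],
) -> Set[int]:
    """Return the indices of source-language tokens labeled as NEs."""
    english_tokens = translate(source_tokens)          # step 1: MT to English
    english_ne = english_ner(english_tokens)           # step 2: English NER
    eng_to_src = align(english_tokens, source_tokens)  # step 3: word alignment
    source_ne = {eng_to_src[i] for i in english_ne if i in eng_to_src}
    return post_process(source_tokens, source_ne)      # step 4: dictionaries


# Toy stand-ins (hypothetical), just enough to exercise the control flow.
toy_lexicon = {"Rais": "President", "wanashauriwa": "advised"}


def toy_translate(tokens: List[str]) -> List[str]:
    return [toy_lexicon.get(t, t) for t in tokens]


def toy_ner(tokens: List[str]) -> Set[int]:
    return {i for i, t in enumerate(tokens) if t == "Obama"}


def toy_align(english: List[str], source: List[str]) -> Dict[int, int]:
    # Position-by-position alignment; real alignments are rarely this simple.
    return {i: i for i in range(min(len(english), len(source)))}


def toy_post_process(tokens: List[str], ne: Set[int]) -> Set[int]:
    return ne


result = run_pipeline(
    ["Rais", "Obama", "wanashauriwa"],
    toy_translate, toy_ner, toy_align, toy_post_process,
)
```

Passing the stages in as functions mirrors the point made on the slide: each stage can be swapped (a different MT system, NER tagger or aligner) without touching the others.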
6. Comparison to existing NER systems
No Swahili NER system available, so how do we compare
SYNERGY and its MT-based approach?
Extend SYNERGY to perform NER for another language for
which freely available NER systems do exist.
We use Arabic, because it is typically considered a ‘hard’
language for which to do NER.
7. Resources - Machine Translation
Google Translate for Swahili and Arabic text. Microsoft Bing
Translator not yet ready for Arabic, doesn’t support Swahili.
Would like to use other systems, but for these languages, none
freely available.
Advantage: As soon as new languages become available on
any MT system, SYNERGY can be used with few changes.
Can also use MT systems automatically generated from
parallel data using freely available toolkits such as Moses.
8. Resources - Named Entities
Swahili labeled data: Helsinki Corpus of Swahili.
Arabic labeled data: ANERcorp.
NER: Stanford’s CRF-based NER system and UIUC’s LBJ NE
Tagger, both state-of-the-art in English NER.
Use only off-the-shelf NERs to show that a good NER system
can be developed for new languages quickly.
Post-Processing: Swahili-English dictionary by Kamusi Project
and the Linguistic Data Consortium’s Arabic TreeBank.
9. Architecture: 1 - Translation to English
Perl interface to Google Translate.
Translate each sentence individually.
Many sentences are too long for the Google Translate API and
need to be split. Named Entities are unaffected by this.
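
One way the splitting could work is to cut long sentences at word boundaries so that no individual word, and hence no single name token, is broken. The sketch below is an assumption about how this might be done; the slides give neither the actual splitting rule nor the API's length limit, so `split_for_translation` and `max_chars` are hypothetical.

```python
from typing import List


def split_for_translation(sentence: str, max_chars: int) -> List[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sentence = "Lundenga aliwataja mawakala ambao wameshatuma maombi"
chunks = split_for_translation(sentence, max_chars=20)
```

Each chunk would then be translated separately and the English pieces concatenated in order.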
Produces a translated English document for each Swahili
document.
10. Architecture: 2 - NER on Translated English Text
SYNERGY uses a combination of Stanford and UIUC NER
systems called Union.
Why? Observation: Stanford and UIUC NER systems don’t
always make the same mistakes.
Idea: Exploit this to improve performance.
System     Precision  Recall  F1
Stanford   0.962      0.963   0.963
UIUC       0.968      0.964   0.966
Union      0.952      0.985   0.968
Table: Performance of NER systems on CoNLL 2003 Test Set
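
The slides do not spell out how Union combines the two taggers. A plausible reading, consistent with its higher recall and lower precision, is that a token counts as a Named Entity when either system tags it. The sketch below assumes exactly that, and additionally assumes that the first system's label wins when the two disagree on the entity type; the function name and tag strings are illustrative.

```python
from typing import List


def union_tags(tags_a: List[str], tags_b: List[str], outside: str = "O") -> List[str]:
    """Combine two per-token tag sequences: keep any non-outside tag.

    When both systems tag a token, the tag from tags_a is kept (an assumption).
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must cover the same tokens")
    return [a if a != outside else b for a, b in zip(tags_a, tags_b)]


# Hypothetical outputs of two taggers on the same four tokens.
system_a = ["PER", "O", "O", "LOC"]
system_b = ["PER", "O", "ORG", "O"]
combined = union_tags(system_a, system_b)
```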
11. Architecture: 3 - Word Alignment
Alignment of English NEs with words in original Swahili
document to discover Swahili NEs.
Most challenging component of SYNERGY.
We try two different algorithms:
1. Window Based Alignment.
2. GIZA++ Based Alignment.
12. Architecture: 3 - Word Alignment - Window Based
Initial Approach:
1. Translate each English NE word back to a Swahili word
2. Search in the original Swahili document for a match within a
window of words.
Produces very few matches.
Modification:
1. Create a new English document by translating each Swahili
word one at a time
2. Search in this new document for English NE words within a
window of words.
3. Keep pointers from each English word in new document to
Swahili word that produces it.
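
The modified window-based approach can be sketched as follows. This is a simplified illustration under several assumptions: each Swahili word translates to a single English token, the window is centered on the NE's position in the translated sentence, and matching is case-insensitive. The function `window_align`, the `translate_word` callback and the default window size are hypothetical.

```python
from typing import Callable, List, Set


def window_align(
    source_tokens: List[str],
    english_tokens: List[str],
    english_ne_idx: Set[int],
    translate_word: Callable[[str], str],
    window: int = 3,
) -> Set[int]:
    """Map English NE positions back to source-token positions."""
    # Step 1: word-by-word translation. Position i of the gloss is produced
    # by source token i, so the index itself serves as the back-pointer.
    gloss = [translate_word(tok).lower() for tok in source_tokens]
    source_ne: Set[int] = set()
    for e in english_ne_idx:
        target = english_tokens[e].lower()
        lo = max(0, e - window)
        hi = min(len(gloss), e + window + 1)
        # Step 2: look for the NE word near the same position in the gloss.
        for s in range(lo, hi):
            if gloss[s] == target:
                source_ne.add(s)  # Step 3: follow the pointer to the source
    return source_ne


source = ("Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni "
          "kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma").split()
english = ("Lundenga mentioned shatuma agents who have a prayer from the "
           "regions of Dodoma Iringa Mwanza Mbeya and Ruvuma").split()
names = {"Lundenga", "Dodoma", "Iringa", "Mwanza", "Mbeya", "Ruvuma"}
english_ne = {i for i, tok in enumerate(english) if tok in names}

# Toy word translator: names pass through unchanged, which is all this
# example needs. A real system would query the MT service per word.
found = window_align(source, english, english_ne, translate_word=lambda w: w)
labeled = [source[i] for i in sorted(found)]
```

The window tolerates local reordering, such as the swapped region names in this sentence, while keeping a repeated word elsewhere in the document from matching.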
13. Architecture: 3 - Word Alignment - GIZA++ Based
GIZA++ is state-of-the-art for word alignment in Machine
Translation.
1. Takes an input language corpus and an output language
corpus, and automatically generates word classes for both.
2. For each sentence in the input language corpus finds the most
probable alignment of the corresponding output language
sentence.
SYNERGY: The translated English documents serve as the
input corpus and the Swahili documents as the output corpus.
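
Once an aligner has produced word links, NE labels can be carried from the English tokens to the Swahili tokens they align to. The sketch below assumes the alignment is already available as (English index, Swahili index) pairs; it does not parse GIZA++'s own output files, and the links and labels in the example are hand-written for illustration.

```python
from typing import Iterable, List, Tuple


def project_labels(
    english_labels: List[str],
    links: Iterable[Tuple[int, int]],
    source_len: int,
    outside: str = "O",
) -> List[str]:
    """Copy each non-outside English label onto the aligned source token."""
    source_labels = [outside] * source_len
    for eng_idx, src_idx in links:
        label = english_labels[eng_idx]
        if label != outside:
            source_labels[src_idx] = label
    return source_labels


swahili = "mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma".split()
english = "regions of Dodoma Iringa Mwanza Mbeya and Ruvuma".split()
english_labels = ["O", "O", "LOC", "LOC", "LOC", "LOC", "O", "LOC"]
# Hand-written links; note the reordered region names.
links = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 5), (5, 4), (6, 6), (7, 7)]

swahili_labels = project_labels(english_labels, links, len(swahili))
swahili_nes = [tok for tok, lab in zip(swahili, swahili_labels) if lab != "O"]
```

Source tokens with no link simply keep the outside label, so alignment gaps lower recall rather than introduce spurious entities.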
14. Architecture: 4 - Post-Processing
Compile POS tagged English, Swahili and Arabic dictionaries.
Divide into Named Entity (NE) and Non-Named Entity
(NNE) sections. A word may occur in both sections.
1. If a word labeled as a Named Entity occurs in the NNE section
of a dictionary but not in its NE section, the label is removed.
2. If a word not labeled as a Named Entity occurs in the NE
section of a dictionary but not in its NNE section, the word is
labeled as a Named Entity.
Yields significant improvement in NER performance.
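
The two dictionary rules can be written down directly. In the sketch below the dictionary sections are plain sets of lowercased words and the toy entries are invented for the example; case-insensitive lookup is an assumption, not something the slides state.

```python
from typing import List, Set


def post_process(
    tokens: List[str],
    is_ne: List[bool],
    ne_section: Set[str],
    nne_section: Set[str],
) -> List[bool]:
    """Apply the two dictionary rules to per-token NE decisions."""
    corrected: List[bool] = []
    for token, labeled in zip(tokens, is_ne):
        in_ne = token.lower() in ne_section
        in_nne = token.lower() in nne_section
        if labeled and in_nne and not in_ne:
            labeled = False   # rule 1: known common word, drop the label
        elif not labeled and in_ne and not in_nne:
            labeled = True    # rule 2: known name only, add the label
        corrected.append(labeled)
    return corrected


tokens = ["Lundenga", "aliwataja", "mawakala", "Iringa"]
before = [True, False, True, False]
# Toy dictionary sections (hypothetical entries).
after = post_process(tokens, before, ne_section={"iringa"}, nne_section={"mawakala"})
```

Words found in both sections, or in neither, keep whatever label the alignment step gave them, which is why the rules only fire on unambiguous dictionary evidence.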
15. Results
Version          Precision  Recall  F1
Window w/o PP    0.676      0.694   0.685
Window with PP   0.818      0.704   0.757
GIZA++ w/o PP    0.534      0.900   0.670
GIZA++ with PP   0.754      0.886   0.815
Table: Results for Swahili NER (PP = post-processing)
System                     Precision  Recall  F1
SYNERGY Window w/o PP      0.680      0.502   0.578
SYNERGY Window with PP     0.848      0.600   0.703
SYNERGY GIZA++ w/o PP      0.530      0.702   0.603
SYNERGY GIZA++ with PP     0.761      0.817   0.788
Benajiba and Rosso, 2008   0.869      0.727   0.792
Table: Results for Arabic NER
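
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the snippet below recomputes the best Swahili and Arabic scores from the reported precision and recall; since those inputs are themselves rounded to three decimals, recomputed values for other rows can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


swahili_best = round(f1_score(0.754, 0.886), 3)  # GIZA++ with PP, Swahili
arabic_best = round(f1_score(0.761, 0.817), 3)   # GIZA++ with PP, Arabic
```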
16. Error Analysis
Three possible types of errors:
Named Entities that are lost during translation from the source
language to English.
Errors made by the English NER module of SYNERGY.
A correctly recognized English NE word that gets mapped to
the wrong word in the source language document during
alignment.
Analysis of distribution of SYNERGY’s errors requires NE gold
standard data for the English translations of our Swahili and
Arabic test sets in addition to native gold standard data.
Since no other known systems employ an MT-based approach
to NER, such data is not currently available.
17. Example - Swahili
True NEs in Bold, System output in Red.
Sample Swahili sentence: “Lundenga aliwataja mawakala
ambao wameshatuma maombi kuwa ni kutoka mikoa ya
Iringa Dodoma Mbeya Mwanza na Ruvuma.”
Translated English sentence: “Lundenga mentioned shatuma
agents who have a prayer from the regions of Dodoma Iringa
Mwanza Mbeya and Ruvuma.”
NE labeled English sentence: “Lundenga mentioned shatuma
agents who have a prayer from the regions of Dodoma Iringa
Mwanza Mbeya and Ruvuma.”
After GIZA++ alignment and post-processing: “Lundenga
aliwataja mawakala ambao wameshatuma maombi kuwa ni
kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na
Ruvuma.”
18. Observations
Initial motivation validated by SYNERGY’s results: Although
there are many inaccuracies in both the Swahili and Arabic
translations, a vast majority of NE words are preserved across
translation and successfully recognized by an English NER
system.
Performing English NER helps us avoid difficulties inherent in
native Swahili and Arabic NER, e.g. ambiguous function
words, recognizing clitics, etc.
19. Future work - Co-reference Resolution
Determining whether two entity mentions in a text refer to the
same entity in the real world.
Three types of entity mentions:
1. Named mentions (i.e. NEs) (e.g. Barack Obama)
2. Nominal mentions (e.g. President of the United States)
3. Pronominal mentions (e.g. he).
Poor results. Why?
Ans: Unlike Named Entities, nominal and pronominal
mentions are not preserved during MT.
Possible area of future research.
20. Conclusion
Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili.
SYNERGY’s F1 score for Arabic comes very close to the
state-of-the-art.
SYNERGY’s F1 score for Swahili is a good first effort in the
field of Swahili NER.
Freely available at http://www.cs.cmu.edu/~encore/. We
hope it will be a valuable tool for researchers wishing to work
with Swahili text.
Intend to use SYNERGY to perform NER for other
resource-scarce languages supported by online translators.
Related Talk: “ENCORE: Experiments with a Synthetic Entity
Co-reference Resolution Tool” by Bo Lin, LREC Workshop on
Resources and Evaluation for Entity Resolution and Entity
Management (W16), May 22, 2010.
21. And we’re done.
Thank You!
22. Example - Arabic
Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp
l>nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr
jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn
AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”
NE labeled English sentence: “Refer to this phenomenon because it
is the atmosphere with a team of President George W. Bush Rice
and Ambassador John Bolton at the United Nations which will
recognize a lot of elements that might make some Europeans to
deter the persistence of the United States decision to invest in the
Israeli aggression on Lebanon.”
After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l>nhA
t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn
bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr
Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”