SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

SYNERGY
A Named Entity Recognition System for Resource-scarce
Languages such as Swahili using Online Machine Translation

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking

May 18th, 2010

Language Technologies Institute
Carnegie Mellon University
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY

NER for Resource-scarce languages: Overview
Named Entity Recognition (NER) is the task of finding names
in text, and optionally, classifying them into persons,
organizations, locations, etc.

Example: Wolff, currently a journalist in Argentina, played
with Del Bosque in the final years of the seventies in Real
Madrid.

Vital first step for many problems such as parsing, co-reference
resolution, translation, indexing and semantic analysis.

State-of-art NER systems rely on machine learning.

Requires large amounts of labeled training data. Costly and
time-consuming.

Upshot: For many widely spoken languages e.g. Swahili, no
NER systems freely available.

Motivation

Online Machine Translation (MT) systems such as Google
Translate and Microsoft Bing Translator support many
resource-scarce languages for which NER systems don’t exist.

Google Translate supports Swahili ⇐⇒ English translation.

Quality of translation is far from perfect.

However, this might still be good enough for NER. Why?

Observation: Named Entities (NEs) are preserved during
Swahili ⇐⇒ English translation.


Motivation: Named Entity Preservation - Example

All Named Entities shown in Red:
Sample Swahili sentence: “Lundenga aliwataja mawakala
ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa
Dodoma Mbeya Mwanza na Ruvuma.”

English version from Google Translate: “Lundenga mentioned
shatuma agents who have a prayer from the regions of
Dodoma Iringa Mwanza Mbeya and Ruvuma.”

Sample English sentence: “President Obama advised German
Chancellor Angela Merkel.”

Swahili version from Google Translate: “Rais Obama
wanashauriwa Ujerumani Angela Merkel.”


SYNERGY: Big Picture

1. Swahili text is translated to English.
2. The best oﬀ-the-shelf NER systems are applied to the
resulting English text.
3. English NEs are mapped back to words in the Swahili text
using word alignment.
4. Post-Processing using dictionaries to improve performance

Observation: SYNERGY addresses the problem of NER for a
new language by breaking it into three relatively easier
problems.
Observation: No longer need labeled training data.

Comparison to existing NER systems

No Swahili NER system available, so how do we compare
SYNERGY and its MT-based approach?

Extend SYNERGY to perform NER for another language for
which freely available NER systems do exist.

We use Arabic, because it is typically considered a ‘hard’
language for which to do NER.


Resources - Machine Translation

Google Translate for Swahili and Arabic text. Microsoft Bing
Translator not yet ready for Arabic, doesn’t support Swahili.

Would like to use other systems, but for these languages, none
freely available.

Advantage: As soon as new languages become available on
any MT system, SYNERGY can be used with few changes.

Can also use MT systems automatically generated from
parallel data using freely available toolkits such as Moses.


Resources - Named Entities

Swahili labeled data: Helsinki Corpus of Swahili.

Arabic labeled data: ANERcorp.

NER: Stanford’s CRF based NER system and UIUC’s LBJ NE
Tagger. State-of-art in English NER.

Use only oﬀ-the-shelf NERs to show that a good NER system
can be developed for new languages quickly.

Post-Processing: Swahili-English dictionary by Kamusi Project
and the Linguistic Data Consortium’s Arabic TreeBank.


Architecture: 1 - Translation to English

Perl interface to Google Translate.

Translate each sentence individually.

Many sentences too long for Google Translate API, need to be
split. Named Entities unaﬀected by this.

Produces a translated English document for each Swahili
document.


Architecture: 2 - NER on Translated English Text

SYNERGY uses a combination of Stanford and UIUC NER
systems called Union.
Why? Observation: Stanford and UIUC NER systems don’t
always make the same mistakes.
Idea: Exploit this to improve performance.
System Precision Recall F1
Stanford 0.962 0.963 0.963
UIUC 0.968 0.964 0.966
Union 0.952 0.985 0.968

Table: Performance of NER systems on CoNLL 2003 Test Set

Architecture: 3 - Word Alignment

Alignment of English NEs with words in original Swahili
document to discover Swahili NEs.

Most challenging component of SYNERGY.

We try two diﬀerent algorithms:

1. Window Based Alignment.

2. GIZA++ Based Alignment.

Architecture: 3 - Word Alignment - Window Based

Initial Approach:
1. Translate each English NE word back to a Swahili word
2. Search in the original Swahili document for a match within a
window of words.

Produces very few matches.
Modiﬁcation:
1. Create a new English document by translating each Swahili
word one at a time
2. Search in this new document for English NE words within a
window of words.
3. Keep pointers from each English word in new document to
Swahili word that produces it.


Architecture: 3 - Word Alignment - GIZA++ Based

GIZA++ is state-of-art for word alignment in Machine
Translation.

1. Takes an input language corpus and an output language
corpus, and automatically generates word classes for both.

2. For each sentence in the input language corpus ﬁnds the most
probable alignment of the corresponding output language
sentence.

SYNERGY: The translated English documents serve as the
input corpus and the Swahili documents as the output corpus.


Architecture: 4 - Post-Processing

Compile POS tagged English, Swahili and Arabic dictionaries.
Divide into Named Entity (NE) and Non-Named Entity
(NNE) sections. A word may occur in both sections.
1. If a word labeled as a Named Entity occurs in the NNE section
of a dictionary but not in its NE section, the label is removed.
2. If a word not labeled as a Named Entity occurs in the NE
section of a dictionary but not in its NNE section, the word is
labeled as a Named Entity.
Yields signiﬁcant improvement in NER performance.

Results
Version Prec Recl F1
Window w/o PP 0.676 0.694 0.685
Window with PP 0.818 0.704 0.757
GIZA++ w/o PP 0.534 0.900 0.670
GIZA++ with PP 0.754 0.886 0.815
Table: Results for Swahili NER

System Prec Recl F1
SYNERGY Window w/o PP 0.680 0.502 0.578
SYNERGY Window with PP 0.848 0.600 0.703
SYNERGY GIZA++ w/o PP 0.530 0.702 0.603
SYNERGY GIZA++ with PP 0.761 0.817 0.788
Benajiba and Rosso, 2008 0.869 0.727 0.792
Table: Results for Arabic NER


Error Analysis

Three possible types of errors:
Named Entities that are lost during translation from the source
language to English.
Errors made by the English NER module of SYNERGY
A correctly recognized English NE word that gets mapped to
the wrong word in the source language document during
alignment

Analysis of distribution of SYNERGY’s errors requires NE gold
standard data for the English translations of our Swahili and
Arabic test sets in addition to native gold standard data.

Since no other known systems employ an MT based approach
to NER, such data is not currently available.


Example - Swahili
True NEs in Bold, System output in Red.
Sample Swahili sentence: “Lundenga aliwataja mawakala
ambao wameshatuma maombi kuwa ni kutoka mikoa ya
Iringa Dodoma Mbeya Mwanza na Ruvuma.”

Translated English sentence: “Lundenga mentioned shatuma
agents who have a prayer from the regions of Dodoma Iringa
Mwanza Mbeya and Ruvuma.”

NE labeled English sentence: “Lundenga mentioned shatuma
agents who have a prayer from the regions of Dodoma Iringa
Mwanza Mbeya and Ruvuma.”

After GIZA++ alignment and post-processing: “Lundenga
aliwataja mawakala ambao wameshatuma maombi kuwa ni
kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na
Ruvuma.”

Observations

Initial motivation validated by SYNERGY’s results: Although
there are many inaccuracies in both the Swahili and Arabic
translations, a vast majority of NE words are preserved across
translation and successfully recognized by an English NER
system.

Performing English NER helps us avoid diﬃculties inherent in
native Swahili and Arabic NER, e.g. ambiguous function
words, recognizing clitics, etc.


Future work - Co-reference Resolution

Determining if two entity mentions in a text refer to the same
entity in real world or not.
Three types of entity mentions:
1. Named mentions (i.e. NEs) (e.g. Barack Obama)
2. Nominal mentions (e.g. President of United States)
3. Pronominal mentions (e.g. he).

Poor Results. Why?

Ans: Unlike Named Entities, nominal and pronominal
mentions not preserved during MT.

Possible area of future Research.


Conclusion
Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili.
SYNERGY’s F1 score for Arabic comes very close to the
state-of-the-art.
SYNERGY’s F1 score for Swahili is a good first effort in the
field of Swahili NER.
Freely available at http://www.cs.cmu.edu/~encore/. We
hope it will be a valuable tool for researchers wishing to work
with Swahili text.
Intend to use SYNERGY to perform NER for other
resource-scarce languages supported by online translators.
Related Talk: “ENCORE: Experiments with a Synthetic Entity
Co-reference Resolution Tool” by Bo Lin, LREC Workshop on
Resources and Evaluation for Entity Resolution and Entity
Management (W16), May 22, 2010.

And we’re done.

Thank You!


Example - Arabic

Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp
l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr
jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn
AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”

NE labeled English sentence: “Refer to this phenomenon because it
is the atmosphere with a team of President George W. Bush Rice
and Ambassador John Bolton at the United Nations which will
recognize a lot of elements that might make some Europeans to
deter the persistence of the United States decision to invest in the
Israeli aggression on Lebanon.”

After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l¿nhA
t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn
bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr
Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”


SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

Similar to SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation (20)

More from Guy De Pauw

More from Guy De Pauw (20)

Recently uploaded

Recently uploaded (20)

SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation