SlideShare a Scribd company logo
1 of 22
Download to read offline
SYNERGY
             A Named Entity Recognition System for Resource-scarce
           Languages such as Swahili using Online Machine Translation


            Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking


                                              May 18th, 2010




                                 Language Technologies Institute
                                   Carnegie Mellon University
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
NER for Resource-scarce languages: Overview
              Named Entity Recognition (NER) is the task of finding names
              in text, and optionally, classifying them into persons,
              organizations, locations, etc.

              Example: Wolff, currently a journalist in Argentina, played
              with Del Bosque in the final years of the seventies in Real
              Madrid.

              Vital first step for many problems such as parsing, co-reference
              resolution, translation, indexing and semantic analysis.

              State-of-art NER systems rely on machine learning.

              Requires large amounts of labeled training data. Costly and
              time-consuming.

              Upshot: For many widely spoken languages e.g. Swahili, no
              NER systems freely available.
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Motivation


              Online Machine Translation (MT) systems such as Google
              Translate and Microsoft Bing Translator support many
              resource-scarce languages for which NER systems don’t exist.

              Google Translate supports Swahili ⇐⇒ English translation.

              Quality of translation is far from perfect.

              However, this might still be good enough for NER. Why?

              Observation: Named Entities (NEs) are preserved during
              Swahili ⇐⇒ English translation.


Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Motivation: Named Entity Preservation - Example

      All Named Entities shown in Red:
              Sample Swahili sentence: “Lundenga aliwataja mawakala
              ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa
              Dodoma Mbeya Mwanza na Ruvuma.”

              English version from Google Translate: “Lundenga mentioned
              shatuma agents who have a prayer from the regions of
              Dodoma Iringa Mwanza Mbeya and Ruvuma.”


              Sample English sentence: “President Obama advised German
              Chancellor Angela Merkel.”

              Swahili version from Google Translate: “Rais Obama
              wanashauriwa Ujerumani Angela Merkel.”


Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
SYNERGY: Big Picture




         1. Swahili text is translated to English.
         2. The best off-the-shelf NER systems are applied to the
            resulting English text.
         3. English NEs are mapped back to words in the Swahili text
            using word alignment.
         4. Post-Processing using dictionaries to improve performance

              Observation: SYNERGY addresses the problem of NER for a
              new language by breaking it into three relatively easier
              problems.
              Observation: No longer need labeled training data.
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Comparison to existing NER systems



              No Swahili NER system available, so how do we compare
              SYNERGY and its MT-based approach?


              Extend SYNERGY to perform NER for another language for
              which freely available NER systems do exist.


              We use Arabic, because it is typically considered a ‘hard’
              language for which to do NER.




Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Resources - Machine Translation


              Google Translate for Swahili and Arabic text. Microsoft Bing
              Translator not yet ready for Arabic, doesn’t support Swahili.

              Would like to use other systems, but for these languages, none
              freely available.

              Advantage: As soon as new languages become available on
              any MT system, SYNERGY can be used with few changes.

              Can also use MT systems automatically generated from
              parallel data using freely available toolkits such as Moses.




Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Resources - Named Entities


              Swahili labeled data: Helsinki Corpus of Swahili.

              Arabic labeled data: ANERcorp.

              NER: Stanford’s CRF based NER system and UIUC’s LBJ NE
              Tagger. State-of-art in English NER.

              Use only off-the-shelf NERs to show that a good NER system
              can be developed for new languages quickly.

              Post-Processing: Swahili-English dictionary by Kamusi Project
              and the Linguistic Data Consortium’s Arabic TreeBank.




Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 1 - Translation to English




              Perl interface to Google Translate.

              Translate each sentence individually.

              Many sentences too long for Google Translate API, need to be
              split. Named Entities unaffected by this.

              Produces a translated English document for each Swahili
              document.

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 2 - NER on Translated English Text




              SYNERGY uses a combination of Stanford and UIUC NER
              systems called Union.
              Why? Observation: Stanford and UIUC NER systems don’t
              always make the same mistakes.
              Idea: Exploit this to improve performance.
                                System         Precision    Recall    F1
                                Stanford        0.962       0.963    0.963
                                 UIUC           0.968       0.964    0.966
                                 Union          0.952       0.985    0.968

              Table: Performance of NER systems on CoNLL 2003 Test Set
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 3 - Word Alignment




              Alignment of English NEs with words in original Swahili
              document to discover Swahili NEs.

              Most challenging component of SYNERGY.

              We try two different algorithms:

                 1. Window Based Alignment.

                 2. GIZA++ Based Alignment.
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 3 - Word Alignment - Window Based

              Initial Approach:
                 1. Translate each English NE word back to a Swahili word
                 2. Search in the original Swahili document for a match within a
                    window of words.

              Produces very few matches.
              Modification:
                 1. Create a new English document by translating each Swahili
                    word one at a time
                 2. Search in this new document for English NE words within a
                    window of words.
                 3. Keep pointers from each English word in new document to
                    Swahili word that produces it.


Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 3 - Word Alignment - GIZA++ Based


              GIZA++ is state-of-art for word alignment in Machine
              Translation.

         1. Takes an input language corpus and an output language
            corpus, and automatically generates word classes for both.

         2. For each sentence in the input language corpus finds the most
            probable alignment of the corresponding output language
            sentence.

              SYNERGY: The translated English documents serve as the
              input corpus and the Swahili documents as the output corpus.



Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Architecture: 4 - Post-Processing




              Compile POS tagged English, Swahili and Arabic dictionaries.
              Divide into Named Entity (NE) and Non-Named Entity
              (NNE) sections. A word may occur in both sections.
         1. If a word labeled as a Named Entity occurs in the NNE section
            of a dictionary but not in its NE section, the label is removed.
         2. If a word not labeled as a Named Entity occurs in the NE
            section of a dictionary but not in its NNE section, the word is
            labeled as a Named Entity.
              Yields significant improvement in NER performance.
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Results
                             Version                       Prec      Recl       F1
                         Window w/o PP                     0.676     0.694     0.685
                         Window with PP                    0.818     0.704     0.757
                         GIZA++ w/o PP                     0.534     0.900     0.670
                         GIZA++ with PP                    0.754     0.886     0.815
                                    Table: Results for Swahili NER


                         System                                    Prec      Recl       F1
                SYNERGY Window w/o PP                              0.680     0.502     0.578
                SYNERGY Window with PP                             0.848     0.600     0.703
                SYNERGY GIZA++ w/o PP                              0.530     0.702     0.603
                SYNERGY GIZA++ with PP                             0.761     0.817     0.788
                 Benajiba and Rosso, 2008                          0.869     0.727     0.792
                                     Table: Results for Arabic NER

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking     SYNERGY
Error Analysis

              Three possible types of errors:
                     Named Entities that are lost during translation from the source
                     language to English.
                     Errors made by the English NER module of SYNERGY
                     A correctly recognized English NE word that gets mapped to
                     the wrong word in the source language document during
                     alignment

              Analysis of distribution of SYNERGY’s errors requires NE gold
              standard data for the English translations of our Swahili and
              Arabic test sets in addition to native gold standard data.

              Since no other known systems employ an MT based approach
              to NER, such data is not currently available.


Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Example - Swahili
      True NEs in Bold, System output in Red.
              Sample Swahili sentence: “Lundenga aliwataja mawakala
              ambao wameshatuma maombi kuwa ni kutoka mikoa ya
              Iringa Dodoma Mbeya Mwanza na Ruvuma.”

              Translated English sentence: “Lundenga mentioned shatuma
              agents who have a prayer from the regions of Dodoma Iringa
              Mwanza Mbeya and Ruvuma.”

              NE labeled English sentence: “Lundenga mentioned shatuma
              agents who have a prayer from the regions of Dodoma Iringa
              Mwanza Mbeya and Ruvuma.”

              After GIZA++ alignment and post-processing: “Lundenga
              aliwataja mawakala ambao wameshatuma maombi kuwa ni
              kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na
              Ruvuma.”
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Observations



              Initial motivation validated by SYNERGY’s results: Although
              there are many inaccuracies in both the Swahili and Arabic
              translations, a vast majority of NE words are preserved across
              translation and successfully recognized by an English NER
              system.


              Performing English NER helps us avoid difficulties inherent in
              native Swahili and Arabic NER, e.g. ambiguous function
              words, recognizing clitics, etc.




Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Future work - Co-reference Resolution


              Determining if two entity mentions in a text refer to the same
              entity in real world or not.
              Three types of entity mentions:
                 1. Named mentions (i.e. NEs) (e.g. Barack Obama)
                 2. Nominal mentions (e.g. President of United States)
                 3. Pronominal mentions (e.g. he).

              Poor Results. Why?

              Ans: Unlike Named Entities, nominal and pronominal
              mentions not preserved during MT.

              Possible area of future Research.



Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Conclusion
              Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili.
              SYNERGY’s F1 score for Arabic comes very close to the
              state-of-the-art.
              SYNERGY’s F1 score for Swahili is a good first effort in the
              field of Swahili NER.
              Freely available at http://www.cs.cmu.edu/~encore/. We
              hope it will be a valuable tool for researchers wishing to work
              with Swahili text.
              Intend to use SYNERGY to perform NER for other
              resource-scarce languages supported by online translators.
              Related Talk: “ENCORE: Experiments with a Synthetic Entity
              Co-reference Resolution Tool” by Bo Lin, LREC Workshop on
              Resources and Evaluation for Entity Resolution and Entity
              Management (W16), May 22, 2010.
Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
And we’re done.




                                       Thank You!




Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY
Example - Arabic

              Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp
              l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr
              jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn
              AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
              AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”

              NE labeled English sentence: “Refer to this phenomenon because it
              is the atmosphere with a team of President George W. Bush Rice
              and Ambassador John Bolton at the United Nations which will
              recognize a lot of elements that might make some Europeans to
              deter the persistence of the United States decision to invest in the
              Israeli aggression on Lebanon.”

              After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l¿nhA
              t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn
              bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr
              Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt
              AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.”

Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking   SYNERGY

More Related Content

Viewers also liked

Dictionary-based named entity recognition
Dictionary-based named entity recognitionDictionary-based named entity recognition
Dictionary-based named entity recognitionLars Juhl Jensen
 
A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
A Semi-Automatic Annotation Tool For Arabic Online Handwritten TextA Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
A Semi-Automatic Annotation Tool For Arabic Online Handwritten TextRanda Elanwar
 
Exploring Linked Data content through network analysis
Exploring Linked Data content through network analysisExploring Linked Data content through network analysis
Exploring Linked Data content through network analysisChristophe Guéret
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionYunyao Li
 
Linked Data: What’s the Story?
Linked Data: What’s the Story?Linked Data: What’s the Story?
Linked Data: What’s the Story?WiLS
 
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyA Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyTimm Heuss
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Olivier Grisel
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognitionDhwaj Raj
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataDave Lewis
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked DataEUCLID project
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsINRIA-OAK
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked DataEUCLID project
 
Enhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsEnhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsJulien PLU
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing Rajnish Raj
 

Viewers also liked (20)

Dictionary-based named entity recognition
Dictionary-based named entity recognitionDictionary-based named entity recognition
Dictionary-based named entity recognition
 
Named Entities
Named EntitiesNamed Entities
Named Entities
 
A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
A Semi-Automatic Annotation Tool For Arabic Online Handwritten TextA Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
A Semi-Automatic Annotation Tool For Arabic Online Handwritten Text
 
Exploring Linked Data content through network analysis
Exploring Linked Data content through network analysisExploring Linked Data content through network analysis
Exploring Linked Data content through network analysis
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity Detection
 
Linked Data: What’s the Story?
Linked Data: What’s the Story?Linked Data: What’s the Story?
Linked Data: What’s the Story?
 
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific VocabularyA Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
A Comparison of NER Tools w.r.t. a Domain-Specific Vocabulary
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
Universal Topic Classification - Named Entity Disambiguation (IKS Workshop Pa...
 
Multlingual Linked Data Patterns
Multlingual Linked Data PatternsMultlingual Linked Data Patterns
Multlingual Linked Data Patterns
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognition
 
Text mining
Text miningText mining
Text mining
 
RDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization dataRDF and other linked data standards — how to make use of big localization data
RDF and other linked data standards — how to make use of big localization data
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Discoverers of Surface Analysis
Discoverers of Surface AnalysisDiscoverers of Surface Analysis
Discoverers of Surface Analysis
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Enhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER ModelsEnhancing Entity Linking by Combining NER Models
Enhancing Entity Linking by Combining NER Models
 
Natural language procssing
Natural language procssing Natural language procssing
Natural language procssing
 
Recipes for PhD
Recipes for PhDRecipes for PhD
Recipes for PhD
 

Similar to SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...IJECEIAES
 
IRJET - Analysis on Code-Mixed Data for Movie Reviews
IRJET - Analysis on Code-Mixed Data for Movie ReviewsIRJET - Analysis on Code-Mixed Data for Movie Reviews
IRJET - Analysis on Code-Mixed Data for Movie ReviewsIRJET Journal
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech TaggingA Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech TaggingGuy De Pauw
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
 
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET Journal
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET Journal
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text MiningSushanti Acharya
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO RIILP
 
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...ijnlc
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 

Similar to SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation (20)

Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
 
IRJET - Analysis on Code-Mixed Data for Movie Reviews
IRJET - Analysis on Code-Mixed Data for Movie ReviewsIRJET - Analysis on Code-Mixed Data for Movie Reviews
IRJET - Analysis on Code-Mixed Data for Movie Reviews
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...
 
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech TaggingA Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
A Knowledge-Light Approach to Luo Machine Translation and Part-of-Speech Tagging
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text Mining
 
NLP
NLPNLP
NLP
 
NLP
NLPNLP
NLP
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
**JUNK** (no subject)
**JUNK** (no subject)**JUNK** (no subject)
**JUNK** (no subject)
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO
 
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
IMPLEMENTATION OF NLIZATION FRAMEWORK FOR VERBS, PRONOUNS AND DETERMINERS WIT...
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
Hi I am Ram.pptx
Hi I am Ram.pptxHi I am Ram.pptx
Hi I am Ram.pptx
 

More from Guy De Pauw

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Guy De Pauw
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Guy De Pauw
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingGuy De Pauw
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageGuy De Pauw
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguageGuy De Pauw
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)Guy De Pauw
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Guy De Pauw
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Guy De Pauw
 

More from Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News Corpus
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of Santome
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFST
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic Inflection
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken Irish
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 years
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound Analysers
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource Development
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá Characters
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription System
 
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
Bilingual Data Mining for the English-Amharic Statistical Machine Translation...
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 

SYNERGY - A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation

  • 1. SYNERGY A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking May 18th, 2010 Language Technologies Institute Carnegie Mellon University Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 2. NER for Resource-scarce languages: Overview Named Entity Recognition (NER) is the task of finding names in text, and optionally, classifying them into persons, organizations, locations, etc. Example: Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid. Vital first step for many problems such as parsing, co-reference resolution, translation, indexing and semantic analysis. State-of-art NER systems rely on machine learning. Requires large amounts of labeled training data. Costly and time-consuming. Upshot: For many widely spoken languages e.g. Swahili, no NER systems freely available. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 3. Motivation Online Machine Translation (MT) systems such as Google Translate and Microsoft Bing Translator support many resource-scarce languages for which NER systems don’t exist. Google Translate supports Swahili ⇐⇒ English translation. Quality of translation is far from perfect. However, this might still be good enough for NER. Why? Observation: Named Entities (NEs) are preserved during Swahili ⇐⇒ English translation. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 4. Motivation: Named Entity Preservation - Example All Named Entities shown in Red: Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” English version from Google Translate: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” Sample English sentence: “President Obama advised German Chancellor Angela Merkel.” Swahili version from Google Translate: “Rais Obama wanashauriwa Ujerumani Angela Merkel.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 5. SYNERGY: Big Picture 1. Swahili text is translated to English. 2. The best off-the-shelf NER systems are applied to the resulting English text. 3. English NEs are mapped back to words in the Swahili text using word alignment. 4. Post-Processing using dictionaries to improve performance Observation: SYNERGY addresses the problem of NER for a new language by breaking it into three relatively easier problems. Observation: No longer need labeled training data. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 6. Comparison to existing NER systems No Swahili NER system available, so how do we compare SYNERGY and its MT-based approach? Extend SYNERGY to perform NER for another language for which freely available NER systems do exist. We use Arabic, because it is typically considered a ‘hard’ language for which to do NER. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 7. Resources - Machine Translation Google Translate for Swahili and Arabic text. Microsoft Bing Translator not yet ready for Arabic, doesn’t support Swahili. Would like to use other systems, but for these languages, none freely available. Advantage: As soon as new languages become available on any MT system, SYNERGY can be used with few changes. Can also use MT systems automatically generated from parallel data using freely available toolkits such as Moses. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 8. Resources - Named Entities Swahili labeled data: Helsinki Corpus of Swahili. Arabic labeled data: ANERcorp. NER: Stanford’s CRF based NER system and UIUC’s LBJ NE Tagger. State-of-art in English NER. Use only off-the-shelf NERs to show that a good NER system can be developed for new languages quickly. Post-Processing: Swahili-English dictionary by Kamusi Project and the Linguistic Data Consortium’s Arabic TreeBank. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 9. Architecture: 1 - Translation to English Perl interface to Google Translate. Translate each sentence individually. Many sentences too long for Google Translate API, need to be split. Named Entities unaffected by this. Produces a translated English document for each Swahili document. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 10. Architecture: 2 - NER on Translated English Text SYNERGY uses a combination of Stanford and UIUC NER systems called Union. Why? Observation: Stanford and UIUC NER systems don’t always make the same mistakes. Idea: Exploit this to improve performance. System Precision Recall F1 Stanford 0.962 0.963 0.963 UIUC 0.968 0.964 0.966 Union 0.952 0.985 0.968 Table: Performance of NER systems on CoNLL 2003 Test Set Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 11. Architecture: 3 - Word Alignment Alignment of English NEs with words in original Swahili document to discover Swahili NEs. Most challenging component of SYNERGY. We try two different algorithms: 1. Window Based Alignment. 2. GIZA++ Based Alignment. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 12. Architecture: 3 - Word Alignment - Window Based Initial Approach: 1. Translate each English NE word back to a Swahili word 2. Search in the original Swahili document for a match within a window of words. Produces very few matches. Modification: 1. Create a new English document by translating each Swahili word one at a time 2. Search in this new document for English NE words within a window of words. 3. Keep pointers from each English word in new document to Swahili word that produces it. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 13. Architecture: 3 - Word Alignment - GIZA++ Based GIZA++ is state-of-art for word alignment in Machine Translation. 1. Takes an input language corpus and an output language corpus, and automatically generates word classes for both. 2. For each sentence in the input language corpus finds the most probable alignment of the corresponding output language sentence. SYNERGY: The translated English documents serve as the input corpus and the Swahili documents as the output corpus. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 14. Architecture: 4 - Post-Processing Compile POS tagged English, Swahili and Arabic dictionaries. Divide into Named Entity (NE) and Non-Named Entity (NNE) sections. A word may occur in both sections. 1. If a word labeled as a Named Entity occurs in the NNE section of a dictionary but not in its NE section, the label is removed. 2. If a word not labeled as a Named Entity occurs in the NE section of a dictionary but not in its NNE section, the word is labeled as a Named Entity. Yields significant improvement in NER performance. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 15. Results Version Prec Recl F1 Window w/o PP 0.676 0.694 0.685 Window with PP 0.818 0.704 0.757 GIZA++ w/o PP 0.534 0.900 0.670 GIZA++ with PP 0.754 0.886 0.815 Table: Results for Swahili NER System Prec Recl F1 SYNERGY Window w/o PP 0.680 0.502 0.578 SYNERGY Window with PP 0.848 0.600 0.703 SYNERGY GIZA++ w/o PP 0.530 0.702 0.603 SYNERGY GIZA++ with PP 0.761 0.817 0.788 Benajiba and Rosso, 2008 0.869 0.727 0.792 Table: Results for Arabic NER Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 16. Error Analysis Three possible types of errors: Named Entities that are lost during translation from the source language to English. Errors made by the English NER module of SYNERGY A correctly recognized English NE word that gets mapped to the wrong word in the source language document during alignment Analysis of distribution of SYNERGY’s errors requires NE gold standard data for the English translations of our Swahili and Arabic test sets in addition to native gold standard data. Since no other known systems employ an MT based approach to NER, such data is not currently available. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 17. Example - Swahili True NEs in Bold, System output in Red. Sample Swahili sentence: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” Translated English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” NE labeled English sentence: “Lundenga mentioned shatuma agents who have a prayer from the regions of Dodoma Iringa Mwanza Mbeya and Ruvuma.” After GIZA++ alignment and post-processing: “Lundenga aliwataja mawakala ambao wameshatuma maombi kuwa ni kutoka mikoa ya Iringa Dodoma Mbeya Mwanza na Ruvuma.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 18. Observations Initial motivation validated by SYNERGY’s results: Although there are many inaccuracies in both the Swahili and Arabic translations, a vast majority of NE words are preserved across translation and successfully recognized by an English NER system. Performing English NER helps us avoid difficulties inherent in native Swahili and Arabic NER, e.g. ambiguous function words, recognizing clitics, etc. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 19. Future work - Co-reference Resolution Determining if two entity mentions in a text refer to the same entity in real world or not. Three types of entity mentions: 1. Named mentions (i.e. NEs) (e.g. Barack Obama) 2. Nominal mentions (e.g. President of United States) 3. Pronominal mentions (e.g. he). Poor Results. Why? Ans: Unlike Named Entities, nominal and pronominal mentions not preserved during MT. Possible area of future Research. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 20. Conclusion Best-case F1 = 0.788 for Arabic and F1 = 0.815 for Swahili. SYNERGY’s F1 score for Arabic comes very close to the state-of-the-art. SYNERGY’s F1 score for Swahili is a good first effort in the field of Swahili NER. Freely available at http://www.cs.cmu.edu/~encore/. We hope it will be a valuable tool for researchers wishing to work with Swahili text. Intend to use SYNERGY to perform NER for other resource-scarce languages supported by online translators. Related Talk: “ENCORE: Experiments with a Synthetic Entity Co-reference Resolution Tool” by Bo Lin, LREC Workshop on Resources and Evaluation for Entity Resolution and Entity Management (W16), May 22, 2010. Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 21. And we’re done. Thank You! Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY
  • 22. Example - Arabic Arabic sentence (in Buckwalter encoding): “n$yr AlY h*h AlZAhrp l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.” NE labeled English sentence: “Refer to this phenomenon because it is the atmosphere with a team of President George W. Bush Rice and Ambassador John Bolton at the United Nations which will recognize a lot of elements that might make some Europeans to deter the persistence of the United States decision to invest in the Israeli aggression on Lebanon.” After GIZA++ and post-processing: “n$yr AlY h*h AlZAhrp l¿nhA t$kl Aljw Al*y yEml fyh fryq Alr}ys jwrj bw$ wrAys wAlsfyr jwn bwltwn fy AlAmm AlmtHdp , wAl*y symyz Alkvyr mn AlEnASr Alty qd yqdmhA bED AlAwrwbyyn lrdE tmAdy AlwlAyAt AlmtHdp fy AlAstvmAr bqrAr AlEdwAn AlAsrA}yly ElY lbnAn.” Rushin Shah, Bo Lin, Anatole Gershman, Robert Frederking SYNERGY