Multilingual Named Entity Recognition
           using Wikipedia
    Laboratory for Knowledge Discovery in Databases
   Department of Computing and Information Sciences
                 Kansas State University
     http://www.kddresearch.org/tikiwiki/tiki-index.php




              Presenter: Svitlana O. Volkova
                 Instructor: William Hsu
AGENDA

I.     Project Overview
II.    Crawling Wikipedia
III.   Synonymy Discovery with Google Sets
IV.    Experiment Design
V.     Conclusions
PROJECT MILESTONES

Input: Crawler Functionality
CRAWLING WIKIPEDIA
Output: Set of Multilingual Gazetteers


      Input: Initial Gazetteer in one Language
      RELATIONSHIP DISCOVERY WITH GOOGLESETS
      Output: Extended Gazetteer with Synonyms


             Input: Extended Gazetteer with Synonyms + Content
             MULTILINGUAL NER TASK
             Output: Extracted Entities from the Content
KEY IDEA - WIKIPEDIA
 Apply Wikipedia knowledge representation for
  multilingual information extraction
             English Wiki Concepts of Interest
      …, anthrax, bovine virus, …, camelpox, surra, …




             Source: http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis



           Russian Wiki Concepts of Interest
 …, Зоонозы, Классическая чума свиней, Лептоспироз, …
CRAWLING WIKIPEDIA



Multilingual NER
(article + category
 + interwiki links)


                      Wiki Category Graph and Article Graph
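
A minimal sketch of how article, category and interwiki links could be combined into per-language gazetteers. The in-memory dictionaries `CATEGORY_GRAPH` and `INTERWIKI` are hypothetical stand-ins for crawled data, and the function names are illustrative; the actual crawler is not described on the slides.

```python
from collections import deque

# Toy stand-ins for crawled Wikipedia data (hypothetical, for illustration only).
CATEGORY_GRAPH = {
    "Animal diseases": {"subcategories": ["Zoonoses"],
                        "articles": ["Camelpox", "Surra"]},
    "Zoonoses": {"subcategories": [],
                 "articles": ["Anthrax", "Leptospirosis"]},
}
INTERWIKI = {
    "Leptospirosis": {"ru": "Лептоспироз", "de": "Leptospirose"},
    "Anthrax": {"ru": "Сибирская язва"},
}


def collect_articles(root, graph):
    """Breadth-first walk over the category graph; returns article titles."""
    seen, queue, articles = {root}, deque([root]), []
    while queue:
        node = graph.get(queue.popleft(), {})
        for title in node.get("articles", []):
            if title not in articles:
                articles.append(title)
        for sub in node.get("subcategories", []):
            if sub not in seen:  # guard against cycles in the category graph
                seen.add(sub)
                queue.append(sub)
    return articles


def build_gazetteers(root, graph, interwiki, source_lang="en"):
    """Follow interwiki links from each article to fill one list per language."""
    gazetteers = {source_lang: []}
    for title in collect_articles(root, graph):
        gazetteers[source_lang].append(title)
        for lang, translated in interwiki.get(title, {}).items():
            gazetteers.setdefault(lang, []).append(translated)
    return gazetteers


gazetteers = build_gazetteers("Animal diseases", CATEGORY_GRAPH, INTERWIKI)
```

Languages with few interwiki links end up with short lists, which is one way the small gazetteer sizes reported next could arise.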
GAZETTEERS EXAMPLES IN DIFFERENT
           LANGUAGES
GAZETTEERS SIZE IN DIFFERENT
                 LANGUAGES


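[Chart: gazetteer size by language. Legend: English, Japanese, German, Russian. Plotted values: 19, 37, 86, 20 (value-to-language mapping not preserved).]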




Decision: the dictionaries are too small, so we need to find a way to
                             extend them.
GAZETTEERS EXAMPLES:
GERMAN GOOGLE SETS OUTPUT
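
A hedged sketch of the gazetteer extension step. The `expand` callable is a placeholder for whatever set-expansion service is used (Google Sets on these slides); its interface here is an assumption, and `fake_expand` with its lookup table is hypothetical test data, not real Google Sets output.

```python
def extend_gazetteer(gazetteer, expand):
    """Map each canonical name to its surface forms (the name plus expansions)."""
    extended = {}
    for canonical in gazetteer:
        forms = {canonical}
        for candidate in expand(canonical):
            candidate = candidate.strip()
            if candidate:  # drop empty or whitespace-only suggestions
                forms.add(candidate)
        extended[canonical] = sorted(forms, key=str.casefold)
    return extended


# Hypothetical lookup table used in place of a live set-expansion service.
FAKE_EXPANSIONS = {
    "Foot and mouth disease": ["FMD", "Hoof-and-mouth disease", " "],
}


def fake_expand(term):
    return FAKE_EXPANSIONS.get(term, [])


extended = extend_gazetteer(["Foot and mouth disease", "Surra"], fake_expand)
```

Keeping the canonical name as the dictionary key means every synonym or abbreviation can later be reported under one disease name.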
EXPERIMENT SET UP
 Purpose: to perform a named entity recognition task in a
  specific domain and report extraction accuracy using
  a) Wiki knowledge
  b) Extended lists with synonyms from Google Sets


 Hypothesis: the synonym extraction phase is essential
  for increasing the accuracy of the information extraction task
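
One way the comparison between the two gazetteers could be scored, assuming annotated gold spans are available. The slides say only "accuracy"; precision, recall and F1 over exact `(start, end)` spans are an assumption here, and the span values below are made-up illustration data, not experimental results.

```python
def precision_recall_f1(predicted, gold):
    """Score exact-match spans; each span is a (start, end) tuple."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative spans only (not measured results).
gold_spans = {(0, 21), (40, 42)}
wiki_only = {(0, 21)}                         # a) Wiki knowledge
with_synonyms = {(0, 21), (40, 42), (60, 65)}  # b) extended lists

scores_a = precision_recall_f1(wiki_only, gold_spans)
scores_b = precision_recall_f1(with_synonyms, gold_spans)
```

Reporting precision alongside recall matters for the hypothesis, since added synonyms can raise recall while also introducing false matches.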
DISEASE EXTRACTOR MODULE: INPUT AND OUTPUT
 Input: Text from file
 Module: Disease Extractor
 Output:
   Index of the first character
   Index of the last character
   Length of the matched text
   Matched text
   Canonical disease name
DISEASE EXTRACTION TASK
  Disease recognition can be treated as an NER/information
    extraction (IE) task
  The main purpose is to retrieve tokens that match at least one term, including
    synonyms and abbreviations, from the list of animal disease names
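
A minimal sketch of a gazetteer-based extractor that returns the five output fields listed for the module. It is not the module's actual implementation: the regex-based matching, the `Match` record and `build_extractor` are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int      # index of the first character
    end: int        # index of the last character (inclusive)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


def build_extractor(gazetteer):
    """gazetteer maps a canonical name to its surface forms (non-empty)."""
    lookup = {form.lower(): canonical
              for canonical, forms in gazetteer.items()
              for form in forms}
    # Longest alternatives first so multi-word names win over shorter ones.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, alternatives)) + r")(?!\w)",
        re.IGNORECASE,
    )

    def extract(text):
        # Note: characters with unusual case mappings would need extra
        # normalization before the lookup; this sketch ignores them.
        return [Match(m.start(), m.end() - 1, m.end() - m.start(),
                      m.group(0), lookup[m.group(0).lower()])
                for m in pattern.finditer(text)]

    return extract


extract = build_extractor({
    "Foot and mouth disease": ["Foot and mouth disease", "FMD"],
    "Leptospirosis": ["Leptospirosis", "Лептоспироз"],
})
matches = extract("Foot and mouth disease (FMD) is contagious.")
```

Because Python's `re` module treats `\w` as Unicode-aware for `str` patterns, the same extractor works on Cyrillic text such as the Russian example, provided the gazetteer contains the inflected form that appears in the text.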
CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
DUTCH
    Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog.
      Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
    Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než
      polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
    Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als
      die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
     Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei
      casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
     Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока.
      Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
     Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость
      высокая. Более половины случаев происходит в суровых и необходимости реанимации.
DISEASE EXTRACTOR MODULE DEMO
http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
RESULTS FOR DISEASE EXTRACTOR MODULE

       INPUT A                OUTPUT A
Foot and mouth disease is
one of the most contagious
diseases of cloven-hooved
mammals…

       INPUT B                OUTPUT B
Rift Valley Fever | CDC
Special Pathogens Branch
Mission Statement Disease …
CONCLUSIONS
 Applying Wikipedia knowledge to the multilingual NER task


 Phase 1: Crawling Wiki – completed
 Phase 2: Google Sets Expansion – completed
 Phase 3: Multilingual Disease Extraction – in progress


 Novelty: Overcome Wiki limitations by applying Google Sets
  expansion approach

 Estimating accuracy requires annotated data in each of the
  target languages
REFERENCES
   Zesch, T., & Gurevych, I. Analysis of the Wikipedia Category Graph for NLP
    Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007),
    pp. 1-8, April 2007.
    http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf

   Watanabe, Y., Asahara, M., & Matsumoto, Y. A Graph-Based Approach to Named
    Entity Categorization in Wikipedia Using Conditional Random Fields. In:
    Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
    Language Processing and Computational Natural Language Learning
    (EMNLP-CoNLL), pp. 649-657. http://www.aclweb.org/anthology/D/D07/D07-1068

   Manning, C., & Schütze, H. Foundations of Statistical Natural Language Processing.
    Cambridge, MA: MIT Press, 1999.
ACKNOWLEDGEMENTS

 Dr. William Hsu for meaningful guidance




 John Drouhard for building extraction architecture




 Landon Fowles for expanding gazetteers using Google Sets
