Multilingual NER Using Wiki
 

    Multilingual NER Using Wiki: Presentation Transcript

    • Multilingual Named Entity Recognition using Wikipedia. Laboratory for Knowledge Discovery in Databases, Department of Computing and Information Sciences, Kansas State University. http://www.kddresearch.org/tikiwiki/tiki-index.php Presenter: Svitlana O. Volkova. Instructor: William Hsu.
    • AGENDA: I. Project Overview; II. Crawling Wikipedia; III. Synonymy Discovery with Google Sets; IV. Experiment Design; V. Conclusions
    • PROJECT MILESTONES: (1) CRAWLING WIKIPEDIA. Input: crawler functionality. Output: set of multilingual gazetteers. (2) RELATIONSHIP DISCOVERY WITH GOOGLE SETS. Input: initial gazetteer in one language. Output: extended gazetteer with synonyms. (3) MULTILINGUAL NER TASK. Input: extended gazetteer with synonyms + content. Output: entities extracted from the content.
    • KEY IDEA - WIKIPEDIA: Apply Wikipedia knowledge representation for multilingual information extraction (http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis). English Wiki concepts of interest: …, anthrax, bovine virus, …, camelpox, surra, … Russian Wiki concepts of interest: …, Зоонозы (zoonoses), Классическая чума свиней (classical swine fever), Лептоспироз (leptospirosis), …
    • CRAWLING WIKIPEDIA: Multilingual NER (article + category + interwiki links); Wiki Category Graph and Article Graph
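The crawling step on this slide can be sketched roughly as a breadth-first walk over the category graph that collects article titles and follows interwiki links to build one gazetteer per language. This is only an illustration under assumptions: the in-memory `CATEGORY_GRAPH` and `INTERWIKI` dictionaries and the function name `crawl_gazetteers` are hypothetical stand-ins for real Wikipedia data, and the deck does not describe the crawler's actual implementation.

```python
from collections import deque

# Hypothetical toy snapshot: category -> (subcategories, article titles).
CATEGORY_GRAPH = {
    "Animal diseases": (["Zoonoses"], ["Anthrax", "Camelpox"]),
    "Zoonoses": ([], ["Leptospirosis", "Anthrax"]),
}

# Hypothetical interwiki links: English title -> {language code: local title}.
INTERWIKI = {
    "Leptospirosis": {"ru": "Лептоспироз", "de": "Leptospirose"},
    "Anthrax": {"de": "Milzbrand"},
}


def crawl_gazetteers(root, category_graph, interwiki, source_lang="en"):
    """Walk the category graph breadth-first and build per-language gazetteers."""
    gazetteers = {source_lang: set()}
    seen = {root}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        subcategories, articles = category_graph.get(category, ([], []))
        for title in articles:
            gazetteers[source_lang].add(title)
            # Follow interwiki links to collect the same concept in other languages.
            for lang, local_title in interwiki.get(title, {}).items():
                gazetteers.setdefault(lang, set()).add(local_title)
        for sub in subcategories:
            if sub not in seen:  # guard against cycles in the category graph
                seen.add(sub)
                queue.append(sub)
    return gazetteers


gazetteers = crawl_gazetteers("Animal diseases", CATEGORY_GRAPH, INTERWIKI)
```

In this toy data only two of the three English articles have interwiki links, so the non-English gazetteers come out smaller than the English one.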
    • GAZETTEERS EXAMPLES IN DIFFERENT LANGUAGES
    • GAZETTEERS SIZE IN DIFFERENT LANGUAGES (chart of gazetteer sizes for English, Japanese, German, and Russian; extracted values: 19, 37, 86, 20). Decision: the dictionaries are too small, so we need to find a way to extend them.
    • GAZETTEERS EXAMPLES: GERMAN GOOGLE SETS OUTPUT
    • EXPERIMENT SETUP. Purpose: to perform a named entity recognition task in a specific domain and report extraction accuracy using (a) Wiki knowledge and (b) extended lists with synonyms from Google Sets. Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task.
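One way to compare the two conditions, once annotated data exists, is to score each gazetteer's extracted spans against gold annotations. The sketch below assumes precision, recall, and F1 as the measures (the slides say only "accuracy"), and the spans and the function name `precision_recall_f1` are made up for illustration. No real results are implied.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted entity spans against gold spans using exact span match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical (start, end) character spans for one annotated document.
gold_spans = {(0, 21), (24, 26)}
wiki_only_spans = {(0, 21)}           # condition (a): Wiki gazetteer only
extended_spans = {(0, 21), (24, 26)}  # condition (b): gazetteer plus synonyms

wiki_scores = precision_recall_f1(wiki_only_spans, gold_spans)
extended_scores = precision_recall_f1(extended_spans, gold_spans)
```

With these made-up spans, the Wiki-only condition has precision 1.0 and recall 0.5, while the extended condition recovers both gold spans. This is the kind of recall gain the hypothesis predicts, but it is not evidence for it.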
    • DISEASE EXTRACTOR MODULE INPUT AND OUTPUT. Input: text from file. Output: index of the first character, index of the last character, length of the matched text, matched text, canonical disease name. Disease Extraction Task: The task of disease recognition can be considered an NER/information extraction (IE) task. The main purpose is to retrieve tokens that match at least one term (including synonyms and abbreviations) from the list of animal disease names.
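A minimal sketch of a gazetteer lookup that produces the five output fields listed on this slide is given below. It assumes a case-insensitive regular-expression match over a small hypothetical gazetteer with lowercase keys. The names `Match`, `GAZETTEER`, and `extract_diseases` are illustrative and do not come from the actual module.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int      # index of the first character
    end: int        # index of the last character (inclusive)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


# Hypothetical gazetteer: lowercase surface form -> canonical disease name.
GAZETTEER = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "fmd": "Foot-and-mouth disease",
    "rift valley fever": "Rift Valley fever",
    "rvf": "Rift Valley fever",
}


def extract_diseases(text, gazetteer=GAZETTEER):
    """Return every gazetteer term found in text, with offsets and canonical name."""
    # Longest terms first, so multi-word names win over shorter overlapping ones.
    terms = sorted(gazetteer, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    # Lookarounds keep short abbreviations from matching inside longer words.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    matches = []
    for m in pattern.finditer(text):
        surface = m.group(0)
        matches.append(
            Match(m.start(), m.end() - 1, len(surface), surface, gazetteer[surface.lower()])
        )
    return matches


results = extract_diseases("Foot and mouth disease (FMD) is contagious.")
```

Here `results` holds two matches, the full name at characters 0 to 21 and the abbreviation at characters 24 to 26, and both map to the same canonical name.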
    • CONTEXT EXAMPLES IN DIFFERENT LANGUAGES. DUTCH: Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog. Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie. CZECH: Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než polovina případů se vyskytuje v těžké a vyžaduje resuscitaci. GERMAN: Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als die Hälfte der Fälle tritt in schweren und Reanimation erforderlich. ITALIAN: Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei casi si verifica in rianimazione grave e richiesti. UKRAINIAN: Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока. Більше половини випадків відбувається в суворих і необхідність реанімації. RUSSIAN: Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость высокая. Более половины случаев происходит в суровых и необходимости реанимации. ENGLISH (approximate gloss of each passage): Leptospirosis occurs in all countries except the Arctic. The incidence is high. More than half of the cases are severe and require resuscitation.
    • DISEASE EXTRACTOR MODULE DEMO http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
    • RESULTS FOR DISEASE EXTRACTOR MODULE. INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals… INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease … (OUTPUT A and OUTPUT B are not preserved in this transcript.)
    • CONCLUSIONS: Applying Wikipedia knowledge to the multilingual NER task. Phase 1: Crawling Wiki (completed). Phase 2: Google Sets Expansion (completed). Phase 3: Multilingual Disease Extraction (in progress). Novelty: overcome Wiki limitations by applying the Google Sets expansion approach. To estimate accuracy, we need annotated data in different languages.
    • REFERENCES: [1] Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1-8, April 2007. http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf [2] Yotaro Watanabe, Masayuki Asahara, and Yuji Matsumoto. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649-657. http://www.aclweb.org/anthology/D/D07/D07-1068 [3] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
    • ACKNOWLEDGEMENTS: Dr. William Hsu for meaningful guidance; John Drouhard for building the extraction architecture; Landon Fowles for expanding gazetteers using Google Sets.