Multimodal Information Extraction: Disease, Date and Location Retrieval
Published in: Education
  • 1. Multimodal Information Extraction: Disease, DateTime, and Location Retrieval. Laboratory for Knowledge Discovery in Databases, Department of Computing and Information Sciences, Kansas State University. Dr. William H. Hsu, Associate Professor of Computing and Information Sciences; Svitlana O. Volkova, Graduate Research Assistant; Timothy E. Weninger, Research Associate; Jing Xia, Graduate Research Assistant; Surya Teja Kallumadi, Graduate Research Assistant; Wesam S. Elshamy, Graduate Research Assistant
  • 2. AGENDA: Overview; Document Extraction; Document Level Analysis: Entity Recognition Task; Disease Extractor Module: Disease Recognition Task & Future Improvements; Temporal Tagging; Date/Time Extractor Module: Date Recognition Task; Future Improvements for Date/Time Extractor Module; Spatial Tagging; Location Extractor Module: Location Recognition Task; Future Improvements for Location Extractor Module; Event Classification Task; Events Representation by Date/Time: Timeline View; Events Representation by Location: Map View
  • 3. MAIN STEPS: Assist the integrator (Elder Research, Inc.) in incorporating these into a single system; Perform collection-level analysis and interactive visualization of timelines, maps; Extend the basic document-level IE, temporal annotation, and spatial annotation components with more state-of-the-field analytical functions
  • 4. HOW CAN WE GET DATA? Information Retrieval (IR) from Web by crawling news, blogs, reports, etc. [Diagram labels: WWW; EMAIL; CRAWLER; DB QUERY; LITERATURE; DOCUMENTS COLLECTION]
  • 5. DOCUMENTS COLLECTION. DOMAIN SPECIFIC KNOWLEDGE: medical ontology, containing names of diseases, viruses, animal species etc., organized in a conceptual hierarchy. DOMAIN INDEPENDENT KNOWLEDGE: location hierarchy, containing names of countries, states or provinces, cities, etc.; canonical date and time representation.
  • 6. A TWO-LEVEL ANALYTICAL FRAMEWORK IN THE DOMAIN OF EPIZOOTICS. Document Level Analysis: Web document content extraction; Named entity recognition (NER); Co-reference & association resolution, relation extraction; Geotagging: location extraction, map view; Temporal tagging: date/time extraction, timeline view; Event Identification <…what, where, when, …>. Collection Level Analysis: Semi-supervised Document Clustering & Linking by Finding Similarities by Keywords; Document Categorization as Topics Summarization Task (pLSA, LDA).
  • 7. HIGH LEVEL SYSTEM’S ARCHITECTURE [Diagram labels: Data Search; User Access and Query Control API (Java); Temporal Tagging: TimeLine View; Spatial Tagging: Map View; Access Privilege; Internet Browser (IE/Mozilla/…); Event Detection; Deduplication; Data Store (MSSQL); Web Server; Data Storage; IAAC Server; Researchers, public health professionals, and governmental health agencies, other users]
  • 8. DOCUMENT LEVEL ANALYSIS Entity Recognition Task
  • 9. EXTENSION OF ENTITIES FOR MULTIMODAL INFORMATION EXTRACTION SYSTEM. Stanford NER Entities: Person (e.g. “John Lenin”, “William K. Smith”); Organization (e.g. “U.K. Department for Environment, Food and Rural Affairs”); Location (e.g. “Europe”, “Canada”); Miscellaneous (e.g. “African”, researcher etc.). KDD Group’s NER Entities: Animal diseases (e.g. “rift valley fever”, “fmd”); Date and time (e.g. “May 24 2001”, “last year”); Location (e.g. “London, Great Britain”, “Manhattan, KS, USA”); Animal Species (e.g. “cow”, “horse”, “mammals”); Quantities (e.g. # of animals died, amount of money spent, $)
  • 10. INFORMATION EXTRACTION TASK Goal: Extract structured information with facts and entities related to events from unstructured/semistructured sources. Result: The US saw its latest FMD outbreak in Montebello, California in 1929 where 3,600 animals were slaughtered. [Diagram labels: DOCUMENTS COLLECTION; Animal Disease Names; Locations; Dates/Times; Quantities]
  • 11. NAMED ENTITIES REPRESENTATION FOR NER TASK. Disease: Multi-Faceted Quantitative Summary; Location: Map View; Date and time: Timeline View. Timeline View Example: lineedition/en/timeline.html Map View Example:
  • 12. DISEASE EXTRACTOR MODULE INPUT AND OUTPUT. Input: Text from file. Module: Disease Extractor Module. Output: Index of the first character; Index of the last character; Length of the matched text; Matched Text; Canonical disease name. Disease Extraction Task: The task of disease recognition can be considered as an NER/information extraction (IE) task. The main purpose is to retrieve tokens that match at least one term from the list of disease names.
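The input/output contract on this slide can be illustrated with a minimal dictionary-matching sketch. This is not the KDD group's implementation: the `VOCABULARY` entries, the `DiseaseMatch` type, and the function names are assumptions made up for illustration, and matching is done with a plain case-insensitive regular expression.

```python
import re
from typing import NamedTuple


class DiseaseMatch(NamedTuple):
    start: int       # index of the first character
    end: int         # index of the last character (inclusive)
    length: int      # length of the matched text
    text: str        # matched text as it appears in the input
    canonical: str   # canonical disease name


# Hypothetical mini-vocabulary: surface form -> canonical name.
VOCABULARY = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "foot-and-mouth disease": "Foot-and-mouth disease",
    "fmd": "Foot-and-mouth disease",
    "rift valley fever": "Rift Valley fever",
    "rvf": "Rift Valley fever",
}


def build_pattern(vocabulary):
    # Longest terms first so multi-word names win over shorter alternatives.
    terms = sorted(vocabulary, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


def extract_diseases(text, vocabulary=VOCABULARY):
    pattern = build_pattern(vocabulary)
    matches = []
    for m in pattern.finditer(text):
        matches.append(DiseaseMatch(
            start=m.start(),
            end=m.end() - 1,
            length=m.end() - m.start(),
            text=m.group(0),
            canonical=vocabulary[m.group(0).lower()],
        ))
    return matches
```

Calling `extract_diseases` on the sentence from slide 10 would return a single record for "FMD" mapped to its canonical name; a real vocabulary would be built from the sources listed on slide 15.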
  • 14. RESULTS FOR DISEASE EXTRACTOR MODULE. INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals… OUTPUT A: [figure]. INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease … OUTPUT B: [figure]
  • 15. VOCABULARY CONSTRUCTION FOR DISEASE EXTRACTOR 1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH):  2. World Organisation for Animal Health (OIE) Animal Disease Data:  3. Department for Environment, Food and Rural Affairs, UK (DEFRA):  4. United States Department of Agriculture (USDA), Animal and Plant Health Inspection Service  5. MedlinePlus, Service of National Library of Medicine and National Institutes of Health  6. Wikipedia 
  • 16. RESULTS FOR DISEASE EXTRACTOR MODULE ClearForest Gnosis Software:
  • 17. COMPARATIVE RESULTS FOR DISEASE EXTRACTORS: KDD GROUP’S VS. GNOSIS [Three charts comparing Gnosis Soft. and KDD Group's Disease Extractor: Disease Extraction "FMD" and Disease Extraction "RVF" (y-axis: Quantities of Extracted Diseases; x-axis: Number of seeds); Non-unique Animal Disease Extraction (y-axis: Non-unique Extracted Diseases; x-axis: Number of seeds)]
  • 18. COMPARATIVE RESULTS FOR UNIQUE DISEASE EXTRACTORS: KDD GROUP’S VS. GNOSIS [Two charts comparing Gnosis Soft. and KDD Group's Disease Extractor: Unique Disease Extraction (y-axis: Extracted Unique Diseases; x-axis: Number of seeds); Random Permutation of Extracted Diseases (y-axis: # of Extracted Animal Diseases; x-axis: Run number)]
  • 19. CUMULATIVE COMPARATIVE RESULTS FOR DISEASE EXTRACTORS: KDD GROUP’S VS. GNOSIS [Three charts: Cumulative Disease Extraction (y-axis: # of Extracted Animal Disease; x-axis: Number of seeds; series: Gnosis Soft. and KDD Group's Disease Extractor, each with a polynomial trend line; fitted curves shown: y = 2.7283x² + 14.914x - 4.4336, R² = 0.9762 and y = 4.1708x² - 29.864x + 48.364, R² = 0.9831); KDD Group's Extractor: Results and Gnosis Software: Extraction Results (y-axis: # of unique extracted disease; x-axis: # of seeds' permutations)]
  • 20. FUTURE IMPROVEMENTS FOR DISEASE EXTRACTOR MODULE. Intermediate Functionality: to add functionality for species extraction and construct a vocabulary; to enrich the dictionary with animal diseases by species: National Center for Infectious Diseases:  United States Department of Agriculture (APHIS), Animal Health:  to construct a disease ontology with Protégé software. Advanced Functionality: to apply a “seed set expansion” approach to improve disease extraction.
  • 21. DOCUMENT LEVEL ANALYSIS Temporal Tagging
  • 22. DATE/TIME EXTRACTOR AND EVENT TAGGER MODULE INPUT AND OUTPUT. Input: Text from file. Module: Date Time Extractor. Output: Disease Name; Event Trigger; Location; Canonical date/time. Temporal Extraction and Events Tagging Task: The main purpose is extracting temporal quantities associated with events from text, identifying events and the semantic relatedness of events, and summarizing them. Extraction of temporal events involves identifying dates and times and the entities associated with these events.
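The "canonical date/time" output can be illustrated with a small sketch that handles the two example forms from slide 9 ("May 24 2001" and "last year"). It is an assumption-laden illustration, not the module described here: the regular expressions, the `extract_dates` name, and the choice of ISO 8601 strings as the canonical form are all hypothetical, and relative expressions are resolved against a caller-supplied reference date.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
MONTH_RE = "|".join(MONTHS)

# Absolute dates such as "May 24 2001" or "May 24, 2001".
ABSOLUTE = re.compile(
    rf"\b(?P<month>{MONTH_RE})\s+(?P<day>\d{{1,2}}),?\s+(?P<year>\d{{4}})\b",
    re.IGNORECASE)
# Relative expressions such as "last year".
RELATIVE = re.compile(r"\b(?P<which>last|this|next)\s+year\b", re.IGNORECASE)
YEAR_OFFSET = {"last": -1, "this": 0, "next": 1}


def extract_dates(text, reference):
    """Return (matched text, canonical form) pairs; absolute dates first."""
    results = []
    for m in ABSOLUTE.finditer(text):
        try:
            parsed = date(int(m["year"]), MONTHS[m["month"].lower()], int(m["day"]))
        except ValueError:
            continue  # skip impossible dates such as "June 31 2001"
        results.append((m.group(0), parsed.isoformat()))
    for m in RELATIVE.finditer(text):
        year = reference.year + YEAR_OFFSET[m["which"].lower()]
        results.append((m.group(0), str(year)))
    return results
```

A production tagger would cover many more surface forms (numeric dates, weekdays, "yesterday"); the point of the sketch is only the mapping from a surface expression to one normalized representation.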
  • 23. COMPONENTS OF DATE/TIME EXTRACTOR AND EVENT TAGGER MODULE. Date/Time Extractor: It is based on quantities and units’ chunker; Standard Time data structure. Pattern-Based Event Extractor: It is built through analysis of the reports of disease outbreak, e.g. “a report has been confirmed that …”. Named Entity Recognition Tool: It extracts Named Entities: Location, Person, Organization and Disease. Goal: Extracting facts and entity relations associated with events. Disease outbreaks: disease, organisms, victim, symptoms, location, country, date, containment measures …
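A pattern-based event extractor of the kind named on this slide could be sketched as a small set of trigger patterns. The two patterns below are assumptions written for illustration (they are not taken from the presentation); the first is modeled on the slide's example phrase “a report has been confirmed that …”.

```python
import re

# Hypothetical trigger patterns for outbreak reports.
TRIGGER_PATTERNS = [
    # e.g. "report has been confirmed", "cases were reported"
    re.compile(
        r"\b(?:report|case|outbreak)s?\s+(?:has|have|was|were)\s+"
        r"(?:been\s+)?(?:confirmed|reported|detected)\b",
        re.IGNORECASE),
    # e.g. "confirmed 48 cases"
    re.compile(r"\b(?:confirmed|reported)\s+\d+\s+cases?\b", re.IGNORECASE),
]


def find_triggers(sentence):
    """Return every trigger phrase found in the sentence, pattern by pattern."""
    return [m.group(0)
            for pattern in TRIGGER_PATTERNS
            for m in pattern.finditer(sentence)]
```

Sentences that yield at least one trigger would then be passed on to the disease, date/time, and location extractors to fill in the event's attributes.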
  • 25. EVENT REPRESENTATION BY DATE/TIME: TIMELINE VIEW Advanced functionality of the Date/Time Extractor Module includes mapping events onto a timeline. A representative example can be found on EMM News Explorer:
  • 26. FUTURE IMPROVEMENTS FOR DATE/TIME EXTRACTOR MODULE Intermediate Functionality  to implement event extraction as event tuple <what[Disease], where[Location], when[DateTime]> by individual entities that were obtained from Disease, Temporal and Spatial Extraction Modules in Basic Phase. Advanced Functionality  spatiotemporal clustering, extraction of qualitative and quantitative details about events from documents, and relationship extraction among events;  to integrate information extraction and information visualization components.
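The event tuple <what[Disease], where[Location], when[DateTime]> can be sketched as a small record assembled from the outputs of the three extraction modules. The `Event` class and `build_event` helper are hypothetical names, and taking the first mention from each module is a deliberately naive assumption; missing attributes are left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    what: Optional[str] = None    # disease
    where: Optional[str] = None   # location
    when: Optional[str] = None    # canonical date/time

    def is_complete(self) -> bool:
        return None not in (self.what, self.where, self.when)


def _first(items: List[str]) -> Optional[str]:
    return items[0] if items else None


def build_event(diseases: List[str], locations: List[str],
                datetimes: List[str]) -> Event:
    # Naive assumption: the first mention of each type belongs to the event.
    return Event(_first(diseases), _first(locations), _first(datetimes))
```

For the sentence on slide 10, the per-module outputs would combine into `Event(what="FMD", where="Montebello, California", when="1929")`.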
  • 27. DOCUMENT LEVEL ANALYSIS Spatial Tagging
  • 28. LOCATION EXTRACTOR MODULE INPUT AND OUTPUT. Resource: NGA GEOnet Names Server (GNS). Input: Text from file. Module: Location Extractor Module. Output: Matched text (location); Location’s latitude; Location’s longitude; Location’s radius. Location Extraction Task: The goal is to extract and tag geographical location mentions in the given text as part of the multimodal event extraction application. Locations extracted from the given text are presented to the user with their geographical latitude and longitude coordinates.
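The output fields listed here (matched text, latitude, longitude, radius) can be illustrated with a tiny gazetteer lookup. This sketch does not query the GEOnet Names Server: the in-memory `GAZETTEER`, its approximate coordinates, and the radius values are illustrative assumptions standing in for a real gazetteer.

```python
import re
from typing import NamedTuple


class LocationMatch(NamedTuple):
    text: str
    latitude: float
    longitude: float
    radius_km: float


# Hypothetical gazetteer: lower-cased name -> (lat, lon, radius in km).
# Coordinates are approximate; radii are made-up placeholders.
GAZETTEER = {
    "topeka": (39.05, -95.68, 10.0),
    "wichita": (37.69, -97.34, 15.0),
    "leavenworth": (39.31, -94.92, 5.0),
}


def extract_locations(text, gazetteer=GAZETTEER):
    # Longest names first so longer entries win over shorter alternatives.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE)
    return [LocationMatch(m.group(0), *gazetteer[m.group(0).lower()])
            for m in pattern.finditer(text)]
```

Ambiguous names (several places called "Leavenworth", for example) are not handled; the slides list disambiguation as a later-stage task.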
  • 29. RESULTS FOR LOCATION EXTRACTOR MODULE. INPUT: A third case of Foot-and-Mouth Disease in Kansas was reported yesterday in a small farm North East of Topeka. Roger Pride, who owns the farm where foot-and-mouth was discovered, said the financial hardship of losing his cattle was not as devastating as the impact on his reputation. It is to be noted that the previous two cases were reported earlier this month in Wichita and Leavenworth. OUTPUT: [figure]
  • 30. FUTURE IMPROVEMENTS FOR LOCATION EXTRACTOR MODULE. Intermediate Functionality: to improve on the results obtained in the basic phase by filtering out outliers, deduplicating, and possibly clustering them. Advanced Functionality: to consider implicit spatial relationships and independent observations, which would add richness to the data presented to the user and help in detecting patterns among them.
  • 31. EVENT REPRESENTATION BY LOCATION: MAP VIEW Advanced functionality of the Location Extractor Module includes the geotagging task, that is, mapping events that were extracted from different resources. A representative example can be found on
  • 32. DOCUMENT LEVEL ANALYSIS Event Classification/Identification Task
  • 33. ESSENTIAL TASKS FOR EVENT TRACKING  Automatic population of large databases with factual information from many text sources  Rapid semantic processing of large volumes of unstructured text  Automatic merging of facts and entity relationships across sets of documents  Innovative techniques for extracting, summarizing and tracking information about events and their progressions over time from unstructured text  Identification of events and outbreaks includes constituent tasks of date, time, and quantity extraction and timeline visualization, while geospatial IE includes location (in latter stages) disambiguation and map view visualization.
  • 34. EVENT FORMAL REPRESENTATION  An event is an occurrence of a disease within a particular time and space range, so the attributes of a single event are: specific disease, date and time, and location:  Event examples with missing values:
  • 35. ADDITIONAL ASPECTS OF EVENT/OUTBREAK  Outbreak Status - confirmed  Date of event’s report - 12.18.2007  Reported source -  Suffered species - cattle  Morbidity/Mortality - 155 infected/12 died  Damage measure, $ - $155,000  Standard features for event identification: <disease, location, date/time…> + <…person, organization,…> + <…, length of sentence, quantities, temporal/spatial terms occurrences…>
  • 36. OUTBREAK FORMAL REPRESENTATION  Outbreak is a collection of events that are connected by some disease that happened within restricted space and time:  For outbreak identification events should be similar in temporal features: time overlap and similar in spatial features: space overlap
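The outbreak definition above (same disease, overlap in time, overlap in space) can be sketched as a greedy grouping of events. Everything specific here is an assumption for illustration: the `Event` fields, the 300 km and 30 day thresholds, the use of great-circle (haversine) distance as the spatial test, and the single-pass grouping, which does not merge two outbreaks that a later event would bridge.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    disease: str
    lat: float
    lon: float
    day: date


def haversine_km(a, b):
    """Great-circle distance between two events, in kilometres."""
    radius = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def related(a, b, max_km, max_days):
    return (a.disease == b.disease
            and abs((a.day - b.day).days) <= max_days
            and haversine_km(a, b) <= max_km)


def group_outbreaks(events, max_km=300.0, max_days=30):
    """Assign each event (in date order) to the first outbreak it is related to."""
    outbreaks = []
    for ev in sorted(events, key=lambda e: e.day):
        for outbreak in outbreaks:
            if any(related(ev, other, max_km, max_days) for other in outbreak):
                outbreak.append(ev)
                break
        else:
            outbreaks.append([ev])
    return outbreaks
```

With these thresholds, two foot-and-mouth events reported a week apart in Wichita and Topeka would fall into one outbreak, while a different disease at the same place, or the same disease months later, would start a new one.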
  • 37. DATA FLOW FOR EVENT IDENTIFICATION BASED ON SENTENCES CLASSIFICATION OUTBREAK Disease: foot-and-mouth disease Species: hog Location: Taiwan DateTime: 06/09/2009 Status: N/A
  • 38. NLP TASKS. Foot-and-mouth disease[DIS] killed 15 hog on farm in Taiwan[LOC]. Syntactic Analysis: Foot-and-mouth disease [SUBJ] killed [VP] 15 hog on farm in Taiwan [PP]. Extraction: Fact: killed; Disease: foot-and-mouth disease; Location: Taiwan; Species: hog; Quantity: 15. Co-reference Resolution: Foot-and-mouth disease killed 15 hog on farm in Taiwan. Outbreak was reported on 9 June. Template Generation: Event: outbreak; Species: 15 hog; Disease: foot-and-mouth disease; Location: Taiwan; DateTime: 9 June
  • 39. Demo: SEMANTIC ROLE LABELING TASK: EXAMPLE 1 Outbreak as event identification task can be considered as Semantic Role Labeling Task (SRL) - who did what to whom, when, where, why, …
  • 40. SEMANTIC ROLE LABELING TASK: EXAMPLE 2 Ecuador[LOC] - The Ecuadorian government[ORG] on Tuesday[DT] confirmed 48[QT] cases of foot-and-mouth disease[DIS] in domestic animals, which prompted neighboring Colombia[LOC] and Peru[LOC] to take preventive measures on their meat imports
  • 41. Thank you for your attention!