WiML Poster
Published in: Education, Technology
Transcript

  • 1. Named Entity Annotation and Tagging in the Domain of Epizootics

    Svitlana Volkova, William Hsu, Doina Caragea
    K-State Laboratory for Knowledge Discovery in Databases (KDD), Kansas State University, Department of Computing and Information Sciences, Manhattan, KS 66506

    OVERVIEW

    We present an information extraction (IE) application in the domain of animal diseases. Previously, such tasks were performed only for data related to human diseases. In contrast, our task is directly related to web crawling for retrieving information about infectious animal diseases.

    [Diagram: system overview. Inputs: WWW (official reports about animal disease outbreaks, surveillance networks, disease descriptions, fact sheets etc.), EMAIL, LITERATURE, QUERY. Components: CRAWLER, DB, Document Collection.]

    DOMAIN SPECIFIC KNOWLEDGE: medical ontology containing names of diseases, viruses, animal species etc., organized in a conceptual hierarchy.
    DOMAIN INDEPENDENT KNOWLEDGE: location hierarchy containing names of countries, states or provinces, cities, etc.; canonical date/time representation.

    Features:
    List look-up features: flexible pattern match.
    Document level features: keyword appearance within a predefined window.
    Word level morphological features.

    INFORMATION EXTRACTION IN THE DOMAIN OF EPIZOOTICS

    The IE task in the domain of epizootics can be defined as the automatic extraction of structured information related to animal diseases from unstructured web documents with different content. The task involves developing several modules for tagging specific entities, such as animal disease names, species, vaccines, serotypes etc., at the document level within a crawled collection of documents (the animal disease document collection).

    Goal: to extract structured information with facts and entities related to events from unstructured or semistructured sources.

    CLASSIFICATION-BASED NAMED ENTITY RECOGNITION

    Named Entity Recognition (NER) is a subtask of IE which seeks to locate and classify atomic elements in text into predefined categories, such as:
     disease names (e.g. “foot and mouth disease”);
     viruses (e.g. “picornavirus”) and serotypes (e.g. “Asia-1”);
     species and their quantities (e.g. “sheep”, “pigs”);
     locations where an outbreak happened (e.g. “United Kingdom”, “eastern provinces of Shandong and Jiangsu, China”; different levels of granularity);
     dates in different formats, including special cases (e.g. “last Tuesday”, “two month ago”);
     organizations that report outbreaks (e.g. “DEFRA”, “CDC”).

    NLP TASKS

    Named Entity Recognition: Foot-and-mouth disease[DIS] killed 15 hog on farm in Taiwan[LOC]
    Syntactic Analysis: Foot-and-mouth disease [SUBJ] killed[VP] 15 hog on farm in Taiwan [PP]
    Fact Extraction: Fact: killed; Disease: foot-and-mouth disease; Location: Taiwan; Species: hog; Quantity: 15
    Co-reference Resolution: Foot-and-mouth disease killed 15 hog on farm in Taiwan. Outbreak was reported on 9 June.
    Template Generation: Event: outbreak; Species: 15 hog; Disease: foot-and-mouth disease; Location: Taiwan; Date/Time: 9 June

    GAZETTEER COLLECTION AND ONTOLOGY CONSTRUCTION

    The main purpose of IE using a gazetteer is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the known animal disease names. We collect prior domain specific knowledge and, as a result, construct an ontology of animal disease concepts. The extraction technique is based on a pattern matching approach. The gazetteer is semi-automatically collected from official web portals. Using the initial gazetteer, we enriched the ontology with latent synonymic and causal relations between related concepts.

    Sources:
    1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH): http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm
    2. World Organisation for Animal Health (OIE) Animal Disease Data: http://www.oie.int/eng/maladies/en_alpha.htm
    3. Department for Environment, Food and Rural Affairs, UK (DEFRA): http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm
    4. United States Department of Agriculture (USDA), Animal and Plant Health Inspection Service: http://www.aphis.usda.gov/animal_health/animal_diseases/
    5. Medline Plus, Service of National Library of Medicine and National Institutes of Health: http://www.nlm.nih.gov/medlineplus/animaldiseasesandyourhealth.html
    6. Wikipedia: http://en.wikipedia.org/wiki/Animal_diseases

    RELATION DISCOVERY BETWEEN CONCEPTS

    Synonymy (“is a kind of” relation, e.g. “Swine influenza” is a kind of “Swine fever”).
    Example A: “Diseases such as Foot and Mouth Disease, Bovine TB or Johne’s Disease have far-reaching potential for major economic impact on cattle producers”.

    Causal links (“is caused by”, e.g. “Ovine epididymitis is caused by Brucella ovis”).
    Example F: “Bluetongue virus (BTV), a member of Orbivirus genus within the Reoviridae family causes Bluetongue disease in livestock (sheep, goat, cattle)”.

    [Figure: graph of related ontology concepts. Node labels: Q fever, Coxiella burnetii, C. burnetii, Baylisascariasis, Baylisascaris procyonis, B. procyonis, B. melis, B. transfuga, Dipylidium infection, Tapeworm.]

    DICTIONARY LOOKUP METHOD FOR DISEASE EXTRACTION

    Disease Extractor Module
    Input: text from file; gazetteer.
    Output: index of the first/last character; matched text and length; canonical disease names; associated synonyms/abbreviations; non-unique/unique diseases.

    Example: The US saw its latest FMD outbreak in Montebello, California in 1929 where 3,600 animals were slaughtered.
    Entity types marked in the example: Animal Disease Names, Locations, Dates/Times, Quantities.

    THE EFFECT OF THE ONTOLOGY SIZE AND QUALITY ON THE ACCURACY OF DISEASE EXTRACTION

    The data set is sampled from crawled animal disease sources in which the number of occurrences of disease named entities is above a predefined threshold. All animal diseases were manually annotated within the data set for future cross-validation.

    In the first experiment, the baseline run uses dictionary look-up with and without the capitalization feature (1a, 1b). The next runs add only synonyms (2a, 2b) and only abbreviations (3a, 3b), respectively. The last run (4a, 4b) combines all of the above features.

    In the second experiment, we divide the data into training and test sets. Using the training examples, we learn the model for animal disease name extraction by discovering relations between concepts; we report accuracy on the test data set.

    In the third experiment, we compare our approach of learning relations between concepts with the Google Sets method. We report results in terms of precision, recall and F-measure. We build learning curves for both methods in order to show the influence of the ontology size and quality on the accuracy of the extracted results.

    Runs:
    1a - using only initial gazetteer w/o capitalization
    1b - using initial gazetteer + capitalization
    2a - initial gazetteer + only synonyms w/o capitalization
    2b - initial gazetteer + only synonyms with capitalization
    3a - initial gazetteer + only abbreviations w/o capitalization
    3b - initial gazetteer + only abbreviations with capitalization
    4a - initial gazetteer + synonyms + abbreviations w/o capitalization
    4b - initial gazetteer + synonyms + abbreviations with capitalization

    Dictionary Look-Up: Number of Instances (max. 429)
    Run        1a     1b     2a     2b     3a     3b     4a     4b
    Accuracy   0.885  0.920  0.886  0.896  0.887  0.922  0.889  0.933

    Method A: Number of Training Instances
    Instances  429    773    955    1159   1287   1442   1561   1590   1619   1682
    Accuracy   0.964  0.929  0.927  0.925  0.964  0.929  0.927  0.925  0.964  0.929

    Method B: Number of Training Instances
    Instances  429    754    925    1118   1238   1385   1497   1524   1552   1611
    Accuracy   0.962  0.961  0.864  0.862  0.962  0.961  0.864  0.862  0.962  0.961

    [Chart: “FMD” Extraction Performance, Averaged over 50 web-sites. Run 1: 3.60%; Run 2 - only capitalization: 14.00%; Run 3 - only abbreviations + synonyms: 84.36%.]
    [Chart: “RVF” Extraction Performance, Averaged over 50 web-sites. Run 1: 0.20%; Run 2 - only capitalization: 38.02%; Run 3 - only abbreviations + synonyms: 57.52%.]
    [Figure: Learning Curve for Method A Accuracy (Relation Discovery within Training Data). Axes: Number of Ontology Concepts, Accuracy.]
    [Figure: Learning Curve for Method B Accuracy (Relation Discovery using Google Sets). Axes: Number of Ontology Concepts, Accuracy.]
    [Figure: F-Measure and Precision/Recall for Method A and Method B.]
    [Figure: Precision, Recall, F-measure over Runs 1 to 10 for Method A, Method B and Dictionary Look-Up.]
    [Figure: Precision, Recall and F-Measure per run (1a to 4b).]
    [Figure: Recall Range per run (1a to 4b) against Document number.]

    FUTURE WORK

    The animal disease extraction task is a prerequisite for more advanced content analysis of the unstructured documents within corpora. The next step is therefore to design an NER-driven system for extracting structured tuples that describe animal disease-related events. The approach extends the shared NER task of identifying persons, organizations, and locations with not only disease names but also the constituent entities and attributes of these event tuples. These include dates and times, quantities with relevant units, and geo-referenced locations. A primary overall objective of the IE task is to support timeline and map-based visualization of events.

    ACKNOWLEDGEMENTS

    This work is supported through a grant from the U.S. Department of Defense. A collaborative program on IE with faculty at the University of Illinois at Urbana-Champaign (ChengXiang Zhai, Dan Roth, Jiawei Han, and Kevin Chang), the 2009 Data Sciences Summer Institute (DSSI) on Multimodal Information Access and Synthesis (MIAS), was made possible through the support of DHS/ONR.

    We appreciate effective discussions with Dr. Chris Callison-Burch, Dr. Mark Dredze and Dr. Jason Eisner from the Center for Language and Speech Processing, Johns Hopkins University; Tim Weninger, Research Fellow, UIUC; and John Drouhard and Landon Fowles (KDD Lab, IE Team) for assistance with experiments.

    KANSAS STATE UNIVERSITY KNOWLEDGE DISCOVERY IN DATABASES LABORATORY
    NATIONAL AGRICULTURAL BIOSECURITY CENTER @ K-STATE
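
    The dictionary look-up extractor described on the poster (input: text and a gazetteer; output: first/last character index, matched text, canonical disease name) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tiny `GAZETTEER`, the `Match` record, the function names, and the use of a case-insensitive flag as a stand-in for the poster's "capitalization" feature are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-gazetteer (assumption): canonical name -> known variants.
GAZETTEER = {
    "foot-and-mouth disease": {
        "synonyms": ["foot and mouth disease"],
        "abbreviations": ["FMD"],
    },
    "Rift Valley fever": {"synonyms": [], "abbreviations": ["RVF"]},
}


@dataclass
class Match:
    start: int      # index of the first matched character
    end: int        # index of the last matched character (inclusive)
    text: str       # matched text as it appears in the document
    canonical: str  # canonical disease name from the gazetteer


def build_terms(gazetteer, use_synonyms=True, use_abbreviations=True):
    """Map every surface form that should be searched to its canonical name."""
    terms = {}
    for canonical, entry in gazetteer.items():
        terms[canonical] = canonical
        if use_synonyms:
            for synonym in entry["synonyms"]:
                terms[synonym] = canonical
        if use_abbreviations:
            for abbreviation in entry["abbreviations"]:
                terms[abbreviation] = canonical
    return terms


def extract_diseases(text, gazetteer, use_synonyms=True,
                     use_abbreviations=True, ignore_case=True):
    """Return all gazetteer matches in `text`, ordered by position."""
    terms = build_terms(gazetteer, use_synonyms, use_abbreviations)
    flags = re.IGNORECASE if ignore_case else 0
    matches = []
    for term, canonical in terms.items():
        # Word boundaries keep "FMD" from matching inside a longer token.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags):
            matches.append(Match(m.start(), m.end() - 1, m.group(0), canonical))
    return sorted(matches, key=lambda match: match.start)


if __name__ == "__main__":
    sentence = ("The US saw its latest FMD outbreak in Montebello, "
                "California in 1929 where 3,600 animals were slaughtered.")
    for match in extract_diseases(sentence, GAZETTEER):
        print(match)
```

    Turning the `use_synonyms`, `use_abbreviations` and `ignore_case` switches on and off loosely mirrors the run variants 1a to 4b, which is why they are kept as separate parameters.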

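
    The two relation types named on the poster ("is a kind of" and "is caused by") can be illustrated with hand-written surface patterns applied to the poster's own example sentences. This is a rough sketch only: the poster says relations are learned from training data or obtained with Google Sets, so the fixed regular expressions, the relation labels and the `find_relations` helper below are hypothetical stand-ins, not the authors' method.

```python
import re

# A capitalized phrase such as "Foot and Mouth Disease" or "Bovine TB"
# (assumption: list items are written with capitalized words).
ITEM = r"[A-Z][\w'’]*(?: (?:and )?[A-Z][\w'’]*)*"

# "<Category> such as A, B or C": each listed item is a kind of <Category>.
SUCH_AS = re.compile(rf"(\w+) such as ({ITEM}(?:(?:, | or | and ){ITEM})*)")

# "<Disease> is caused by <Agent>".
CAUSED_BY = re.compile(r"([A-Z][\w'’ -]*?) is caused by ([A-Z][\w'’ -]*\w)")


def find_relations(sentence):
    """Return (left, relation, right) triples found in one sentence."""
    relations = []
    listing = SUCH_AS.search(sentence)
    if listing:
        category = listing.group(1)
        for item in re.findall(ITEM, listing.group(2)):
            relations.append((item, "is_a_kind_of", category))
    causal = CAUSED_BY.search(sentence)
    if causal:
        relations.append((causal.group(1), "is_caused_by", causal.group(2)))
    return relations


if __name__ == "__main__":
    examples = [
        "Diseases such as Foot and Mouth Disease, Bovine TB or Johne's "
        "Disease have far-reaching potential for major economic impact "
        "on cattle producers",
        "Ovine epididymitis is caused by Brucella ovis",
    ]
    for example in examples:
        for triple in find_relations(example):
            print(triple)
```

    Fixed patterns like these are brittle (for instance, a causal agent followed by more text would be over-captured), which is one reason a learned approach of the kind the poster evaluates may be preferable.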