A System for the Automatic Comparison of Machine and Human Geocoded Documents

Paper presented at GIR'08, Napa, CA

Published in: Health & Medicine, Travel

  1. A System for the Automatic Comparison of Machine and Human Geocoded Documents
     Ian Turton, GeoVISTA Center, Pennsylvania State University
  2. ACKNOWLEDGMENTS
       • This work is supported, in part, by the National Visualization and Analytics Center, a U.S. Department of Homeland Security program operated by the Pacific Northwest National Laboratory (PNNL). PNNL is a U.S. Department of Energy Office of Science laboratory.
  3. Introduction
       • Document geocoding
           • Take a large collection of documents
           • Extract named entities and geocode place names
       • How well is our algorithm doing?
       • Comparison to known standard
       • Too many documents to read
  4. Case Study
       • PubMed abstracts
       • Avian Flu
       • 4500+ abstracts
       • ~2000 refer to a geographic location
       • MeSH tags are provided by human indexers
  5. MeSH – Geographic Areas
  6. Health GeoJunction
  7. Document Footprint
  8. Acquire Data
       • First extract abstracts from PubMed
       • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
       • ((avian OR bird) AND (influenza OR flu)) OR H5N1
       • Returns a structured XML file with citation data and abstract for selected papers.
       • Process XML into PostGIS database
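
The acquisition step can be sketched as two E-utilities requests: a search that returns matching PubMed IDs, then a fetch that returns the citation XML. This is a minimal sketch, not the code used for the paper. The `esearch.fcgi` and `efetch.fcgi` endpoints and their `db`, `term`, `retmax`, `id` and `retmode` parameters are part of NCBI E-utilities; the function names and the `retmax` value are assumptions made for illustration. The sketch only builds the request URLs and does not contact the service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL and query string as given on the slide.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"


def esearch_url(term, retmax=5000):
    """Build an ESearch URL that asks PubMed for IDs matching `term`.

    retmax=5000 is an assumed limit, chosen only to exceed the
    4500+ abstracts mentioned in the case study.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """Build an EFetch URL that returns citation XML for the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    print(esearch_url(QUERY))
    print(efetch_url(["12345", "67890"]))  # placeholder IDs, not real results
```

The XML returned by the fetch request would then be parsed and loaded into the PostGIS database; that step is omitted here because the slides do not describe the schema.
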
  9. Extract Geographic Entities
       • Use FactXtractor (http://julian.mine.nu/snedemo.html)
       • Uses GATE to detect and extract Named Entities and Entity Relationships
       • Usually finds People, Places and Organizations
       • Returned as an OWL encoded ontology
       • In this case we just make use of places
  10. GeoLocation
       • Converting a place name into a location
       • State College, PA -> (40.7934, -77.86)
       • Call the GeoNames web service to carry out a gazetteer lookup on the name.
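
A gazetteer lookup of this kind can be sketched against the GeoNames JSON search endpoint. This is a minimal sketch under stated assumptions: the `searchJSON` endpoint and its `q`, `maxRows` and `username` parameters belong to the GeoNames web service, but the function names are hypothetical, taking the first hit is a simplification, and `SAMPLE_RESPONSE` is a hand-written stand-in for a real reply that reuses the coordinates from the slide.

```python
import json
from urllib.parse import urlencode

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def geonames_url(place, username, max_rows=5):
    """Build a GeoNames search URL for a place name.

    `username` is the GeoNames account name the service requires.
    """
    params = {"q": place, "maxRows": max_rows, "username": username}
    return GEONAMES_SEARCH + "?" + urlencode(params)


def first_location(response_text):
    """Return (lat, lng) of the first hit, or None if there are no hits.

    Taking the first hit is an assumption; it ignores ambiguity entirely.
    """
    hits = json.loads(response_text).get("geonames", [])
    if not hits:
        return None
    top = hits[0]
    return (float(top["lat"]), float(top["lng"]))


# Hand-written sample in the shape of a GeoNames reply (not real output).
SAMPLE_RESPONSE = json.dumps({
    "geonames": [
        {"name": "State College", "countryCode": "US",
         "lat": "40.7934", "lng": "-77.86"}
    ]
})

if __name__ == "__main__":
    print(geonames_url("State College, PA", username="demo"))
    print(first_location(SAMPLE_RESPONSE))
```
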
  11. Disambiguation
       • Which London did you mean?
  12. Types of Ambiguity
       • Geo/Geo
           • London, UK vs London, Ontario
           • South Wales, UK vs New South Wales, Au
           • Paris, France vs Paris, Texas
       • Geo/Non Geo
           • Washington, DC vs George Washington
           • Van, Turkey vs delivery van
           • West Nile, Egypt vs West Nile Virus
  13. Geocoded papers
  14. % Correctness by Country
  15. Obvious Problems
       • In MeSH – Missing in FactXtractor
           • Great Britain, England
           • European Union
           • Czechoslovakia, USSR
       • Over represented in FactXtractor
           • Hong Kong
           • Thailand
           • Pennsylvania
       • Under represented in FactXtractor
           • Australia
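
The three categories above fall out of a per-place count comparison between the human (MeSH) tags and the machine output. The sketch below assumes each source is available as a mapping from document ID to a set of place names; that representation, the function names, and the toy data are all illustrative assumptions, and the slides do not say how the percentage correctness figure was actually computed.

```python
from collections import Counter


def place_counts(doc_places):
    """Count how many documents mention each place (once per document)."""
    counts = Counter()
    for places in doc_places.values():
        counts.update(set(places))
    return counts


def compare(mesh, machine):
    """Classify each place by comparing human and machine document counts.

    Returns {place: (mesh_count, machine_count, status)} where status is
    "missing", "over", "under" or "match" from the machine's point of view.
    """
    mesh_n, machine_n = place_counts(mesh), place_counts(machine)
    report = {}
    for place in sorted(set(mesh_n) | set(machine_n)):
        h, m = mesh_n[place], machine_n[place]
        if m == 0:
            status = "missing"
        elif m > h:
            status = "over"
        elif m < h:
            status = "under"
        else:
            status = "match"
        report[place] = (h, m, status)
    return report


# Toy data, invented for illustration only.
MESH = {"d1": {"Great Britain"}, "d2": {"Thailand"},
        "d3": {"Australia"}, "d4": {"Australia"}}
MACHINE = {"d1": {"United Kingdom"}, "d2": {"Thailand", "Pennsylvania"},
           "d3": {"Australia"}, "d4": {"Thailand"}}

if __name__ == "__main__":
    for place, row in compare(MESH, MACHINE).items():
        print(place, row)
```

In the toy data, Great Britain shows up as missing only because the machine side calls it United Kingdom, which is the gazetteer mismatch the next slide discusses.
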
  16. Why?
       • Mismatches between gazetteers
           • European Union not in gazetteer
           • Great Britain -> United Kingdom
       • Virus names that contain locations
           • avian A/Mallard/Pennsylvania/10218/84 (H5N2)
       • Named Entity Extractor error
           • The extractor treated “New” as a stop word when it was next to a direction, so New South Wales became South Wales.
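
Two of these causes lend themselves to simple pre- and post-processing. The sketch below is one possible mitigation, not something the slides describe: it masks influenza strain names before entity extraction so that the embedded location is not geocoded, and maps a machine place name onto the name the standard uses. The regular expression is an assumed approximation of strain nomenclature that covers the slide's example but not every real strain name, and the alias table contains only the one pair given on the slide.

```python
import re

# Approximate pattern for strain names such as A/Mallard/Pennsylvania/10218/84:
# a type letter, two or three slash-separated fields, then a 2-4 digit year.
# This is an illustrative assumption, not a complete grammar.
STRAIN = re.compile(r"\b[ABC]/(?:[^/\s()]+/){2,3}\d{2,4}\b")

# Name pair taken from the slide; a real table would need many more entries.
ALIASES = {"United Kingdom": "Great Britain"}


def mask_strains(text):
    """Replace strain names with a space so their locations are not extracted."""
    return STRAIN.sub(" ", text)


def to_standard_name(place):
    """Map a machine gazetteer name onto the name used by the standard."""
    return ALIASES.get(place, place)


if __name__ == "__main__":
    sentence = ("The isolate avian A/Mallard/Pennsylvania/10218/84 (H5N2) "
                "was compared with samples from Thailand.")
    print(mask_strains(sentence))
    print(to_standard_name("United Kingdom"))
```
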
  17. Other interesting errors
       • De expanded to Delaware
       • La expanded to Louisiana
       • (every time, e.g. Rio de Janeiro)
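
The slides do not say how the abbreviation expansion was implemented, so the following is only a sketch of one guard that would avoid this class of error: expand a two-letter code only when it is written in upper case and follows a comma, as in "State College, PA". The abbreviation table, the pattern, and the function name are assumptions for illustration.

```python
import re

# Small illustrative subset of US state abbreviations.
STATE_ABBREVIATIONS = {"DE": "Delaware", "LA": "Louisiana", "PA": "Pennsylvania"}

# A comma, optional whitespace, then exactly two upper-case letters.
ABBREVIATION = re.compile(r",\s*([A-Z]{2})\b")


def expand_state(text):
    """Expand ', XX' to ', <state name>' when XX is a known abbreviation."""
    def replace(match):
        full_name = STATE_ABBREVIATIONS.get(match.group(1))
        # Leave unknown codes (for example ', UK') exactly as they were.
        return ", " + full_name if full_name else match.group(0)
    return ABBREVIATION.sub(replace, text)


if __name__ == "__main__":
    for name in ["State College, PA", "Rio de Janeiro", "De la Cruz", "London, UK"]:
        print(name, "->", expand_state(name))
```

Requiring both the comma and the upper-case form is deliberately conservative: it leaves particles such as "de" and "la" alone, at the cost of missing abbreviations written without a comma.
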
  18. Conclusions
       • MeSH provides a useful “Gold Standard” for verifying geocoding of PubMed documents.
       • Take care that your standard and geocoder use the same gazetteer.
       • Defining the footprint of a document is not as easy as it first appears.
       • Some geocoder bugs are subtle and hard to spot.
