Geographic Information Retrieval From Disparate Data Sources


Published in: Technology, Health & Medicine

  1. Geographic Information Retrieval from Disparate Data Sources
     Ian Turton, Anuj Jaiswal, Mark Gahegan
     GeoVISTA Center, School of Geography, Pennsylvania State University
     ijt1,arj135,
  2. Summary
     - Information Retrieval?
     - Geographic?
     - Disparate Data Sources?
     - Does it work?
     - Semantics and Ontologies, do they help?
     - Further work?
     - Conclusions
  3. Information Retrieval
     - Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.
     - Wikipedia
  4. Or more simply
     - Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?
  5. Geography
     - Well, we all know that geography is important.
     - Depending on who you ask, more than 80% of all information contains a geographic element.
     - Explicit:
       - Has a map coordinate
     - Implicit:
       - Has a place name
  6. Disparate Data Sources
     - Large collections of text containing implicit geographic references about Avian Flu and Measles:
       - PubMed abstracts
       - News Feeds (RSS)
       - WHO incident reports
  7. Building the System
     - Acquire data
     - Extract geographic information
     - Extract semantic and ontological information
     - Present in a form that allows easy exploration by users
  8. Acquire Data
     - First extract abstracts from PubMed using the query:
       ((avian OR bird) AND (influenza OR flu)) OR H5N1
     - This returns a structured XML file with citation data and the abstract for each selected paper.
     - Process the XML into a PostGIS database.
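The XML-processing step might look roughly like the sketch below. It assumes the commonly seen PubMed export layout (`PubmedArticle`, `MedlineCitation/PMID`, `Article/ArticleTitle`, `AbstractText`); the sample record is invented for illustration, and the load into PostGIS is left out.

```python
import xml.etree.ElementTree as ET

# Invented sample record; the element names are an assumption about the export format.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0000001</PMID>
      <Article>
        <ArticleTitle>Example title about H5N1</ArticleTitle>
        <Abstract>
          <AbstractText>Outbreaks were reported in Vietnam.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_pubmed(xml_text):
    """Return one dict per article holding its id, title and abstract text."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        title = article.findtext("MedlineCitation/Article/ArticleTitle", default="")
        # An abstract may be split into several AbstractText sections.
        parts = ["".join(node.itertext()).strip() for node in article.iter("AbstractText")]
        records.append({"pmid": pmid, "title": title, "abstract": " ".join(parts)})
    return records


records = parse_pubmed(SAMPLE_XML)
```

Each dict in `records` corresponds to one row that would be written to the article table, keyed by `pmid`.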
  9. Extract Geographic Entities
     - Use FactXtractor
     - Uses GATE to detect and extract Named Entities and Entity Relationships
     - Usually finds People, Places and Organizations
     - Returned as an OWL encoded ontology
     - In this case we just make use of places
  10. Example OWL-encoded output
      <rdf:RDF xml:base="...">
        <owl:Class rdf:ID="Location"/>
        <owl:Class rdf:ID="Organization"/>
        <owl:Class rdf:ID="Person"/>
        <owl:DatatypeProperty rdf:ID="counts"/>
        <Location rdf:ID="Africa">
          <counts>1</counts>
          <mentioned_in>
            <_Article rdf:ID="InputString0">
            </_Article>
          </mentioned_in>
        </Location>
        <Location rdf:ID="Asia">
          <counts>1</counts>
          <mentioned_in rdf:resource="#InputString0"/>
        </Location>
        <Location rdf:ID="Vietnam"/>
        <Location rdf:ID="South_East"/>
        <Location rdf:ID="Europe">
          <counts>1</counts>
          <mentioned_in rdf:resource="#InputString0"/>
        </Location>
      </rdf:RDF>
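Pulling just the places out of output like this could be sketched as follows. The RDF and OWL namespace URIs are the standard W3C ones; the default namespace `EX_NS` is hypothetical, since the base URI is missing from the slide, and turning underscores into spaces is an assumption about how multi-word names were encoded.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL_NS = "http://www.w3.org/2002/07/owl#"
# Hypothetical default namespace; the real base URI does not survive in the slide.
EX_NS = "http://example.org/facts#"

SAMPLE_OWL = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:owl="{OWL_NS}" xmlns="{EX_NS}">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
</rdf:RDF>
"""


def extract_places(owl_text):
    """Map each Location name to its mention count (None when no count is given)."""
    root = ET.fromstring(owl_text)
    places = {}
    for loc in root.iter(f"{{{EX_NS}}}Location"):
        # Assumption: underscores in rdf:ID stand for spaces in the place name.
        name = loc.get(f"{{{RDF_NS}}}ID").replace("_", " ")
        counts = loc.findtext(f"{{{EX_NS}}}counts")
        places[name] = int(counts) if counts is not None else None
    return places


places = extract_places(SAMPLE_OWL)
```

The `owl:Class` declaration named "Location" is skipped because only elements whose tag is `Location` are visited.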
  11. GeoLocation
      - Converting a place name into a location
      - State College, PA -> (40.7934, -77.86)
      - Call the GeoNames web service to carry out a gazetteer lookup on the name.
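A gazetteer lookup of this kind could be sketched as below. The endpoint (`searchJSON`), the query parameters (`q`, `maxRows`, `username`) and the response fields (`geonames`, `name`, `lat`, `lng`) reflect the public GeoNames JSON search service as generally documented, but treat them as assumptions here; `demo_user` is a placeholder account name, and the response is canned so that the example makes no network call.

```python
import json
from urllib.parse import urlencode

# Assumed GeoNames JSON search endpoint.
GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_search_url(place_name, username, max_rows=10):
    """Build the lookup URL for one place name."""
    query = urlencode({"q": place_name, "maxRows": max_rows, "username": username})
    return f"{GEONAMES_SEARCH}?{query}"


def parse_candidates(response_text):
    """Turn a JSON response into a list of candidate locations."""
    payload = json.loads(response_text)
    return [
        {"name": hit["name"], "lat": float(hit["lat"]), "lon": float(hit["lng"])}
        for hit in payload.get("geonames", [])
    ]


# Canned response using the coordinates quoted on the slide.
CANNED = '{"geonames": [{"name": "State College", "lat": "40.7934", "lng": "-77.86"}]}'

url = build_search_url("State College, PA", "demo_user")
candidates = parse_candidates(CANNED)
```

A lookup usually returns several candidates for one name, which is why a disambiguation step is needed afterwards.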
  12. Disambiguation
      - Which London did you mean?
  13. Types of Ambiguity
      - Geo/Geo
        - London, UK vs London, Ontario
        - South Wales, UK vs New South Wales, Au
        - Paris, France vs Paris, Texas
      - Geo/Non Geo
        - Washington, DC vs George Washington
        - Van, Turkey vs delivery van
        - West Nile, Egypt vs West Nile Virus
      - Sort of Ambiguous
        - avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains
  14. Disambiguating Multiple Places
      - Choose A if A is a Political Entity and B is not,
      - Choose B if B is a Political Entity and A is not,
      - Choose A if A is a Region and B is not,
      - Choose B if B is a Region and A is not,
      - Choose A if A is an Ocean and B is not,
      - Choose B if B is an Ocean and A is not,
      - Choose A if A is a Populated Place and B is not,
      - Choose B if B is a Populated Place and A is not,
      - Choose A if A's population is greater than B's,
      - Choose B if B's population is greater than A's,
      - Choose A if A is an Administrative Area and B is not,
      - Choose B if B is an Administrative Area and A is not,
      - Choose A if A is a Water Feature and B is not,
      - Choose B if B is a Water Feature and A is not,
      - Choose A.
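The rule cascade above can be sketched as a pairwise comparison. The `Candidate` class and the kind labels are hypothetical names introduced for illustration, and the populations in the example are round illustrative numbers, not real figures.

```python
from dataclasses import dataclass

# Rule order taken from the slide: four feature-type rules, then population,
# then two more feature-type rules.
BEFORE_POPULATION = ("political_entity", "region", "ocean", "populated_place")
AFTER_POPULATION = ("administrative_area", "water_feature")


@dataclass(frozen=True)
class Candidate:
    name: str
    kinds: frozenset = frozenset()
    population: int = 0


def _prefer_by_kind(a, b, kinds):
    """Return the candidate that alone has the first distinguishing kind, else None."""
    for kind in kinds:
        a_has, b_has = kind in a.kinds, kind in b.kinds
        if a_has != b_has:
            return a if a_has else b
    return None


def choose(a, b):
    """Apply the slide's rules in order; fall back to A when nothing separates them."""
    winner = _prefer_by_kind(a, b, BEFORE_POPULATION)
    if winner is not None:
        return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    winner = _prefer_by_kind(a, b, AFTER_POPULATION)
    return winner if winner is not None else a


# Illustrative populations only.
london_uk = Candidate("London, UK", frozenset({"populated_place"}), 7_000_000)
london_on = Candidate("London, Ontario", frozenset({"populated_place"}), 350_000)
best = choose(london_on, london_uk)
```

Both candidates are populated places, so the feature-type rules do not separate them and the population rule picks the larger city.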
  15. Solving Geo/Non Geo Ambiguity
      - Stop word lists, hand-crafted by experience:
      - Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
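Applying a stop word list to candidate place names could be as simple as the sketch below. Only a small subset of the list is shown, and the case-insensitive match is an assumption; the slides do not say how matching was done.

```python
# A small subset of the hand-crafted list from the slide.
STOP_WORDS = frozenset(
    word.lower()
    for word in ("Province", "valley", "way", "hill", "new", "clinic", "Univ", "lab", "north", "Roche")
)


def filter_place_candidates(names, stop_words=STOP_WORDS):
    """Drop extracted names that match a stop word (case-insensitive, an assumption)."""
    return [name for name in names if name.strip().lower() not in stop_words]


kept = filter_place_candidates(["Vietnam", "Univ", "North", "Hong Kong", "Roche"])
```

Filtering before the gazetteer lookup also avoids spending web service calls on names that would be discarded anyway.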
  16. Concept Extraction
      - Automatically extract keywords or tags from article abstracts by:
        - Selecting keywords which exceed a preset frequency.
        - Passing the text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
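The frequency-based option could be sketched as below. The tokenizer, the `COMMON_WORDS` set and the threshold value are all assumptions chosen for illustration; the external tagging service is not covered.

```python
import re
from collections import Counter

# Hypothetical list of words too common to be useful as tags.
COMMON_WORDS = frozenset({"the", "and", "for", "among", "against", "were", "was"})


def extract_keywords(text, threshold=1):
    """Return words whose count exceeds the threshold, in alphabetical order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS and len(w) > 2)
    return sorted(word for word, count in counts.items() if count > threshold)


ABSTRACT = (
    "Avian influenza H5N1 spread among poultry. "
    "Poultry vaccination against H5N1 reduced avian influenza outbreaks."
)
keywords = extract_keywords(ABSTRACT)
```

In practice the threshold would need tuning against the length of the abstracts, since short texts rarely repeat a term more than a few times.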
  17. Store everything in a big database
      - Open up PostGIS and stuff in all the data keyed by article id.
        - Article
          - Citation data: authors, title, abstract, journal, volume, issue, etc.
        - Places
          - Name, Country, Latitude, Longitude, etc.
        - Concepts
          - Key phrase or word
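The three-table layout keyed by article id could be sketched as follows. SQLite is used here only so that the example runs without a server; the talk used PostGIS. All table and column names are hypothetical, and the single row is invented apart from the State College coordinates quoted earlier.

```python
import sqlite3

# Hypothetical schema: one article table plus places and concepts keyed by article id.
SCHEMA = """
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT, journal TEXT);
CREATE TABLE place (
    article_id INTEGER REFERENCES article(id),
    name TEXT, country TEXT, latitude REAL, longitude REAL
);
CREATE TABLE concept (article_id INTEGER REFERENCES article(id), phrase TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO article (id, title) VALUES (?, ?)", (1, "Example article"))
conn.execute(
    "INSERT INTO place VALUES (?, ?, ?, ?, ?)",
    (1, "State College", "US", 40.7934, -77.86),
)
conn.execute("INSERT INTO concept VALUES (?, ?)", (1, "avian influenza"))


def concepts_for_place(connection, place_name):
    """List the concepts attached to articles that mention the given place."""
    rows = connection.execute(
        "SELECT DISTINCT c.phrase FROM concept c "
        "JOIN place p ON p.article_id = c.article_id "
        "WHERE p.name = ? ORDER BY c.phrase",
        (place_name,),
    )
    return [row[0] for row in rows]


tags = concepts_for_place(conn, "State College")
```

A join of this shape is what a "tag limited by place" view would need: concepts restricted to the articles that mention a chosen place.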
  18. Provide Intuitive Front End for Users
      - Tag Cloud
        - Popularized on many web 2.0 sites such as Flickr, etc.
  19. Place Cloud
  20. Author Cloud
  21. Choose a tag
  22. Choose a place
  23. Select a child of the place
  24. Tag limited by place
  25. Implementation
      - Initially implemented as a Java servlet using a JDBC link to PostGIS.
      - Reimplemented in the last week using Ruby on Rails, with ActiveRecord connecting to PostGIS.
      - In-page mapping uses an OpenLayers WMS map client talking to GeoServer over PostGIS.
  26. Semantics and Ontologies
      - The geographic ontology is provided by the GeoNames semantic web service.
      - A query allows the lookup of parent, children and nearby features for most features.
      - Results are cached in the PostGIS database to save processing time and load on the server.
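The caching idea can be sketched as a thin wrapper around whatever function performs the remote lookup. Here an in-memory dict stands in for the PostGIS cache table, and `fake_fetch` is a hypothetical stand-in for the web service call, with invented return values.

```python
class HierarchyCache:
    """Cache child-feature lookups so each place is fetched remotely only once."""

    def __init__(self, fetch):
        self._fetch = fetch      # function that performs the remote lookup
        self._store = {}         # stands in for a database cache table
        self.remote_calls = 0

    def children(self, place_id):
        if place_id not in self._store:
            self.remote_calls += 1
            self._store[place_id] = self._fetch(place_id)
        return self._store[place_id]


def fake_fetch(place_id):
    # Hypothetical stand-in for the web service; the data is invented.
    return {"europe": ["France", "United Kingdom"]}.get(place_id, [])


cache = HierarchyCache(fake_fetch)
first = cache.children("europe")
second = cache.children("europe")
```

The second call is served from the store, so `remote_calls` stays at one however often the same place is expanded.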
  27. WordNet Ontology
  28. Conclusions
      - It is possible to construct a useful system to ingest arbitrary text and extract place names.
      - A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.
      - Semantic expansion and narrowing of searches appear useful in early experiments.
      - Providing users with a familiar and highly linked interface seems to aid exploration of the document space.