Geographic Information Retrieval From Disparate Data Sources
Published in Technology, Health & Medicine
Transcript

  • 1. Geographic Information Retrieval from Disparate Data Sources
    • Ian Turton, Anuj Jaiswal, Mark Gahegan
    • GeoVISTA Center, School of Geography, Pennsylvania State University
    • ijt1, arj135, mng1@psu.edu
  • 2. Summary
    • Information Retrieval?
    • Geographic?
    • Disparate Data Sources?
    • Does it work?
    • Semantics and Ontologies, do they help?
    • Further work?
    • Conclusions
  • 3. Information Retrieval
    • Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.
    • Wikipedia
  • 4. OR more simply
    • Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?
  • 5. Geography
    • Well we all know that geography is important.
    • Depending on who you ask, more than 80% of all information contains a geographic element.
    • Explicit:
      • Has a map coordinate
    • Implicit:
      • Has a place name
  • 6. Disparate Data Sources
    • Large collections of text containing implicit geographic references about Avian Flu and Measles:
      • PubMed abstracts
      • News Feeds (RSS)
      • WHO incident reports
  • 7. Building the System
    • Acquire data
    • Extract geographic information
    • Extract semantic and ontological information
    • Present in a form that allows easy exploration by users.
  • 8. Acquire Data
    • First extract abstracts from PubMed
    • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
    • ((avian OR bird) AND (influenza OR flu)) OR H5N1
    • Returns a structured XML file with citation data and abstract for selected papers.
    • Process XML into PostGIS database
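
The acquisition step above can be sketched roughly as follows. This is a minimal sketch, not the authors' code: it assumes the standard E-utilities `esearch.fcgi` and `efetch.fcgi` endpoints under the base URL shown on the slide, and the function names, the `retmax` value and the hand-written `SAMPLE` record are illustrative assumptions. It only builds request URLs and parses XML; it makes no network call and does not load anything into PostGIS.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"      # from the slide
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"  # from the slide


def esearch_url(term, retmax=100):
    """URL asking PubMed for the IDs of articles matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """URL fetching the citation records (as XML) for the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return BASE + "efetch.fcgi?" + urlencode(params)


def parse_articles(xml_text):
    """Pull ID, title and abstract out of a PubMed XML result."""
    root = ET.fromstring(xml_text)
    rows = []
    for art in root.iter("PubmedArticle"):
        rows.append({
            "pmid": art.findtext(".//PMID"),
            "title": art.findtext(".//ArticleTitle"),
            "abstract": art.findtext(".//AbstractText"),
        })
    return rows


# Hand-written stand-in for a real response (placeholder values only).
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>0</PMID><Article><ArticleTitle>Example title</ArticleTitle>
<Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

articles = parse_articles(SAMPLE)
```

Each dictionary returned by `parse_articles` corresponds to one row that would then be written to the database.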
  • 9. Extract Geographic Entities
    • Use FactXtractor (http://julian.mine.nu/snedemo.html)
    • Uses GATE to detect and extract Named Entities and Entity Relationships
    • Usually finds People, Places and Organizations
    • Returned as an OWL encoded ontology
    • In this case we just make use of places
  • 10.
    • <rdf:RDF xml:base="http://ist.psu.edu/sna/ontology#">
    • <owl:Class rdf:ID="Location"/>
    • <owl:Class rdf:ID="Organization"/>
    • <owl:Class rdf:ID="Person"/>
    • <owl:DatatypeProperty rdf:ID="counts"/>
    • <Location rdf:ID="Africa">
    • <counts>1</counts>
    • <mentioned_in>
    • <_Article rdf:ID="InputString0">
    • </_Article>
    • </mentioned_in>
    • </Location>
    • <Location rdf:ID="Asia">
    • <counts>1</counts>
    • <mentioned_in rdf:resource="#InputString0"/>
    • </Location>
    • <Location rdf:ID="Vietnam"/>
    • <Location rdf:ID="South_East"/>
    • <Location rdf:ID="Europe">
    • <counts>1</counts>
    • <mentioned_in rdf:resource="#InputString0"/>
    • </Location>
    • </rdf:RDF>
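
Since only the places are used, pulling the `Location` entries out of such an OWL document might look like the sketch below. It is an illustration under assumptions: the slide's fragment omits the `xmlns:rdf` and `xmlns:owl` declarations, so the trimmed `SAMPLE` here adds the standard ones so that it parses, and `extract_places` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed version of the slide's output; namespace declarations are added
# here (an assumption) because the fragment on the slide leaves them out.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
</rdf:RDF>"""


def extract_places(owl_text):
    """Return (place name, count) pairs for every Location individual."""
    root = ET.fromstring(owl_text)
    places = []
    for loc in root.iter("Location"):
        name = loc.get("{%s}ID" % RDF_NS)
        counts = loc.findtext("counts")
        # IDs use underscores for spaces; a missing count is treated as 0.
        places.append((name.replace("_", " "), int(counts) if counts else 0))
    return places


places = extract_places(SAMPLE)
```

The `owl:Class` declaration named "Location" is skipped because its tag is `owl:Class`, not `Location`.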
  • 11. GeoLocation
    • Converting a place name into a location
    • State College, PA -> (40.7934, -77.86)
    • Call the GeoNames web service to carry out a gazetteer lookup on the name.
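
A gazetteer lookup of this kind could be sketched as below, assuming the GeoNames `searchJSON` endpoint and its `q`, `maxRows` and `username` parameters. The helper names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written stand-in that reuses the coordinates from the slide rather than a captured reply; no request is actually sent.

```python
import json
from urllib.parse import urlencode


def geonames_search_url(place, username, max_rows=5):
    """Build a GeoNames search request for a place name."""
    params = {"q": place, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)


def parse_candidates(response_text):
    """Turn a search response into a list of candidate locations."""
    data = json.loads(response_text)
    return [
        {
            "name": g["name"],
            "lat": float(g["lat"]),   # coordinates arrive as strings
            "lng": float(g["lng"]),
            "country": g.get("countryCode"),
        }
        for g in data.get("geonames", [])
    ]


# Stand-in response, for illustration only.
SAMPLE_RESPONSE = (
    '{"geonames": [{"name": "State College", "lat": "40.7934", '
    '"lng": "-77.86", "countryCode": "US"}]}'
)

url = geonames_search_url("State College, PA", username="demo")
candidates = parse_candidates(SAMPLE_RESPONSE)
```

A name usually returns several candidates, which is why a disambiguation step is needed afterwards.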
  • 12. Disambiguation
    • Which London did you mean?
  • 13. Types of Ambiguity
    • Geo/Geo
      • London, UK vs London, Ontario
      • South Wales, UK vs New South Wales, Australia
      • Paris, France vs Paris, Texas
    • Geo/Non Geo
      • Washington, DC vs George Washington
      • Van, Turkey vs delivery van
      • West Nile, Egypt vs West Nile Virus
    • Sort of Ambiguous
      • avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains
  • 14. Disambiguating Multiple Places
    • Choose A if A is a Political Entity and B is not,
    • Choose B if B is a Political Entity and A is not,
    • Choose A if A is a Region and B is not,
    • Choose B if B is a Region and A is not,
    • Choose A if A is an Ocean and B is not,
    • Choose B if B is an Ocean and A is not,
    • Choose A if A is a Populated Place and B is not,
    • Choose B if B is a Populated Place and A is not,
    • Choose A if A's population is greater than B's,
    • Choose B if B's population is greater than A's,
    • Choose A if A is an Administrative Area and B is not,
    • Choose B if B is an Administrative Area and A is not,
    • Choose A if A is a Water Feature and B is not,
    • Choose B if B is a Water Feature and A is not,
    • Choose A.
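
The rule cascade above can be written compactly. The following is a sketch under assumptions: the `Candidate` class, the lowercase feature-type labels and the pairwise fold in `disambiguate` are illustrative choices, and the populations in the usage lines are rough placeholder figures, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    kind: str            # e.g. "political entity", "populated place"
    population: int = 0


# Feature types tested before and after the population comparison,
# in the order given on the slide.
BEFORE_POPULATION = ("political entity", "region", "ocean", "populated place")
AFTER_POPULATION = ("administrative area", "water feature")


def _prefer_kind(a, b, kinds):
    """Return the candidate that has a listed kind the other lacks."""
    for kind in kinds:
        if a.kind == kind and b.kind != kind:
            return a
        if b.kind == kind and a.kind != kind:
            return b
    return None


def choose(a, b):
    """Apply the rules to a pair of candidates, defaulting to A."""
    winner = _prefer_kind(a, b, BEFORE_POPULATION)
    if winner is not None:
        return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    winner = _prefer_kind(a, b, AFTER_POPULATION)
    return winner if winner is not None else a


def disambiguate(candidates):
    """Fold the pairwise rule over a list of candidates (an assumption)."""
    best = candidates[0]
    for other in candidates[1:]:
        best = choose(best, other)
    return best


# Placeholder populations, for illustration only.
london_on = Candidate("London, Ontario", "populated place", 350_000)
london_uk = Candidate("London, UK", "populated place", 7_000_000)
best = disambiguate([london_on, london_uk])
```

Here both candidates are populated places, so the type rules do not separate them and the larger population decides.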
  • 15. Solving Geo/Non Geo Ambiguity
    • Stop word lists – hand crafted by experience
    • Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
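
Applying such a list is a simple filter. In this sketch only a handful of the words above are included, and matching case-insensitively on the whole candidate string is an assumption, as is the function name.

```python
# A small subset of the slide's stop word list, lowercased for matching.
STOP_WORDS = {
    w.lower()
    for w in ("Province", "valley", "way", "hill", "new", "cross", "red",
              "clinic", "general", "Virology", "Univ", "north", "west")
}


def filter_places(candidates, stop_words=STOP_WORDS):
    """Drop candidate place names that are on the stop word list."""
    return [c for c in candidates if c.strip().lower() not in stop_words]


kept = filter_places(["Vietnam", "Virology", "West", "Europe"])
```

Only exact matches are removed, so "West" is dropped while a longer name containing the word would be kept.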
  • 16. Concept Extraction
    • Automatically extract keywords or tags from article abstracts by
      • Selecting keywords which exceed a preset frequency.
      • Passing the text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
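
The first option, frequency-based selection, might look like the sketch below. The tokenizer, the short list of common words to ignore, and the threshold of 2 are all assumptions made for illustration; the talk does not state the actual settings.

```python
import re
from collections import Counter

# Very common words that should never become tags (illustrative list).
COMMON = {"the", "of", "and", "in", "a", "to", "is", "was", "were",
          "for", "with"}


def frequent_keywords(text, min_count=2):
    """Return words occurring at least `min_count` times, sorted A to Z."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON and len(w) > 2)
    return sorted(w for w, n in counts.items() if n >= min_count)


abstract = ("Avian influenza outbreaks in poultry. "
            "Influenza viruses spread among poultry flocks.")
keywords = frequent_keywords(abstract)
```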
  • 17. Store everything in a big database
    • Open up PostGIS and stuff in all the data keyed by article id.
      • Article
        • Citation data – authors, title, abstract, journal, volume, issue, etc.
      • Places
        • Name, Country, Latitude, Longitude, etc
      • Concepts
        • Key phrase or word
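
The three tables and a "tags for a place" query can be sketched as follows. Note the substitution: this uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostGIS, so there is no geometry column, and the table names, column names and sample rows are assumptions made for illustration.

```python
import sqlite3


def make_db():
    """Create the article, place and concept tables keyed by article id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT,
                              abstract TEXT, journal TEXT);
        CREATE TABLE place   (article_id INTEGER REFERENCES article(id),
                              name TEXT, country TEXT,
                              latitude REAL, longitude REAL);
        CREATE TABLE concept (article_id INTEGER REFERENCES article(id),
                              phrase TEXT);
    """)
    return conn


def tags_for_place(conn, place_name):
    """Key phrases of articles mentioning a place, most frequent first."""
    return conn.execute(
        """SELECT c.phrase, COUNT(DISTINCT c.article_id) AS n
           FROM concept c JOIN place p ON p.article_id = c.article_id
           WHERE p.name = ?
           GROUP BY c.phrase
           ORDER BY n DESC, c.phrase""",
        (place_name,),
    ).fetchall()


conn = make_db()
# Made-up rows, for illustration only.
conn.executemany("INSERT INTO article (id, title) VALUES (?, ?)",
                 [(1, "Article one"), (2, "Article two"), (3, "Article three")])
conn.executemany("INSERT INTO place (article_id, name) VALUES (?, ?)",
                 [(1, "Vietnam"), (2, "Vietnam"), (3, "Europe")])
conn.executemany("INSERT INTO concept (article_id, phrase) VALUES (?, ?)",
                 [(1, "h5n1"), (1, "poultry"), (2, "h5n1"), (3, "measles")])
tags = tags_for_place(conn, "Vietnam")
```

The counts returned by `tags_for_place` are the kind of numbers a tag cloud restricted to one place would be drawn from.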
  • 18. Provide Intuitive Front End for Users
    • Tag Cloud
      • Popularized on many Web 2.0 sites such as Flickr, del.icio.us, CiteULike.org, etc.
  • 19. Place Cloud
  • 20. Author Cloud
  • 21. Choose a tag
  • 22. Choose a place
  • 23. Select a child of the place
  • 24. Tag limited by place
  • 25. Implementation
    • Initially implemented as a Java servlet using a JDBC link to PostGIS.
    • Reimplemented in the last week using Ruby on Rails, with ActiveRecord connecting to PostGIS.
    • In-page mapping uses an OpenLayers WMS map client talking to GeoServer over PostGIS.
  • 26. Semantics and Ontologies
    • Geographic ontology is provided by GeoNames semantic web service.
    • A query allows the lookup of parent, children and nearby features for most features.
    • Results are cached in the PostGIS database to save processing time and reduce load on the server.
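
The caching idea can be illustrated with a small wrapper. This is a sketch only: an in-memory dictionary stands in for the PostGIS cache table, `fetch` is a hypothetical callable that would wrap the GeoNames service, and `fake_fetch` exists purely to show that repeated lookups do not hit the service again.

```python
class HierarchyCache:
    """Remember parent/children/nearby lookups so each is fetched once."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(place_id, relation) -> list
        self._store = {}      # stand-in for the database cache table

    def lookup(self, place_id, relation):
        key = (place_id, relation)
        if key not in self._store:
            self._store[key] = self._fetch(place_id, relation)
        return self._store[key]


calls = []


def fake_fetch(place_id, relation):
    """Pretend remote service that records every call it receives."""
    calls.append((place_id, relation))
    return ["%s of %s" % (relation, place_id)]


cache = HierarchyCache(fake_fetch)
first = cache.lookup("place-1", "children")
second = cache.lookup("place-1", "children")   # served from the cache
```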
  • 27. WordNet Ontology
  • 28. Conclusions
    • It is possible to construct a useful system to ingest arbitrary text and extract place names.
    • A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.
    • Semantic expansion and narrowing of searches appear useful in early experiments.
    • Providing users with a familiar and highly linked interface seems to aid exploration of the document space.