
Geographic Information Retrieval From Disparate Data Sources

From ianturton, 8 months ago


Slideshow transcript

Slide 1: Geographic Information Retrieval from Disparate Data Sources
Ian Turton, Anuj Jaiswal, Mark Gahegan
GeoVISTA Center, School of Geography, Pennsylvania State University
ijt1,arj135,mng1@psu.edu

Slide 2: Summary
- Information Retrieval?
- Geographic?
- Disparate Data Sources?
- Does it work?
- Semantics and Ontologies, do they help?
- Further work?
- Conclusions

Slide 3: Information Retrieval
"Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web." (Wikipedia)

Slide 4: Or, more simply
Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?

Slide 5: Geography
- Well, we all know that geography is important.
- Depending on who you ask, more than 80% of all information contains a geographic element.
- Explicit: has a map coordinate.
- Implicit: has a place name.

Slide 6: Disparate Data Sources
Large collections of text containing implicit geographic references about avian flu and measles:
- PubMed abstracts
- News feeds (RSS)
- WHO incident reports

Slide 7: Building the System
- Acquire data.
- Extract geographic information.
- Extract semantic and ontological information.
- Present in a form that allows easy exploration by users.

Slide 8: Acquire Data
- First extract abstracts from PubMed via http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
- Query: ((avian OR bird) AND (influenza OR flu)) OR H5N1
- Returns a structured XML file with citation data and abstract for the selected papers.
- Process the XML into a PostGIS database.
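The acquisition step on this slide can be sketched as URL construction against the E-utilities base address it quotes. This is a minimal sketch, not the authors' code: the two-step ESearch then EFetch flow, the function names, and the `retmax` value are assumptions made for illustration, and no network request is made here.

```python
from urllib.parse import urlencode

# Base URL and query string are taken from the slide.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"


def esearch_url(term, retmax=100):
    """Build an ESearch URL that asks for PubMed IDs matching `term`.

    `retmax` (the number of IDs to return) is an illustrative default.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(pmids):
    """Build an EFetch URL that asks for the XML records of the given IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


print(esearch_url(QUERY))
```

The XML returned by the second URL would then be parsed and loaded into the database, which is outside this sketch.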

Slide 9: Extract Geographic Entities
- Use FactXtractor (http://julian.mine.nu/snedemo.html).
- Uses GATE to detect and extract Named Entities and Entity Relationships.
- Usually finds People, Places and Organizations.
- Returned as an OWL-encoded ontology.
- In this case we just make use of places.

Slide 10:
<rdf:RDF xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <owl:Class rdf:ID="Organization"/>
  <owl:Class rdf:ID="Person"/>
  <owl:DatatypeProperty rdf:ID="counts"/>
  <Location rdf:ID="Africa">
    <counts>1</counts>
    <mentioned_in>
      <_Article rdf:ID="InputString0">
      </_Article>
    </mentioned_in>
  </Location>
  <Location rdf:ID="Asia">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
  <Location rdf:ID="Europe">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
</rdf:RDF>
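Pulling the place names out of output like the fragment above could look roughly as follows. This is a sketch under assumptions: the slide's fragment omits the `xmlns:rdf` and `xmlns:owl` declarations, so the sample below adds the standard ones to make it parseable; the function name, the treatment of a missing `<counts>` as 0, and the replacement of underscores with spaces are illustrative choices, not documented FactXtractor behaviour.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened version of the slide's fragment, with namespace declarations
# added (an assumption) so that a standard XML parser accepts it.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
</rdf:RDF>"""


def extract_places(owl_text):
    """Return {place name: mention count} for every <Location> element."""
    root = ET.fromstring(owl_text)
    places = {}
    for loc in root.iter("Location"):
        name = loc.get("{%s}ID" % RDF_NS)
        if name is None:
            continue
        counts = loc.findtext("counts")
        # Missing <counts> is treated as 0 here (an illustrative choice).
        places[name.replace("_", " ")] = int(counts) if counts else 0
    return places


print(extract_places(SAMPLE))
```

The `owl:Class` declaration named "Location" is skipped because only elements whose tag is `Location` are visited.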

Slide 11: GeoLocation
- Converting a place name into a location:
  State College, PA -> (40.7934, -77.86)
- Call the GeoNames web service to carry out a gazetteer lookup on the name.
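A gazetteer lookup of this kind can be sketched as building a GeoNames search URL and reading coordinates from the JSON reply. Assumptions: the `api.geonames.org/searchJSON` endpoint and its `username` parameter reflect the current public service rather than whatever the authors called, the function names are hypothetical, and `SAMPLE_RESPONSE` is a hand-written stand-in that reuses the slide's coordinates rather than a captured reply.

```python
import json
from urllib.parse import urlencode


def geonames_search_url(place, username, max_rows=5):
    """Build a GeoNames search URL for a place name.

    `username` is the account name the service expects; `max_rows`
    is an illustrative default.
    """
    params = {"q": place, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)


def parse_candidates(response_text):
    """Turn a search reply into a list of (name, latitude, longitude)."""
    data = json.loads(response_text)
    return [
        (g["name"], float(g["lat"]), float(g["lng"]))
        for g in data.get("geonames", [])
    ]


# Hand-written stand-in for a reply, using the coordinates from the slide.
SAMPLE_RESPONSE = (
    '{"geonames": [{"name": "State College", '
    '"lat": "40.7934", "lng": "-77.86"}]}'
)

print(geonames_search_url("State College, PA", "your_username"))
print(parse_candidates(SAMPLE_RESPONSE))
```

A name that matches several gazetteer entries yields several candidates, which is where disambiguation comes in.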

Slide 12: Disambiguation Which London did you mean? 

Slide 13: Types of Ambiguity
- Geo/Geo
  - London, UK vs London, Ontario
  - South Wales, UK vs New South Wales, Australia
  - Paris, France vs Paris, Texas
- Geo/Non Geo
  - Washington, DC vs George Washington
  - Van, Turkey vs delivery van
  - West Nile, Egypt vs West Nile Virus
- Sort of Ambiguous
  - avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains

Slide 14: Disambiguating Multiple Places
- Choose A if A is a Political Entity and B is not,
- Choose B if B is a Political Entity and A is not,
- Choose A if A is a Region and B is not,
- Choose B if B is a Region and A is not,
- Choose A if A is an Ocean and B is not,
- Choose B if B is an Ocean and A is not,
- Choose A if A is a Populated Place and B is not,
- Choose B if B is a Populated Place and A is not,
- Choose A if A's population is greater than B's,
- Choose B if B's population is greater than A's,
- Choose A if A is an Administrative Area and B is not,
- Choose B if B is an Administrative Area and A is not,
- Choose A if A is a Water Feature and B is not,
- Choose B if B is a Water Feature and A is not,
- Choose A.
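The rule cascade above can be written as an ordered comparison between two candidates. In this sketch the `Candidate` class, the `kind` labels, and the assumption that each candidate has exactly one kind are hypothetical simplifications; the slide does not say how feature types are represented. The populations in the usage lines are round illustrative numbers, not gazetteer values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    kind: str            # hypothetical label, e.g. "political", "populated"
    population: int = 0


# Feature kinds in the order the slide tests them, split around the
# population comparison that sits in the middle of the cascade.
KINDS_BEFORE_POPULATION = ["political", "region", "ocean", "populated"]
KINDS_AFTER_POPULATION = ["admin", "water"]


def _by_kind(a, b, kinds):
    """Return the candidate that alone has the first decisive kind, else None."""
    for kind in kinds:
        if a.kind == kind and b.kind != kind:
            return a
        if b.kind == kind and a.kind != kind:
            return b
    return None


def choose(a, b):
    """Apply the slide's rules in order; fall back to A when nothing decides."""
    winner = _by_kind(a, b, KINDS_BEFORE_POPULATION)
    if winner is not None:
        return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    winner = _by_kind(a, b, KINDS_AFTER_POPULATION)
    return winner if winner is not None else a


# Illustrative round numbers only.
london_uk = Candidate("London, UK", "populated", 7_000_000)
london_on = Candidate("London, Ontario", "populated", 350_000)
print(choose(london_on, london_uk).name)
```

Because both candidates are populated places, the type rules do not decide and the population rule picks the larger city.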

Slide 15: Solving Geo/Non Geo Ambiguity
- Stop word lists: hand-crafted from experience.
- Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
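Applying a stop word list like this one amounts to discarding any candidate place name that appears in it. The sketch below uses only a handful of the slide's entries, and it assumes a case-insensitive match on the whole candidate string; the slide does not state how matching was done.

```python
# A small subset of the slide's stop word list, lower-cased for matching.
STOP_WORDS = {
    w.lower()
    for w in ["Province", "valley", "hill", "clinic", "Virology",
              "Hosp", "Univ", "north", "west"]
}


def drop_stop_words(candidates):
    """Keep only candidate place names that are not on the stop list."""
    return [c for c in candidates if c.strip().lower() not in STOP_WORDS]


print(drop_stop_words(["Vietnam", "Hosp", "Virology", "North", "Hong Kong"]))
```

Whole-string matching keeps names such as "North Carolina" while still rejecting a bare "North".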

Slide 16: Concept Extraction
Automatically extract keywords or tags from article abstracts by:
- Selecting keywords which exceed a preset frequency.
- Passing text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
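The first approach on this slide, frequency-based keyword selection, can be sketched in a few lines. The tokenisation, the small `COMMON_WORDS` set, the minimum word length, and the default threshold are all assumptions for illustration; the Yahoo! service is not modelled here.

```python
import re
from collections import Counter

# Tiny illustrative list of words too common to be useful tags.
COMMON_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were",
                "is", "for", "with"}


def frequent_keywords(text, threshold=1):
    """Return, in alphabetical order, words occurring more than `threshold` times."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(
        w for w in words if w not in COMMON_WORDS and len(w) > 2
    )
    return sorted(w for w, n in counts.items() if n > threshold)


abstract = ("H5N1 virus was isolated from poultry. The virus spread "
            "to wild birds, and poultry were culled.")
print(frequent_keywords(abstract))
```

With a threshold of 1, only words that appear at least twice in the abstract are kept as tags.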

Slide 17: Store everything in a big database
Open up PostGIS and stuff in all the data, keyed by article id.
- Article
  - Citation data: authors, title, abstract, journal, volume, issue, etc.
- Places
  - Name, Country, Latitude, Longitude, etc.
- Concepts
  - Key phrase or word
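The three groups of data keyed by article id suggest a layout along the following lines. This is only a sketch: it uses Python's built-in SQLite as a stand-in for PostGIS so that it runs without a server, stores coordinates as plain numbers rather than geometry columns, and all table and column names are guesses. The query shows the kind of join behind the later "Tag limited by place" slide. The inserted rows are placeholders, not real records.

```python
import sqlite3

# Hypothetical schema; SQLite is a stand-in for PostGIS here.
SCHEMA = """
CREATE TABLE article (
    id INTEGER PRIMARY KEY, title TEXT, authors TEXT, abstract TEXT,
    journal TEXT, volume TEXT, issue TEXT
);
CREATE TABLE place (
    article_id INTEGER REFERENCES article(id),
    name TEXT, country TEXT, latitude REAL, longitude REAL
);
CREATE TABLE concept (
    article_id INTEGER REFERENCES article(id), phrase TEXT
);
"""


def tags_for_place(db, place_name):
    """Return (phrase, count) pairs for articles that mention `place_name`."""
    rows = db.execute(
        "SELECT c.phrase, COUNT(*) AS n FROM concept c "
        "JOIN place p ON p.article_id = c.article_id "
        "WHERE p.name = ? GROUP BY c.phrase ORDER BY n DESC, c.phrase",
        (place_name,),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
# Placeholder rows for illustration only.
db.execute("INSERT INTO article (id, title) VALUES (1, 'Example article')")
db.execute("INSERT INTO place VALUES (1, 'Vietnam', 'Vietnam', NULL, NULL)")
db.execute("INSERT INTO concept VALUES (1, 'avian influenza')")
print(tags_for_place(db, "Vietnam"))
```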

Slide 18: Provide Intuitive Front End for Users
- Tag Cloud
  - Popularized on many Web 2.0 sites such as Flickr, del.icio.us, citeUlike.org, etc.
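A tag cloud conveys frequency through font size, so the core of one is a mapping from counts to sizes. The sketch below assumes simple linear scaling between a minimum and maximum pixel size; the slides do not say which scaling the system used, and the function name and defaults are hypothetical.

```python
def tag_sizes(counts, min_px=10, max_px=32):
    """Map each tag's count to a font size, scaled linearly.

    The least frequent tag gets `min_px`, the most frequent `max_px`.
    If every tag has the same count, all tags get `min_px`.
    """
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for tag, n in counts.items():
        frac = (n - lo) / span if span else 0.0
        sizes[tag] = round(min_px + frac * (max_px - min_px))
    return sizes


print(tag_sizes({"influenza": 5, "poultry": 3, "vaccine": 1}, 10, 30))
```

The same function would serve for place and author clouds, since only the counts differ.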

Slide 19: Place Cloud

Slide 20: Author Cloud

Slide 21: Choose a tag

Slide 22: Choose a place

Slide 23: Select a child of the place

Slide 24: Tag limited by place

Slide 25: Implementation
- Initially implemented as a Java servlet using a JDBC link to PostGIS.
- Reimplemented in the last week using Ruby on Rails, with ActiveRecord talking to PostGIS.
- In-page mapping uses an OpenLayers WMS map client connected to GeoServer over PostGIS.

Slide 26: Semantics and Ontologies
- Geographic ontology is provided by the GeoNames semantic web service.
- A query allows the lookup of parent, children and nearby features for most features.
- Results are cached in the PostGIS database to save processing time and load on the server.
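The caching idea on this slide can be sketched as a wrapper that calls the remote service only for lookups it has not seen before. Here an in-memory dictionary stands in for the PostGIS cache, and `fetch`, the relation names, and the feature id `123` are hypothetical placeholders; a real `fetch` would call the GeoNames service.

```python
def make_cached_lookup(fetch):
    """Wrap `fetch(geoname_id, relation)` so each distinct query runs once.

    `fetch` is a caller-supplied function (hypothetical here) that asks the
    remote service for a feature's parent, children, or nearby features.
    """
    cache = {}

    def lookup(geoname_id, relation):
        key = (geoname_id, relation)
        if key not in cache:
            cache[key] = fetch(geoname_id, relation)
        return cache[key]

    return lookup


# Stand-in for the remote call, recording how often it is invoked.
calls = []


def fake_fetch(geoname_id, relation):
    calls.append((geoname_id, relation))
    return ["child-a", "child-b"]  # placeholder result


lookup = make_cached_lookup(fake_fetch)
lookup(123, "children")
lookup(123, "children")  # served from the cache, no second remote call
print(len(calls))
```

Persisting the cache in the database, as the slide describes, means the saving carries over between sessions and users.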

Slide 27: WordNet Ontology

Slide 28: Conclusions
- It is possible to construct a useful system to ingest arbitrary text and extract place names.
- A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.
- Semantic expansion and narrowing of searches appears useful in early experiments.
- Providing users with a familiar, and highly linked, interface seems to aid exploration of the document space.