Geographic Information Retrieval Systems.


Used for a course project in my Final Semester at DA-IICT

Published in: Technology
  1. Geographic Information Retrieval
  2. An Overview
  3. Problem Statement
  4. Introduction
     ● Geographic Information Retrieval (GIR) can be seen as a specialized branch of traditional Information Retrieval.
     ● Information that has a relationship to geographic space is called georeferenced information, a term used frequently in GIR.
     ● Georeferenced information appears in all kinds of media, e.g. structured data such as maps, land surveys, airborne and satellite images, and tabulated observations.
     ● It can also be used by researchers looking for a certain area, for example an area inhabited by particular animals or affected by an epidemic.
  5. Properties of Georeferenced Information
     ● Much of the information available in digital libraries and on the Internet is georeferenced, although mostly it is not expressed in terms of geographic coordinates.
     ● The geographical location and extent of a place name is often called its geographic footprint, and it is given by coordinates (longitude, latitude).
     ● GIR requires that place names, and phrases that refer directly or indirectly to place names, be resolved and translated into footprints that can be indexed.
  6. General Problems in GIR
     Ambiguity / lack of precision in place names:
     ● Firstly, several places can share the same name, so a place name is unique only within a limited geographic area.
     ● Secondly, some place names occurring in texts are temporal or cultural conventions rather than official names. The reader needs to understand the time, context or cultural environment in which the name is used to link it to a geographic location.
     ● Thirdly, some place names change over time, e.g. Bangalore to Bengaluru, Calcutta to Kolkata.
     ● Fourthly, the geographic extent that a place name denotes can be extended, reduced or otherwise changed over time.
  7. General Problems in GIR (contd.)
     ○ Fifthly, the borders of a location can be fuzzy. (Kashmir?)
     ○ The same place name can be written differently in different texts, either because the author has misspelled the name or because there are several legal spellings of the same place name.
     Information being fuzzy:
     ○ "About 200 kilometers south of the capital of Russia": both the direction and the distance may vary. South Africa has three capitals, so in a case like that even the reference point is ambiguous.
     ○ People are often imprecise in giving geographic directions, using one of the four general directions (north, south, east or west) when the actual direction lies somewhere in between.
  8. Impact of the cognitive model on Geographic Information Retrieval
     ● Human understanding of geographic location is of two kinds: procedural and survey-based.
     ● Survey: involves looking at maps and finding geographic locations on them.
     ● Procedural: involves exploring and navigating through the place to get a feel for it.
     ● Using procedural knowledge to locate or gain information is particularly difficult, as it contains many ambiguous human phrases.
  9. Cognitive model (continued)
     ● People link geographic distance with time: when talking about going from A to B, people tend to express the distance as travel time, e.g. "It takes two hours to get from A to B by car."
     ● Topology and metric distances: people are very good at mentioning topological aspects of a place, such as inclusion (e.g. the names of the places within an area) or coincidence (e.g. "this place is at the same place as ...").
     ● People have biases towards the east-west or north-south direction: people have a distorted view of geographical areas and only a vague sense of direction when giving specifics. E.g. when asked where South America lies with respect to North America, the answer is generally "south", while in reality it lies to the south-east.
  10. Geo-referencing using Gazetteers
     Gazetteer: a form of index that relates place names to the coordinates of locations and extents.
     Here we focus on automatic geo-referencing based on the contents of the document's text alone.
     Most automated approaches to geo-referencing combine place name identification with natural language processing. The NLP step identifies phrases that modify the location pointed to by a place name ("200 km south of Moscow"), or that indicate a geo-reference without actually mentioning a specific place name ("Rosenborg's home field").
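
A minimal sketch of resolving a modifier phrase such as "200 km south of Moscow" into a point footprint. Everything here is an assumption made for illustration: the `PLACES` dictionary stands in for a real gazetteer, the Moscow coordinates are approximate, the regular expression covers only the four general directions, and the offset uses a spherical-Earth approximation. It also returns a single crisp point, which ignores the fuzziness of such phrases.

```python
import math
import re

# Approximate length of one degree of latitude on a spherical Earth (radius 6371 km).
KM_PER_DEGREE_LAT = 111.195

# Hypothetical one-entry gazetteer: name -> (longitude, latitude), approximate values.
PLACES = {"moscow": (37.62, 55.75)}

# Matches phrases like "200 km south of Moscow" or "50 km east of the Moscow".
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*km\s+(north|south|east|west)\s+of\s+(?:the\s+)?(\w+)",
    re.IGNORECASE,
)


def resolve_offset_phrase(phrase):
    """Return (longitude, latitude) for the phrase, or None if it cannot be resolved."""
    match = PATTERN.search(phrase)
    if match is None:
        return None
    distance_km = float(match.group(1))
    direction = match.group(2).lower()
    place = match.group(3).lower()
    if place not in PLACES:
        return None
    lon, lat = PLACES[place]
    if direction in ("north", "south"):
        sign = 1.0 if direction == "north" else -1.0
        lat += sign * distance_km / KM_PER_DEGREE_LAT
    else:
        sign = 1.0 if direction == "east" else -1.0
        # A degree of longitude shrinks with the cosine of the latitude.
        lon += sign * distance_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon, lat)


footprint = resolve_offset_phrase("200 km south of Moscow")
```

A fuller treatment would return a region rather than a point, since both the distance and the direction in such phrases are imprecise.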
  11. Geo-referencing (continued)
     Gazetteers have three basic components:
     ● The name is the textual designator of a geographic location.
     ● The location is the coordinates of a point, line or area on the earth's surface pointed to by a name.
     ● The feature type is the type of location that a name points to (forest, agricultural area, river, inhabited location, etc.).
     The location that a place name refers to (the place name's footprint) can be given as a point, a bounding box or a polygon, all represented by coordinates.
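
The three components can be sketched as a small data structure. This is an illustrative assumption, not the layout of any particular gazetteer: the class and field names are made up, and the coordinates for the two cities named Hyderabad (one in India, one in Pakistan) are approximate. The lookup returns every entry sharing a name, which shows the ambiguity problem from the earlier slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude)


@dataclass(frozen=True)
class GazetteerEntry:
    name: str                          # textual designator
    feature_type: str                  # e.g. "river", "inhabited location"
    footprint: Tuple[Coordinate, ...]  # one point, two box corners, or polygon vertices


class Gazetteer:
    """Hypothetical in-memory gazetteer indexed by lower-cased name."""

    def __init__(self) -> None:
        self._by_name: Dict[str, List[GazetteerEntry]] = defaultdict(list)

    def add(self, entry: GazetteerEntry) -> None:
        self._by_name[entry.name.lower()].append(entry)

    def lookup(self, name: str) -> List[GazetteerEntry]:
        # Several places can share one name, so a list is returned.
        return list(self._by_name.get(name.lower(), []))


gazetteer = Gazetteer()
gazetteer.add(GazetteerEntry("Hyderabad", "inhabited location", ((78.49, 17.39),)))
gazetteer.add(GazetteerEntry("Hyderabad", "inhabited location", ((68.37, 25.39),)))

matches = gazetteer.lookup("hyderabad")
```

With two matches for one name, a GIR system still has to disambiguate, for example from other place names in the same document.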
  12. Geo-referencing (continued)
     Centroid point:
     ● Vague in terms of the geometry and size of the area.
     ● Requires little data storage.
  13. Geo-referencing (continued)
     Bounding box:
     ● Gives a better idea of the entire referenced area.
     ● Does not require a lot of data storage.
     ● However, it overlaps the areas around it and is inaccurate.
  14. Geo-referencing (continued)
     Approximated polygon approach:
     ● Most accurate in terms of referencing.
     ● However, it takes a lot of data storage space.
     The best approach would be something in between the polygon and bounding box approaches, such as a fixed-points polygon approach.
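
The trade-off between the three footprint representations can be sketched by deriving the two cheaper ones from a polygon. The function names and the triangular test region are assumptions for illustration. The centroid here is the plain average of the vertices, which is only a rough stand-in for the true area centroid of a general polygon, and the coordinates are treated as planar.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y): four numbers, whatever the polygon size."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def vertex_centroid(polygon: List[Point]) -> Point:
    """Average of the vertices: two numbers, but all shape and size information is lost."""
    n = len(polygon)
    return (sum(x for x, _ in polygon) / n, sum(y for _, y in polygon) / n)


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon by the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


region = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]  # hypothetical triangular region
box = bounding_box(region)
centroid = vertex_centroid(region)
box_area = (box[2] - box[0]) * (box[3] - box[1])
overestimate = box_area / polygon_area(region)  # how much the box exaggerates the area
```

For this triangle the bounding box covers twice the real area, which is the kind of overlap with neighbouring areas that the bounding box slide warns about.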
  15. Searching for Georeferenced Information
     ● Letting the user specify one or more place names as keywords in a traditional keyword-based query. When parsing the query, the GIR/IR system treats the place names it finds as special keywords that indicate the geographical scope of the user's information need. E.g. Googling for restaurants around you.
     ● Letting the user specify the geographic constraint of a query by drawing on one or more maps. E.g. Google Maps.
     ● And what about GPS apps like "Here and Now" and "Google Latitude"?
  16. Searching for Georeferenced Information
     Typical queries:
     ○ Point-in-polygon queries: asking for georeferenced information that contains, surrounds or refers to a particular geographic point location.
     ○ Region queries: asking for anything contained in, adjacent to, or overlapping a region.
     ○ Distance and buffer zone queries: asking for information within some fixed distance of a geographic object (point, line, polygon).
     ○ Path queries: asking for network traversal information, which requires the presence of a network structure that can be queried.
     ○ Multimedia queries: combining multiple georeferenced information sources in resolving a query.
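
A point-in-polygon query can be sketched with the standard ray-casting test: cast a horizontal ray from the query point and count how many polygon edges it crosses. The function name and the square test footprint are assumptions for illustration, coordinates are treated as planar, and points exactly on the boundary are not handled specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: an odd number of edge crossings means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]  # hypothetical footprint
hit = point_in_polygon((2.0, 2.0), square)
miss = point_in_polygon((5.0, 2.0), square)
```

In a retrieval setting the same test would be run against the polygon footprint of every candidate document, usually after a cheaper bounding box check.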
  17. Related Projects: SPIRIT
     SPIRIT (Spatially-aware information retrieval on the internet) was funded by the EC Fifth Framework Programme. It aimed to improve search capabilities on the internet by using geographical and conceptual ontologies to model both the vocabulary and the spatial structure of places for the purposes of IR.
     The ontology is envisioned as an extension to traditional gazetteers that relates locations to one another and helps rank hits based on geographic properties.
     ● ontologies that model geographical terminology;
     ● query expansion and relevance ranking procedures based on the geographical ontologies;
     ● machine learning techniques for extracting geographical context from web documents and for generating metadata providing spatial context;
     ● a multi-modal user interface providing textual input and interactive map feedback on the context of retrieved documents;
     ● spatial indices for web collections.
  18. Geo-Ontologies
     Ontologies relating geographical terminology and spatial relationships:
     ● Reference to a geographic place: <PL-Name, PL-Type, {(x,y)}>
        ○ e.g. <Charminar, Monument, {(x,y)}>
     ● Relative place reference: <Spatial Relationship, PL-Name, Type, PL-FP>
        ○ e.g. <In, Hyderabad, City, {(x,y)}>
     A query to SPIRIT will contain one or more place references (PL-REF). Geographic content is a set of <Place reference> expressions, and the geometric footprint is a function of this set.
     Geo-ontologies can be applied in:
     1) User's query interpretation (together with domain-specific ontologies): disambiguation of place names.
     2) System query formulation: generating alternate names and spatially associated names.
     3) Metadata extraction: extracting information from free-text documents to generate footprint(s).
     4) Relevance ranking: potential for geographical relevance ranking (Dominos Pizza? :) )
  19. Geo-Ontologies
     Ontology: a "formal, explicit specification of a shared conceptualisation".
  20. Geo-Ontologies
     ● Types of atomic queries:
        ○ A place name
        ○ An aspatial entity with a relation to a place name
        ○ An aspatial entity with a spatial relation to a place name
        ○ A place name with a spatial relation to a place name
        ○ A place type with a spatial relation to a place name
        ○ A place type with a spatial relation to a place type
     ● Geo Ontology = Geographic Feature Ontology + Geographic Type Ontology + Spatial Relation Ontology
  21. User evaluation of the SPIRIT prototype gave results consistent with SPIRIT's priorities on innovative features. Yet users reported a feeling of frustration, which shows that their requirements go beyond what SPIRIT achieved and that there is still more work to be done in this area.
     The last publication on the website dates back to 2005.
  22. Relevance
     In Information Retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user.
     Geographic Information Retrieval is concerned with retrieving documents in response to a spatially related query. Thus, the ranking of documents by both textual and spatial relevance has to be considered.
     The most common way to return a set of documents obtained from a Web query is as a ranked list. The search engine attempts to determine which document seems most relevant to the user and puts it first in the list. In short, every document receives a score, or distance to the query, and the returned documents are sorted by this score or distance.
     There are situations where sorting by a single score may not be the most useful approach. For a more complex query, composed of more than one query term or aspect, documents can also be returned with two or more scores instead of one.
  23. For example, the Web search could be for campsites in the neighborhood of Neuschwanstein, and the documents returned ideally have one score for the query term "camping" and one score for the proximity to Neuschwanstein. This implies that a Web document resulting from this query can be mapped to a point in the 2-dimensional plane, where both axes represent a score.
     The map indicates campsites near Neuschwanstein Castle, which is situated close to Schwangau, with the distance to the castle on the x-axis and the rating on the y-axis.
  24. Another weakness of our methods lies in the way we treat multiple-footprint documents. While we assume that a query can have only one footprint (a user is interested in only one location), documents may have multiple footprints (refer to more than one location).
     The method we followed so far to calculate the spatial score considers only the best-matching document footprint. For example, if a user is looking for "airports near London", a document that refers to both "Gatwick" and "Stansted" is scored as referring only to "Gatwick", since it is the nearer airport of the two. Such a document, however, should be scored higher than one that refers only to "Gatwick", since it provides more relevant information.
     Another factor is the number of footprints occurring: Gatwick's official web pages should be more important than a web list of all airports in the UK.
  25. For high-quality ranking two things are required. Firstly, we need a good spatial score between query and document footprints. Secondly, we need a good combination of the spatial and textual (BM25) scores.
     For finding spatial scores, the spatial relationships (distance, containment, and direction) were converted into numeric values that indicate how close, how much inside, or how much north-of the relationship between two objects is. Those numeric values were first attempts at obtaining a score to quantify spatial relationships.
     However, certain issues come up with this method. For example, assume three cities, A, B, and C, where A lies at equal distance (in a Euclidean sense) from B and C. If C is bigger than B, then the score of B being close to A should be lower than that of C being close to A. In other words, the distance scores of cities around A may depend on the context, i.e. on which other cities are around A.
     Natural barriers can also influence the concept of proximity. It matters a lot whether a distance of 10 km (as the crow flies) can be covered by a direct road, or requires a large detour around a mountain range (or a small road over a mountain pass).
  26. In traditional information retrieval, the separate scores of each document would be combined into a single score (e.g. by a weighted sum or product), which produces the ranked list by sorting.
     Now, we are going to incorporate two pieces of information into the way that a spatial document score is calculated:
     ● The number n of unique footprints in a document.
     ● The frequencies f_1, …, f_n of occurrence of the footprints in the document.
     Moreover, the total spatial score of a document will be derived from fractional score contributions of all occurring document footprints.
  27. A simple way of taking into account all document footprints is to define the total spatial score as a linear combination (e.g. the simple average) of the individual scores of the footprints:
        S = 1/n * (s_1 + … + s_n)
     where s_i is the score of the i-th document footprint with respect to the query footprint. Incorporating also the frequencies of occurrence f_i, let us define the weight of a footprint:
        tf_i = 1 + log(f_i)
     A footprint that occurs in the document only once gets a weight of one, and any extra occurrences increase the weight logarithmically. The total score may be calculated as
        S = 1/(tf_1 + … + tf_n) * (tf_1*s_1 + … + tf_n*s_n)
     that is, the weighted average of the individual scores.
  28. Considering again the example about "airports near London", a scoring function like the last one would score Gatwick's official web page higher than a web list of all UK airports. Moreover, it takes into account more than the best-matching document footprint. The last formula may serve as a starting point for improving the spatial scoring function.
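
The frequency-weighted average S = 1/(tf_1+…+tf_n) * (tf_1*s_1+…+tf_n*s_n) with tf_i = 1 + log(f_i) can be sketched as follows. The function names are assumptions, the natural logarithm is assumed because the slides do not name a base, and the per-footprint scores and frequencies in the example are invented to mimic the "airports near London" case.

```python
import math
from typing import Sequence


def footprint_weight(frequency: int) -> float:
    """tf_i = 1 + log(f_i); a footprint occurring once gets weight 1."""
    if frequency < 1:
        raise ValueError("a footprint must occur at least once")
    return 1.0 + math.log(frequency)  # natural log assumed


def spatial_score(scores: Sequence[float], frequencies: Sequence[int]) -> float:
    """Weighted average of the individual footprint scores s_i."""
    if not scores or len(scores) != len(frequencies):
        raise ValueError("need one frequency per footprint score")
    weights = [footprint_weight(f) for f in frequencies]
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return weighted_sum / sum(weights)


# Hypothetical page about Gatwick only: one footprint, mentioned ten times.
gatwick_page = spatial_score([0.9], [10])

# Hypothetical list of UK airports: many footprints, most of them far from London.
airport_list = spatial_score([0.9, 0.8, 0.1, 0.1], [1, 1, 1, 1])
```

With these made-up numbers the single-footprint page scores higher than the list page, because the list's distant airports pull its average down.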
  29. Evaluation
     Two indicators:
     1) Recall = No. of relevant docs returned / Total no. of relevant docs
     2) Precision = No. of relevant docs returned / Total no. of docs returned
     TREC has been evaluated using the ISO 9241 standard, based on effectiveness (can users find relevant docs?), efficiency (resources consumed per result) and satisfaction (user feedback).
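
A short sketch of the two indicators, using the standard IR definitions: recall is the share of all relevant documents that were returned, and precision is the share of returned documents that are relevant. The function name and the document identifiers are made up for illustration.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents that were returned
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three documents relevant overall.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.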
  30. Gazetteer Server and Service for UK Academia (James Reid)
     Gazetteer: a geographical dictionary or directory. It serves as a reference for information about places.
     ● Geographic searching is a powerful information retrieval tool, because the results obtained are more specific.
     ● Geographic searching is restricted because creating geographic metadata is very resource intensive, and where resources do have geographic metadata, it often consists only of place names.
     ● The geographic footprint is usually not stated directly; there may only be a direct or indirect reference to the place.
     Constant change in geographic metadata:
     ● Names of places may vary.
     ● Names may have changed over time.
     ● Boundaries can be fuzzy.
     ● Names may be used only in a particular context.
  31. GeoXwalk is a comprehensive gazetteer linking a vocabulary of current and historical geographical names to a standard spatial coding scheme (longitude, latitude).
     Technically, GeoXwalk has three components:
     ● A gazetteer database to support spatial searches.
     ● Middleware components to issue spatial/aspatial queries.
     ● A geo-parser to parse documents that are not geographically indexed for place names that can serve as a geographic reference.
  32. Gazetteer database
     Each geographical feature must include:
     ● Feature name.
     ● Feature type.
     ● Geometry (spatial footprint).
     Places can be marked out better using polygons than points. Explicit relationships can be defined, which is of particular use when the gazetteer holds a significant amount of historical data for which geometries do not exist.
     Middleware components
     Protocols supported by GeoXwalk are:
     ● ADL Gazetteer protocol.
     ● OGC Filter Encoding implementation.
     These are used to translate XML queries into database-specific SQL queries.
  33. GeoParser
     Most existing data and metadata have some sort of geo-reference, but not in a format that allows them to be spatially searched easily.
     One associated task is how documents without a spatial reference could be spatially indexed. This could be done using a gazetteer as a reference.
     A prototype geo-parser has been implemented that semi-automatically identifies place names in a document and extracts a suitable spatial footprint. Its rule-based approach takes into account the structure and context in which words occur.
     One issue faced by GeoXwalk is map conflation, i.e. detecting duplicate entries, such as a place that is named differently in different languages but has the same geographic footprint.
  34. Related Projects: GeoVSM
     Geographic Vector Space Model: the project integrates coordinate-based geographic indexing with the keyword-based vector space model in representing information space. Relevance measures are based on both geographic measures and thematic measures, which can be combined into one single measure system.
     Vector Space Model: one of the most popular models of document space developed in text-based information retrieval research. It is an algebraic model for representing text or graphical documents (and any objects, in general) as vectors of identifiers.
     Using a vector space model, the content of each geographic document can be approximately described by a vector of (content-bearing) terms, which are a combination of thematic subjects and place names.
     ● Documents and queries are represented as vectors. Each dimension corresponds to a separate term. An information retrieval system stores a representation of a document collection using a document-by-term matrix, where the element at position (i, j) corresponds to the frequency of occurrence of term j in the i-th document. In the vector space model, all the objects (terms, documents, queries, concepts, etc.) can be similarly represented as vectors.
     ● The vector space model is well accepted as an effective approach in modelling thematic
  35. However, the vector space model has some serious problems when used for modeling the geographic subspace. The geographic space is inherently continuous and cannot be adequately approximated using a set of place names (which are discrete in nature). If a document mentions four place names (Pittsburgh, Philadelphia, Harrisburg, and Hagerstown), they will be treated as four independent dimensions in a vector space model, whereas in fact they are points (or regions) in a two-dimensional geographic space.
     Additional concerns about using locational terms as geographic indexes include ambiguity in meaning, non-unique place names, place names that change over time, and spelling variations.
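
The document-by-term matrix and the independence problem can be sketched in a few lines. The vocabulary, the toy documents and the function names are assumptions for illustration, and whitespace tokenisation stands in for real text processing. Cosine similarity is used as the vector comparison; the slides do not prescribe a particular measure.

```python
import math
from collections import Counter
from typing import List, Sequence


def term_vector(text: str, vocabulary: Sequence[str]) -> List[int]:
    """One row of the document-by-term matrix: frequency of each vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocabulary = ["flood", "bridge", "pittsburgh", "harrisburg"]
documents = ["flood pittsburgh flood", "flood harrisburg", "bridge pittsburgh"]

# matrix[i][j] is the frequency of term j in document i.
matrix = [term_vector(doc, vocabulary) for doc in documents]

thematic_and_place = cosine(matrix[0], matrix[1])

# Restricting to the two place-name columns: the names share no dimension,
# so their similarity is zero no matter how close the two cities are.
place_only = cosine(matrix[0][2:], matrix[1][2:])
```

The zero in `place_only` is the point of the slide: the model sees Pittsburgh and Harrisburg as unrelated terms, not as nearby points in geographic space.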
  36. Geographical Model
     ● The geographical model of document space is capable of processing arbitrarily complex spatial queries.
     ● The most common spatial queries are believed to be of three types:
        1. Point query: return the geometric object that contains a given query point.
        2. Region query: given a region R, find all objects in the collection that intersect R.
        3. Buffer zone query: involves two spatial data sets and a distance d. The answer to this query is the set of pairs of objects, one from each input set, that are within distance d of each other, e.g. "find house-power line pairs that are within 50 meters of each other."
     ● Spatial indexing based on coordinates generates persistent indexes for documents, since it is well defined and is immune to changes in place names, political boundaries, and linguistic variations.
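
A rough sketch of the region and buffer zone queries under simplifying assumptions: document footprints are axis-aligned bounding boxes, the buffer query works on point objects in planar coordinates measured in meters, and both queries scan every object instead of using a spatial index. All names and sample data are invented.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Point = Tuple[float, float]


def boxes_intersect(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def region_query(region: Box, footprints: Dict[str, Box]) -> List[str]:
    """Find all objects whose footprint intersects the region R."""
    return [name for name, box in footprints.items() if boxes_intersect(region, box)]


def buffer_query(set_a: Dict[str, Point], set_b: Dict[str, Point], d: float) -> List[Tuple[str, str]]:
    """Pairs (one object from each set) that lie within distance d of each other."""
    return [
        (name_a, name_b)
        for name_a, point_a in set_a.items()
        for name_b, point_b in set_b.items()
        if math.dist(point_a, point_b) <= d
    ]


footprints = {"doc_1": (0.0, 0.0, 2.0, 2.0), "doc_2": (5.0, 5.0, 6.0, 6.0)}
in_region = region_query((1.0, 1.0, 3.0, 3.0), footprints)

houses = {"house_1": (0.0, 0.0), "house_2": (100.0, 0.0)}
power_lines = {"line_A": (30.0, 30.0)}  # a line reduced to one point for brevity
close_pairs = buffer_query(houses, power_lines, 50.0)
```

A real system would measure distance to line and polygon geometries and would use a spatial index such as an R-tree to avoid comparing every pair.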
  37. VSM / Geographical model (contd.)
     ● Disadvantage of using the geographical model for retrieving geographical information: a considerable amount of geographical information exists in textual form and is not easily integrated into the geographical model for mapping and spatial analysis, because natural language understanding for geo-referencing text is difficult.
  38. GeoVSM
     ● A model obtained by combining the advantages of both the geographical model and the vector space model.
     ● Each document is indexed both by a footprint (in geographical coordinate space) and by a term vector (in vector space).
     ● Geographical indexes represent only the geographical scope of a document, and term vectors represent only its thematic scope.
  39. Assume that any document has a limited geographic scope, GS_d, and a thematic scope, TS_d. Similarly, a query on a document collection has a geographic scope, GS_q, and a thematic scope, TS_q. The degree of relevance of a document to a query can be determined by the following measure:
        Rel(d, q) = ƒ(SimG(GS_d, GS_q), SimT(TS_d, TS_q))    (1)
     where SimG(•) measures the similarity (i.e. the degree of overlap) between the geographic scopes of the document and the query; SimT(•) measures the degree of overlap between the thematic scopes of the document and the query; and ƒ(•) is a function for combining the relevance measures of the geographic and thematic dimensions.
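
One possible instantiation of measure (1), given only as a sketch. The slides leave SimG, SimT and ƒ open, so every concrete choice below is an assumption: geographic scopes are bounding boxes compared by intersection-over-union, thematic scopes are term-count dictionaries compared by cosine similarity, and ƒ is a weighted sum with a hypothetical `geo_weight` parameter.

```python
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def box_area(box: Box) -> float:
    """Area of a box; zero if the box is empty (max below min)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def sim_g(gs_d: Box, gs_q: Box) -> float:
    """Assumed SimG: area of overlap divided by area of union."""
    overlap = (max(gs_d[0], gs_q[0]), max(gs_d[1], gs_q[1]),
               min(gs_d[2], gs_q[2]), min(gs_d[3], gs_q[3]))
    overlap_area = box_area(overlap)
    union_area = box_area(gs_d) + box_area(gs_q) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0


def sim_t(ts_d: Dict[str, float], ts_q: Dict[str, float]) -> float:
    """Assumed SimT: cosine similarity of sparse term vectors."""
    dot = sum(weight * ts_q.get(term, 0.0) for term, weight in ts_d.items())
    norm_d = math.sqrt(sum(w * w for w in ts_d.values()))
    norm_q = math.sqrt(sum(w * w for w in ts_q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0


def rel(gs_d: Box, ts_d: Dict[str, float], gs_q: Box, ts_q: Dict[str, float],
        geo_weight: float = 0.5) -> float:
    """Assumed f: weighted sum of the geographic and thematic similarities."""
    return geo_weight * sim_g(gs_d, gs_q) + (1.0 - geo_weight) * sim_t(ts_d, ts_q)


score = rel(
    gs_d=(0.0, 0.0, 2.0, 2.0), ts_d={"flood": 1.0},
    gs_q=(1.0, 1.0, 3.0, 3.0), ts_q={"flood": 1.0},
)
```

Keeping SimG and SimT separate until the final step reflects the GeoVSM idea that geographic and thematic scope are indexed independently and only combined at ranking time.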
  40. References
     ● Guoray Cai. GeoVSM: An Integrated Retrieval Model for Geographic Information. School of Information Sciences and Technology, The Pennsylvania State University, 002K Thomas Building, University Park, PA 16802.
     ● Marc van Kreveld, Iris Reinbacher, Avi Arampatzis, Roelof van Zwol. Distributed Ranking Methods for Geographic Information Retrieval.