
The World of Geocoding and Challenges in India



The World of Geocoding and Challenges in India describes the fundamentals of geocoding, the addressing system in India, and the challenges facing a geocoder for India.

Published in: Technology

The World of Geocoding and Challenges in India

  1. The World of Geocoding and challenges in India Dr. Nishant Sinha
  2. Discussion • Introduction to Geocoding • All about spatial data and real world data • Addressing system of developing nations • Addressing system in India • Data sources and standardization • Data arrangement • Geocoding • Challenges • Steps to overcome challenges • Q & A
  3. Introduction to Geocoding
  4. What is Geocoding ? Geocoding is the process of transforming a description of a location (such as a pair of coordinates, an address, or a name of a place) to a location on the earth's surface. Published definitions and their possible problems:
     Source | Definition | Possible Problems
     Environmental Systems Research Institute (1999) | The process of matching tabular data that contains location information such as street addresses with real-world coordinates. | Limited to coordinate output only.
     Harvard University (2008) | The assignment of a numeric code to a geographical location. | Limited to numeric code output only.
     Statistics Canada (2008) | The process of assigning geographic identifiers (codes) to map features and data records. | Limited input range.
     U.S. Environmental Protection Agency (2008) | The process of assigning latitude and longitude to a point, based on street addresses, city, state and USPS ZIP Code. | Limited to coordinate output only.
  5. What is Geocoding ? Geocoding (verb) is the act of transforming aspatial locationally descriptive text into a valid spatial representation using a predefined process. A geocoder (noun) is a set of inter-related components in the form of operations, algorithms, and data sources that work together to produce a spatial representation for descriptive locational references. A geocode (noun) is a spatial representation of a descriptive locational reference. To geocode (verb) is to perform the process of geocoding.
  6. Geocoding ……… Some points to ponder……. • Does geocoding refer to a specific computational process of transforming something into something else, or simply the concept of a transformation? • Is a geocode a real-world object, simply an attribute of something else, or the process itself? • Is a geocoder the computer program that performs calculations, a single component of the process, or the human who makes the decisions?
  7. Spatial real world data
  8. What is there in Spatial data…. Location information • The fundamental primitive is the point, a 0-dimensional (0-D) object that has a position in space but no length. • A line is a 1-D geographic object having a length and is composed of two or more 0-D point objects. • A polygon is a geographic object bounded by at least three 1-D line objects or segments with the requirement that they must start and end at the same location (i.e., node)
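The three primitives on this slide can be sketched as small data types. This is only an illustrative sketch: the class names `Point`, `Line`, and `Polygon` and their fields are assumptions made for the example, not part of the presentation or of any GIS library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    """0-D primitive: a position in space with no length."""
    x: float
    y: float


@dataclass(frozen=True)
class Line:
    """1-D object composed of two or more points."""
    points: Tuple[Point, ...]

    def __post_init__(self):
        if len(self.points) < 2:
            raise ValueError("a line needs at least two points")


@dataclass(frozen=True)
class Polygon:
    """Closed ring of at least three segments that starts and ends at the same node."""
    ring: Tuple[Point, ...]

    def __post_init__(self):
        # Three segments need four vertices once the closing node is repeated.
        if len(self.ring) < 4:
            raise ValueError("a polygon needs at least three segments")
        if self.ring[0] != self.ring[-1]:
            raise ValueError("a polygon ring must start and end at the same node")
```

A triangle is then `Polygon((Point(0, 0), Point(1, 0), Point(0, 1), Point(0, 0)))`; leaving off the repeated closing node raises a `ValueError`.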
  9. Spatial Data • Spatial data are often referred to as layers • Layers represent features on, above, or below the surface of the earth • Data layers are of 2 major types • Vector data represent features as discrete points, lines, and polygons. • Raster data represent the landscape as a rectangular matrix of square cells. [Figure: The Real World, Vector Data, Raster Data]
  10. Spatial Data • Vector • TAB files • Shapefiles • CAD (AutoCAD DXF & DWG) • Raster • Grids • Images • Digital Elevation Models (DEMs)
  11. Geocoding and Spatial Data • Most geocoding applications work with vector-based GIS data. • The key capabilities from a geocoding perspective are to: • Determine and record the locations of these objects on the surface of the Earth, and • Calculate distance, because many geocoding algorithms rely on one or more forms of linear interpolation.
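The two operations named on this slide, distance calculation and linear interpolation, might look roughly like the sketch below. The slide does not say which distance formula is meant; the haversine formula is used here as one common choice. The function names, the `(lat, lon)` tuple layout, and the mean Earth radius constant are assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def interpolate_house(number, range_start, range_end, start, end):
    """Place a house number proportionally between the two ends of a street segment.

    start and end are (lat, lon) tuples; range_start and range_end are the house
    numbers assigned to those ends. Interpolating raw coordinates is only a
    reasonable approximation for short segments.
    """
    if range_start == range_end:
        fraction = 0.5  # no usable range: fall back to the segment midpoint
    else:
        fraction = (number - range_start) / (range_end - range_start)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp numbers outside the range
    lat = start[0] + fraction * (end[0] - start[0])
    lon = start[1] + fraction * (end[1] - start[1])
    return lat, lon
```

Note that the interpolation assumes houses are spread evenly along the segment, which is exactly the assumption listed as an interpolation error in the failure classes later in the deck.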
  12. Addressing system of developing nations
  13. Addressing System • Addresses are one of the fundamental means by which people conceptualize location in the modern world • Addresses are of 2 Types • Relative input data • Examples of these types of data include “Across the street from Togo’s” and “The northeast corner of Vermont Avenue and 36th Place.” • Absolute input data • Example House no-xxxx, ABC Street, JJJJJ Locality, XYZ City, QQ State, DEF Country, ###### Postal Code
  14. Addresses • Come in a variety of formats • Address Components • Address (house or building) number • Prefix direction • Prefix type • Street name • Street type • Suffix direction • Zone
  15. What is street addressing? • Street addressing is an exercise that makes it possible to identify the location of a plot or dwelling on the ground, that is, to “assign an address” using a system of maps and signs that give the numbers or names of streets and buildings • Street addressing provides an opportunity to • Create a map of the city that can be used by different municipal units • Conduct a systematic survey that collects a significant amount of information about the city and its population, and • Set up a database on the built environment
  16. History of street addressing across some countries • Before 1728, no street names were indicated in Paris, except in very rare cases, such as “Rue Saint-Dominique, formerly des Vaches” (1643) • In Belgrade, changes to street names are frequent • In China, references to geographical direction appear in the name of the country itself (country of the center) and in the names of the capitals, Beijing (capital of the North) and Nanjing (capital of the South). Names of the provinces are also strongly influenced by these references to geographical direction • Aside from a few main thoroughfares, streets in Japan do not have names. In fact, the city districts (ku) are divided into neighborhoods (chome) that group together several dozen houses and thus form a block.
  17. Street Addressing System Some common Practice • Sequential alternating numbering systems • Decametric numbering system • Codification of intersections • Combined addressing system
      Segment | Distance (m) | No. left side | No. right side
      1 | 0–10 | 1 | 2
      2 | 10–20 | 3 | 4
      3 | 20–30 | 5 | 6
      4 | 30–40 | 7 | 8
      [Figures: Sequential alternating numbering systems; Codification of intersections; Combined addressing system]
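The numbering rule in the segment table on this slide (odd numbers on the left, even on the right, advancing by one pair every 10 m) can be sketched as a small function. The function name, the `side` values, and the treatment of a distance that falls exactly on a segment boundary (it is assigned to the next segment) are assumptions made for the example.

```python
def house_number(distance_m, side, segment_length_m=10.0):
    """Return the number for a plot at distance_m from the start of the street.

    Each segment of segment_length_m gets one odd number on the left side
    and the following even number on the right side.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    segment = int(distance_m // segment_length_m) + 1  # 1-based segment index
    return 2 * segment - 1 if side == "left" else 2 * segment
```

For instance, a plot 25 m along the street on the left side falls in segment 3 and gets number 5, which agrees with the table.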
  18. Addressing system in India
  19. Addressing System in India • Spatial data capturing based on demand rather than homogeneous capture • Streets with no names or unstructured addresses • Absence of consistent and accurate dataset throughout the area being geocoded • Presence of slum-like areas that change frequently and are not street addressable • Non-existence of reference datasets or GIS data infrastructure • Lack of hierarchical data structure beyond tehsil • Absence of standardized geocoding algorithm • Non-existence of approach to validate assumptions made in the geocoding algorithm
  20. Sample Addresses varieties of India
      Address with Person Name
      • c/o Yashwant S.Prabhu , 318, C - Wing, Suyog Co.Housing Society Ltd, T. P.S. Road & III Link Road, Vazira, Borivali, West Mumbai, Maharashtra, 400092
      • c/o Late Esmail Bagani, Y/2/122, Satghara Road, PO- Badartala, PS – Nadial, Kolkata, West Bengal, 700044
      Address with Building names
      • White C/403, Aamrpali Appt, opp. GHB complex, Ankur Road, Ahmedabad, Gujarat, 380013
      • 13/9, Daksha Bldg, Vallabh Baug Lane, Ghatkopar, Mumbai, Maharashtra, 400077
      Address with House No
      • 299/15, Padmavati Vikar Mandal, Shahibaug, Ahmedabad, Gujarat, 380001
      • NO88, Srinivasa Nagar, 2NS Main Road, Kolathur, Chennai, Tamil Nadu 600099
      Address with Street Name
      • 1304, Cornation Road, Bargarpet, Kolar, Bangalore, Karnataka, 560000
      • 20K, Dhakuria Station Road, Dhakuria PS: Jadavpur, Kolkata, West Bengal, 700031
      Address with POI
      • BMC Software, Next Muttha Chamber, Senapati Bapat Road, Pune, Maharashtra, 411016
      • Life Style International Pvt. Ltd., Near Payal Cinema Complex, Gurgaon, Haryana, 122001
  21. India Dynamics • 35 States & UTs, ~650 Districts, ~6000 Tehsil/Town, more than 30k postcodes • 21 regional, two official (Hindi & English) languages • 3.2m sq km area, 1.2 billion population • Administrative Hierarchy • State >> District >> Tehsil/Town >> Ward >> Locality/Village >> Sub locality >> Block/Pocket • Addressing Pattern
      • Near Govt. Hopital, Zirapur , District Rajgarh (MP)
      • 351 Ground Floor , Shakti khand 3 , Near St. Teresa School , Indirapuram, Ghaziabad 201010 U.P.
      • 9 Mansarovar Colony Opp. 3/686, Kala kuan Housing Board Alwar 301001
      • 9/19/98/19-D Flat No. # 303 Hitech City Madhapur Hyderabad
      • 176 Devi Nagar New Sanganer Road Sodala , Jaipur Rajasthan
      [Figure: Indian Postal Code]
  22. The History of Postal System in India • Long colonial realm – British, Mughal, Portuguese……. • Britain's involvement in the postal services of India began in the eighteenth century • Warren Hastings (Governor General of British India from 1773-1784) opened the posts to the public in March 1774 • Main purpose of the postal system had been to serve the commercial interests of the East India Company and to serve Govt. orders
  23. The History of Postal System in India
  24. The History of Postal System in India
  25. Data sources and standardization
  26. India Geocoding Data • Streets: NH (National Highways), SH (State Highways), local roads • Places of interest: banks, retail, hospitals, other landmarks • Administrative: state, district, town, locality, sub locality • Geography: block, locality, town, postcode • Address point: house no., building name
  27. Reference datasets • The reference dataset is the underlying geographic database containing geographic features that the geocoder can use to generate a geographic output. • This dataset stores all of the information the geocoder knows about the world and provides the base data from which the geocoder calculates, derives, or obtains geocodes. [Diagram label: Interpolation algorithms]
      Type | Example
      Vector line file | U.S. Census Bureau’s TIGER/Line (United States Census Bureau 2008c)
      Vector polygon file | Los Angeles (LA) County Assessor Parcel Data (Los Angeles County Assessor 2008)
      Vector point file | Australian Geocoded National Address File (G-NAF) (Paull 2003)
  28. Reference dataset types • Linear-Based Reference Datasets • Roads, Ferries • Polygon-Based Reference Datasets • Administrative Boundaries, Postal Codes • Point-Based Reference Datasets • POIs
      Source | Description | Coverage | Cost
      Tele Atlas (2008c), NAVTEQ (2008) | Building footprints, parcel footprints | Worldwide, but sparse | Expensive
      County or municipal Assessors | Building footprints, parcel footprints | U.S., but sparse | Relatively inexpensive but varies
      U.S. Census Bureau | Census Block Groups, Census Tracts, ZCTA, MCD, MSA, Counties, States | U.S. | Free

      Name | Description | Coverage
      U.S. Census Bureau’s TIGER/Line files (United States Census Bureau 2008c) | Street centerlines | U.S.
      NAVTEQ Streets (NAVTEQ 2008) | Street centerlines | Worldwide
      Tele Atlas Dynamap, MultiNet (Tele Atlas 2008a, c) | Street centerlines | Worldwide

      Supplier | Product | Description | Coverage
      Government | GeoNames (United States National Geospatial-Intelligence Agency 2008) | Gazetteer of geographic features | World, excepting U.S.
      Academia | Alexandria Digital Library (2008) | Gazetteer of geographic features | World
  29. The geocoding algorithm • The geocoding algorithm performs two basic tasks • Feature matching • Feature interpolation
  30. Input data processing Address normalization • Address normalization organizes and cleans input data to increase its efficiency for use and sharing Address standardization • Address standardization converts an address from one normalized format into another. It is closely linked to normalization and is heavily influenced by the performance of the normalization process.
      Sample Address
      • 3620 South Vermont Avenue, Unit 444, Los Angeles, CA 90089-0255
      • 3620 S Vermont Ave, #444, Los Angeles, CA 90089-0255
      • 3620 S Vermont Ave, 444, Los Angeles, 90089-0255
      • 3620 Vermont, Los Angeles, CA 90089
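A token-substitution pass is one simple way to get from the first sample address on this slide to the second. The sketch below is a deliberately naive illustration: the `ABBREVIATIONS` table, the list of unit keywords, and the function name `normalize` are assumptions made for the example, and a real normalizer would need context (for instance, to avoid abbreviating "South" inside a city name).

```python
import re

# Hypothetical substitution table; a production table would be far larger.
ABBREVIATIONS = {
    "south": "S", "north": "N", "east": "E", "west": "W",
    "avenue": "Ave", "street": "St", "road": "Rd",
}

# Matches a whole address part such as "Unit 444" or "Apt 12B".
UNIT_RE = re.compile(r"^(?:unit|apt|suite)\s+(\w+)$", re.IGNORECASE)


def normalize(address):
    """Abbreviate directionals and street types and rewrite unit designators."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    cleaned = []
    for part in parts:
        part = UNIT_RE.sub(r"#\1", part)  # "Unit 444" -> "#444"
        words = [ABBREVIATIONS.get(w.lower(), w) for w in part.split()]
        cleaned.append(" ".join(words))
    return ", ".join(cleaned)
```

Calling `normalize` on "3620 South Vermont Avenue, Unit 444, Los Angeles, CA 90089-0255" produces the second sample form, "3620 S Vermont Ave, #444, Los Angeles, CA 90089-0255".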
  31. Output data • The last component of the geocoder is the actual output data, which are the valid spatial representations derived from features in the reference dataset. • Data can have many different forms and formats, but each must contain some type of valid spatial attribute. • The most common format of output is points described with geographic coordinates (latitude, longitude). • Alternate forms can include multi-point representations such as polylines or polygons. • These geocoder outputs, while in the same format and produced through the same process, do not represent data at the same geographic resolution and must be differentiated.
  32. Feature matching - Algorithm • The matching algorithms are • Noninteractive matching algorithms (i.e., they are automated and the user is not directly involved). • Interactive matching algorithms • Classifications of matching algorithms • Two main categories: • Deterministic • Probabilistic
  33. Deterministic matching • Ease of implementation • These algorithms are created by defining a series of rules and a sequential order in which they should be applied, for example: “Match all attributes of the input address to the corresponding attributes of the reference feature.” • Attribute relaxation • Attribute relaxation, the process of easing the requirement that all street address attributes must exactly match a feature in the reference data source to obtain a matching street feature, often is applied to create these less restrictive rules.
      Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of ambiguity and spatial error, and worst-case resolution, passes 1–4
      Relaxed Attribute | Ambiguity | Relative Exponent and Magnitude of Ambiguity | Relative Magnitude of Spatial Error | Worst-Case Resolution
      none | none | (0) none | certainty of address location | single address location
      number | multiple houses on single street | (0) # houses on street | length of street | single street
      pre | single house on multiple streets | (1) # streets with same name and different pre | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
      post | same as above | (1) # streets with same name and different post | same as above | same as above
      type | same as above | (1) # streets with same name and different type | same as above | same as above
      number, pre | multiple houses on multiple streets | (2) # houses on street * # streets with same name and different pre | bounding area of all streets with the same name |
      number, type | same as above | (2) # houses on street * # streets with same name and different type | same as above |
      number, post | same as above | (2) # houses on street * # streets with same name and different post | same as above |
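Rule-based matching with attribute relaxation can be sketched as a loop over progressively looser rules, following the relaxation order listed in the table on this slide. The attribute keys, the `REFERENCE` records, and the choice to accept only a unique candidate are assumptions made for the example, not the presenter's implementation.

```python
ATTRIBUTES = ("number", "pre", "name", "type", "post")

# Relaxation order taken from the table: nothing relaxed first, then single
# attributes, then pairs that include the house number.
RELAXATION_ORDER = [
    (), ("number",), ("pre",), ("post",), ("type",),
    ("number", "pre"), ("number", "type"), ("number", "post"),
]

# Hypothetical reference features for illustration.
REFERENCE = [
    {"id": 1, "number": "3620", "pre": "S", "name": "Vermont", "type": "Ave", "post": ""},
    {"id": 2, "number": "3620", "pre": "N", "name": "Vermont", "type": "Ave", "post": ""},
    {"id": 3, "number": "100", "pre": "S", "name": "Hoover", "type": "St", "post": ""},
]


def match(address, reference=REFERENCE):
    """Return (feature, relaxed_attributes) for the first rule with a unique match."""
    for relaxed in RELAXATION_ORDER:
        required = [a for a in ATTRIBUTES if a not in relaxed]
        candidates = [
            f for f in reference
            if all(f.get(a) == address.get(a) for a in required)
        ]
        if len(candidates) == 1:  # ambiguous or empty results fall through
            return candidates[0], relaxed
    return None, None
```

An input with the wrong street type ("3620 S Vermont St") matches feature 1 once `type` is relaxed, while an input with no prefix direction stays unmatched because relaxing `pre` leaves two candidates, which is the "single house on multiple streets" ambiguity from the table.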
  34. Probabilistic matching • Probabilistic matching has its roots in the fields of probability and decision theory • Employed in geocoding processes since the outset (e.g., O’Reagan and Saalfeld 1987, Jaro 1989). • The exact implementation details can be quite messy and mathematically complicated, but the concept in general is quite simple.
  35. Attribute weighting • Attribute weighting is a form of probabilistic feature matching in which probability based values are associated with each attribute, and either subtract from or add to the composite score for the feature as a whole.
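Attribute weighting as described on this slide can be sketched as a composite score in which each attribute adds its weight on agreement and subtracts it on disagreement. The weight values, the acceptance threshold, and the function names are hypothetical; in practice the weights would be derived from the data rather than chosen by hand.

```python
# Hypothetical weights: the street name counts most, the modifiers least.
WEIGHTS = {"number": 0.2, "pre": 0.1, "name": 0.5, "type": 0.1, "post": 0.1}


def score(address, feature, weights=WEIGHTS):
    """Composite score: add the weight on agreement, subtract it on disagreement."""
    total = 0.0
    for attribute, weight in weights.items():
        if address.get(attribute) == feature.get(attribute):
            total += weight
        else:
            total -= weight
    return total


def best_match(address, reference, threshold=0.5):
    """Return (feature, score) for the top-scoring feature, or (None, score) below threshold."""
    if not reference:
        return None, 0.0
    top = max(reference, key=lambda feature: score(address, feature))
    top_score = score(address, top)
    return (top if top_score >= threshold else None), top_score
```

Unlike the deterministic rules, a single wrong attribute here only lowers the score, so a candidate that disagrees on the street type alone can still clear the threshold.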
  36. String comparison algorithms • Character-level equivalence • Essence-level equivalence: this allows minor misspellings in the input address to be handled, returning reference features that “closely match” what the input may have “intended.” • Word stemming • Phonetic algorithms, such as the Soundex algorithm
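The slide does not name a specific technique for tolerating minor misspellings. One common choice, used here purely as an illustration, is the Levenshtein edit distance with a small cutoff. The function names and the cutoff of two edits are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def closely_matches(a, b, max_edits=2):
    """Case-insensitive comparison that tolerates a few character-level edits."""
    return edit_distance(a.lower(), b.lower()) <= max_edits
```

With this cutoff, an input street name such as "Cornation" would be treated as a close match for a reference feature named "Coronation", since the two differ by one inserted letter.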
  37. Soundex Algorithm • It has existed since the late 1800s and originally was used by the U.S. Census Bureau. • The algorithm is very simple and consists of the following steps: • Keep the first letter of the string • Remove all vowels and the letters y, h, and w, unless they are the first letter • Replace all letters after the first with numbers based on a known table • Remove any numbers which are repeated in a row • Return the first four characters, padded on the right with zeros if there are fewer than four
      Original | Porter Stemmed | Soundex
      Running Ridge | run ridg | R552 R320
      Runs Ridge | run ridg | R520 R320
      Hawthorne Street | hawthorn street | H650 S363
      Heatherann Street | heatherann street | H650 S363
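A minimal sketch of Soundex follows. It assumes the common American Soundex variant, in which a vowel between two letters with the same code keeps both codes while h and w do not; that variant is what yields R552 for "Running" in the table above, whereas applying the listed steps strictly in order would collapse it to R520. The function name and the handling of non-letter characters are assumptions made for the example.

```python
# Standard Soundex letter-to-digit table.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeat across a vowel is kept
    return (first.upper() + "".join(digits) + "000")[:4]
```

A multi-word street name would be encoded word by word, for example `[soundex(w) for w in "Running Ridge".split()]`.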
  38. Challenges of India Geocoding
  39. Geocoding Challenges • Unavailability of geospatial data • Data, if available, is • Unstructured • Incomplete • Inaccurate • Lacking in precision • Without official names, or for regions that developed because of anthropogenic pressure • A “general” geocode is no longer sufficient; the “most accurate” geocode is required • Inconsistent use of base maps and geocoding services within and across programs and agencies [Diagram labels: Data inconsistency; Erroneous data; Frequent data change requests; Coverage; New and modified data in every vintage]
  40. Some Samples of real postal deliveries
  41. Classes of geocoding failures with an example of true address “Maulsari BnB, 142 Sunder Nagar, Near Delhi Public School, New Delhi, Delhi 110003, India”
      Class | Geocoded | Problem | Example
      1 | No | Failed to geocode because the input data are incorrect. | Maulsari BnB, 142 Sunder Nagar, New Delhi, 110003, India
      2 | No | Failed to geocode because the input data are incomplete. | Maulsari BnB, Sunder Nagar, Near Delhi Public School, New Delhi, 110003, India
      3 | No | Failed to geocode because the reference data are incorrect. | 140 Sunder Nagar, New Delhi, Delhi 110003, India
      4 | No | Failed to geocode because the reference data are incomplete. | Street segment does not exist in reference data
      5 | No | Failed to geocode because the reference data are temporally incompatible. | Street segment name has not been updated in the reference data
      6 | No | Failed to geocode because of combination of one or more of 1-5. | Maulsari BnB, 142 Sunder Nagar, Near Delhi Public School, New Delhi, Delhi 110003, India where the reference data has not been updated to include Near Delhi Public School, New Delhi, Delhi address segment
      7 | Yes | Geocoded to incorrect location because the input data are incorrect. | Maulsari BnB, 142 Sunder Nagar, Near Delhi Public School, New Delhi, Delhi 110003, India was (incorrectly) relaxed and matched to Delhi Public School, New Delhi, 110003, India
      8 | Yes | Geocoded to incorrect location because the input data are incomplete. | Maulsari BnB, 142 Sunder Nagar, Near Delhi Public School, New Delhi, Delhi 110003, India was arbitrarily (incorrectly) assigned to 145 Sunder Nagar, New Delhi, Delhi 110003, India
      9 | Yes | Geocoded to incorrect location because the reference data are incorrect. | Sunder Nagar, New Delhi, Delhi 110003, India
      10 | Yes | Geocoded to incorrect location because the reference data are incomplete. | Street segment geometry is generalized straight line when the real street is extremely curvy
      11 | Yes | Geocoded to incorrect location because of interpolation error. | Interpolation (incorrectly) assumes equal distribution of properties along street segment
      12 | Yes | Geocoded to incorrect location because of drop back error. | Drop back placement (incorrectly) assumes a constant distance and direction
      13 | Yes | Geocoded to incorrect location because of combination of one or more of 7-12. | The address range for 140-150 is reversed to 150-140 and dropback of length 0 is used
  42. Geocoding Impacts A good geocoder vs A Bad Geocoder
  43. Data Issues
  44. Steps to address challenges
  45. Overcome the challenge • Standardization of addresses by governing bodies • Integrating addressing data models from a variety of addressing systems to develop region-specific data models • Utilizing multiple data sources for data completeness • Inclusion of local landmarks in the geocoding process, as they form an integral part of the Indian address system • Embed locational awareness and intelligence within geocoding data models • Multilingual phonetic support • Large test set to address different kinds of geocoder irregularities • Validation methodology to confirm geocode results • Changes in map policies for consistent capturing • Homogeneous information dissemination by governing bodies in regard to spatial data • Standard protocols developed for input/reference data correction • Address normalization • Address standardization
  46. Features to address challenges • Sub locality Features • Inclusion of Local Landmarks • POI Level Geocoding • Bank Dictionaries • Focus on Local Geography and Social Settings
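Including local landmarks in geocoding starts with pulling phrases such as "Near ..." or "Opp. ..." out of the address text, as seen in the sample addresses earlier in the deck. The sketch below is one rough way to do that: the keyword list, the regular expression, and the function name are assumptions made for the example, and the pattern only reads up to the next comma.

```python
import re

# Relation keyword followed by the landmark text up to the next comma.
LANDMARK_RE = re.compile(
    r"\b(near|opp(?:osite)?\.?|next(?:\s+to)?|behind|beside)\s+([^,]+)",
    re.IGNORECASE,
)


def extract_landmarks(address):
    """Return (relation, landmark) pairs found in a free-text address."""
    results = []
    for found in LANDMARK_RE.finditer(address):
        relation = re.sub(r"\s+", " ", found.group(1).lower().rstrip("."))
        if relation.startswith("opp"):
            relation = "opposite"  # fold "opp", "opp." and "opposite" together
        results.append((relation, found.group(2).strip()))
    return results
```

The extracted landmark could then be looked up in a POI reference dataset and used as the anchor point when no street or house-number match is available.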
  47. ……. Questions?