Icc2013 country names

Published in: Technology
  • Forms of country names range from those in use by the countries themselves (endonyms), to externally used alternatives (exonyms), to various common abbreviations (e.g. USA) and codes (such as those in ISO 3166). Indexes are produced by a diversity of communities, including United Nations agencies, Non-Government Organisations (NGOs, such as humanitarian relief or environmental assessment groups) and commercial enterprises (postal agencies, distribution companies).
  • Each of these issues is experienced to differing degrees, with particular regions more affected than others. Utopia Way Inc. investigated 5577 csv files in the data.un.org dataset (UN Statistics Division’s Internet-accessible repository for data) to explore country name alignments and mismatches published by UN agencies in their datasets. In all, 21,195,188 rows of data were analysed. Data that was excluded from the investigation included:
    -footnotes at the end of each dataset;
    -rows beyond the 50,000-row limit that the UN interface places on downloads, which leaves 159 files in the set incomplete; and,
    -25 files published in multi-sheet Excel format.
    Indices and headers from all the datasets were collated into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections from the data.un.org geographical indices into both ISO 3166 and the United Nations Statistics Division’s list of region, country and economic group names, “Country and Region Codes for Statistical Use”.
    Geographical references in the headers are:
    -country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area.
    -OID.
    -WMO station number, station name, national station id number.
    -City.
    -Area, residence area, city type.
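The collation step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the `collate` function, the shape of its input (a mapping from file name to an open text stream) and the two sample files are all assumptions made for the example. Only the header names come from the list above. It does not handle the footnotes or truncated files mentioned earlier.

```python
import csv
import io
from collections import Counter

# Header names taken from the list above; matching is case-insensitive.
GEO_HEADERS = {
    "country of birth", "country of citizenship", "country or area",
    "country or territory", "country or territory of asylum or residence",
    "country or territory of origin", "reference area",
}

def collate(files):
    """Count geographic headers and the index values found under them.

    `files` maps a file name to an open text stream (hypothetical input shape).
    """
    headers = Counter()
    indices = Counter()
    for _name, stream in files.items():
        reader = csv.DictReader(stream)
        geo_cols = [h for h in (reader.fieldnames or [])
                    if h.strip().lower() in GEO_HEADERS]
        headers.update(geo_cols)
        for row in reader:
            for col in geo_cols:
                value = row.get(col)
                if value:
                    indices[value] += 1
    return headers, indices

# Two made-up files standing in for data.un.org downloads.
sample = {
    "a.csv": io.StringIO("Country or Area,Year,Value\nYemen,2010,1\nYEMEN,2011,2\n"),
    "b.csv": io.StringIO('Reference Area,Value\n"Yemen, Republic of",3\n'),
}
headers, indices = collate(sample)
```

The `indices` counter then holds every distinct spelling found under a geographic header, which is the raw material for a corrections list.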
  • Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). The placenames in the indices are a mix of country, region and economic group names, with different spellings and formats for similar names. For example, the following spellings can all be found for one country: “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of”.
    Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org.
    ISO3166 is a widely-used standard, but contains codes for countries and their subregions only (i.e. it has no official lists of larger regions or economic areas). It is published as tables online and is available (although without the list of withdrawn codes) in the Python library pycountry.
    The UNstats list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an html table (http://unstats.un.org/unsd/methods/m49/m49regin.htm). This table was scraped (the data copied from its html page) by hand for this research, but this process could be automated using e.g. ScraperWiki. There are two main lists in the UNstats table: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities, e.g. Netherlands Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states.
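As a rough idea of what automating the scrape could look like, the sketch below pulls rows out of an html table using only the Python standard library. It is an assumption-laden illustration: the `TableRows` class is invented here, and `SAMPLE` is a made-up fragment, not the real markup of the UNstats page, which would need to be fetched and inspected first.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so " Yemen " becomes "Yemen".
            text = " ".join("".join(self._cell).split())
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

# Hypothetical fragment; the real page layout is not reproduced here.
SAMPLE = (
    "<table>"
    "<tr><th>Numerical code</th><th>Country or area name</th></tr>"
    "<tr><td>887</td><td> Yemen </td></tr>"
    "<tr><td>002</td><td>Africa</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(SAMPLE)
rows = parser.rows
```

The resulting `rows` list can be written straight out with the `csv` module, which would give the kind of CSV file the notes say is currently missing.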
    Work by other groups (e.g. the World Wide Human Geography Data working group) has also translated data.un.org files into the FIPS 10-4 standard. This standard is common in US Government work; it includes codes for country names and administrative districts in each country, but does not include regions (e.g. Africa). It is similar, but not identical, to ISO3166.
     
     
     
    The indices were checked against both these standards.  Against the ISO3166 standard, common data.un.org csv index errors were:
  • Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether St. Helena refers to just the island of Saint Helena, or “Saint Helena, Ascension and Tristan da Cunha” and whether Palestine and Palestinian Territories refer to “Palestinian Territory, Occupied”. Other issues include whether Micronesia refers to the region (Micronesia) or country (Micronesia, Federated States of), and whether there should be separate codes for changing states, e.g. Ethiopia before and after 1993.
     
    The current correction files for UNSTATS and ISO3166 standards, along with a CSV file containing the UNSTATS standard codes from http://unstats.un.org/unsd/methods/m49/m49regin.htm, can be found at xxxxxx
  • From an international standards perspective there are multiple competing GRDs of country names published by various agencies including the UN and ISO. There are diverse reasons for the existence of the varieties, including different end-user requirements which predicate whether official endonyms are required for mapping purposes or country codes used for statistical purposes. A brief summary of the key GRDs is provided in table one to contextualise the current international GRD situation.
    As indicated, there are multiple official GRDs of country names published at the international level by the UN and other organisations. The existence of multiple datasets related to country name standardisation is analogous to the mismatched country name data held within UN datasets.
    Analysis of the key UN databases held in data.un.org has identified key matching, linking and interoperability issues currently experienced in the domain of GRDs which contain country names. These can be summarised as:
    -non-standardised use of country endonyms/exonyms by UN agencies;
    -mismatches between data instances in authoritative country name GRDs; and,
    -temporality of country name GRDs.
     
  • This paper explores two aspects of the propagation of country name SISets: development (cultural/qualitative) and usage (data management/quantitative). From a development perspective the fundamental question is why, when country names can be considered one of the highest-order administrative categories for geospatial organization, there is a proliferation of ‘official’ country name SIRDs. Within the domain of usage the authors query how, in a digital age of ‘big data’ analytics and Spatial Data Infrastructures (SDIs), newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) can assist in reducing the ambiguity associated with multiple, heterogeneous country name SIRDs.
  • Rob: essentially, Toponymic Attachment means that people have strong affinities with place names, for cultural, social, branding and wayfinding purposes. Because of this, people are very hesitant to stop using a name. In fact, it is nearly impossible to get them to stop using a name. Also, people will create new ‘nicknames’ for things so that they can create a ‘clique’ or communicate in a community with their own ‘special terms’. It’s all about creating and reinforcing identity.
    Thus, in a world of multiple names in multiple databases, there is a massive headache for data junkies. Data users want straightforward stuff, and usually the people who create standards want people to be using the same names in the same way all the time. But that’s not how the real world works.
    So, data junkies can try and tell people to use standardised names, and can create ISO lists etc etc etc. But because of human nature, the standardised lists will always have gaps and mistakes and won’t truly reflect usage on the ground. That accounts for some of the reason why there are multiple country name lists.
    And thus the reason why SIRF is awesome: instead of force-feeding people the standardised-name-line, it allows for a holistic view of naming which accounts for multiple representations, permutations, interpretations etc. It is, as I like to say, ‘ideologically promiscuous’.
  • Until now the preference of many agencies has been to homogenize geospatial information for ‘ease of use’ purposes, either through aggregating and de-duplicating existing SIDs or by disregarding competing information. SIRF is a system being developed by CSIRO using Linked Data mechanisms to support interoperability between heterogeneous geospatial information datasets and systems. SIRF harmonises disparate SIRDs through cross-walking and data linking methods, the benefits of which are outlined in detail by the authors. The framework system brings to the geospatial data management world, for the first time, the capability to streamline information integration processes whilst acknowledging the reality of multiple, competing SIRDs.
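To make the idea of cross-walking concrete, here is a toy sketch of a structure that keeps every source's label for a place and links the labels to one shared identifier, rather than collapsing them to a single "correct" name. It is not SIRF's data model or API: the `Crosswalk` class, the identifier `place/yemen` and the source keys are all hypothetical.

```python
from collections import defaultdict

class Crosswalk:
    """Link labels from different sources to one shared identifier."""

    def __init__(self):
        self._labels = defaultdict(set)   # identifier -> {(source, label)}
        self._lookup = {}                 # (source, label) -> identifier

    def link(self, identifier, source, label):
        self._labels[identifier].add((source, label))
        self._lookup[(source, label)] = identifier

    def resolve(self, source, label):
        """Return the shared identifier for a source's label, or None."""
        return self._lookup.get((source, label))

    def labels(self, identifier):
        """Return every (source, label) pair known for an identifier."""
        return sorted(self._labels.get(identifier, set()))

xwalk = Crosswalk()
# "place/yemen" is an invented identifier used only for this example.
xwalk.link("place/yemen", "iso3166", "Yemen")
xwalk.link("place/yemen", "data.un.org", "YEMEN")
xwalk.link("place/yemen", "data.un.org", "Yemen, Republic of")

same_place = xwalk.resolve("data.un.org", "YEMEN") == xwalk.resolve("iso3166", "Yemen")
```

Nothing is thrown away in this arrangement: each community keeps its own label, and integration happens through the link rather than through renaming.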
  • Data products linked in practice....
  • On the web you may not know the data product...
  • Transcript

    • 1. Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives. International Cartographic Conference, Dresden. Laura Kostanski | Sara-Jane Farmer | Rob Atkinson. August 2013. GOVERNMENT AND COMMERCIAL SERVICES THEME
    • 2. Today’s Presentation • Overview • Cultural Reasons for Multiple Country Names • Impact of Cultural Reasons • Multiple Country Name Datasets • Reconciling Information • Spatial Identifier Reference Framework (SIRF) Approach
    • 3. Overview • There are multiple country name datasets in use • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO • Multiple stakeholders in creation and use of data using these names • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups • Time spent accessing and reconciling data is costly and delays production of results from analysis • The same issues apply to most, perhaps all, identifiers of spatial objects • Preview of how we might tackle this problem
    • 4. Context CSIRO. UNSDI Gazetteer for Social Protection in Indonesia
    • 5. Data Analysis Utopia Way Inc. investigated files in the data.un.org dataset … and identified significant issues with country name alignments and mismatches. Country names were discovered in multiple fields, such as: • country of birth, • country of citizenship, • country or area, • country or territory, • country or territory of asylum or residence, • country or territory of origin, • reference area. An automated matching process was set up to explore the extent of the issue. In all, 21,195,188 rows of data were analysed.
    • 6. Common “Errors” (index error: examples)
      - Withdrawn countries with no ISO3166 code: “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
      - Abbreviation: “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
      - Added markers: “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
      - Capitalisation: “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
      - Brackets “()” or “[]” instead of commas: “Virgin Islands (British)” for “British Virgin Islands”.
      - Standards confusion: the ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
      - Use of familiar names: Brunei, Ivory Coast, China, Libya.
      - Issues with character translation: Cote d'Ivoire, Åland Islands, Curaçao, Réunion.
      - Misspellings: double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
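Several of the formatting errors on this slide (abbreviations, added markers, capitalisation, brackets, stray spaces) are mechanical enough to normalise before matching. The sketch below is one possible approach, not the automated matching process used in the study; the `normalise` function and its rules are assumptions built only from the examples on the slide. It deliberately leaves withdrawn countries, familiar names and genuine misspellings alone, since those need a lookup table rather than a regular expression.

```python
import re

# Abbreviations listed on the slide, expanded in this order.
ABBREVIATIONS = [
    (r"\bRep\.", "Republic"),
    (r"\bSt\.", "Saint"),
    (r"\bIsds\b\.?", "Islands"),
    (r"\bIs\.", "Island"),
    (r"\s*&\s*", " and "),
]

def normalise(name):
    """Return a comparison key for a placename (formatting fixes only)."""
    text = name.strip()
    text = re.sub(r"^MDG_", "", text)   # marker prefixed to region names
    text = re.sub(r"\+$", "", text)     # marker appended to region names
    for pattern, replacement in ABBREVIATIONS:
        text = re.sub(pattern, replacement, text)
    # Brackets become a comma-separated qualifier: "X (Y)" -> "X, Y".
    text = re.sub(r"\s*[\(\[]([^\)\]]*)[\)\]]", r", \1", text)
    text = re.sub(r"\s*,\s*", ", ", text)       # uniform comma spacing
    text = re.sub(r"\s+", " ", text).strip()    # double and trailing spaces
    return text.casefold()                      # ignore capitalisation

keys = {raw: normalise(raw) for raw in [
    "YEMEN", "Yemen", "MDG_Southern Asia", "Southern Asia+",
    "Virgin Islands (British)", "Christmas Is.(Aust)", "St.  Helena",
]}
```

Two raw names that produce the same key are candidates for the same entry. Names that still differ after normalisation, such as “Yemen, Republic” and “Yemen, Republic of”, would still need an explicit corrections list.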
    • 7. Long names, short names
    • 8. Data sets providing country names (organisation: name of data set)
      - United Nations Statistics Division: Country and Region Codes for Statistical Use
      - Working Group on Country Names, United Nations Group of Experts on Geographic Names: List of Country Names
      - NATO Standards Agreement (STANAG) 1059
      - Terminology Section, Department for General Assembly and Conference Management: Multilingual Terminology Database (UNTERM)
      - International Standards Organisation (ISO): ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
      - Food and Agriculture Organisation of the United Nations: Global Administrative Unit Layers (GAUL)
      - United Nations Geospatial Information Working Group (UNGIWG): Second Administrative Level Boundaries (SALB)
      - National Geospatial Intelligence Agency: Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
    • 9. Two Aspects of Country Name Datasets 1: Development of datasets Why is there a proliferation of country name sources? • Cultural issues • Development practices 2: Usage How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources? • Can we do better? What do we need to do it?
    • 10. Cultural Issues • Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced) • Country names are the highest-order toponyms • Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)
    • 11. Endonym/Exonym Above and beyond associations with an individual’s attachment to the Endonym of their country, there are often multiple Exonyms used by other languages. • e.g. Deutschland = Germany or Allemagne
    • 12. Other Cultural Country Naming Considerations • Formal/Informal naming applications (particularly prevalent in the social media world, e.g. ‘Oz’ for Australia) • Political/Non-Political Usage, e.g. ‘Commonwealth of Australia’ • Change over time, e.g. Czechoslovakia • Non-standardised international conventions, e.g. Saint or St? The or none?
    • 13. The Impact All of these cultural mores impact on the ability of people and organisations to record country name information in a standardised, transparent manner. Thus, there exists a proliferation of country name lists which are officially promoted by international agencies. This impact is then intensified in usage.
    • 14. Options Suggested improvements to the indices and standards include:
      1. Improve access to source data
         a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
         b. Make the UN’s economic status list available as a csv file online.
      2. Lobby to improve content
         a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
         b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
      3. Policy
         a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.
      4. Better citation mechanisms
         – Standardised metadata and identifiers that “resolve”, i.e. link back to data
         – Shared infrastructure to link all the information together
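Option 1a asks for assignment and withdrawal dates because a name can only be matched correctly if you know which codes were valid in the year the data refers to. A minimal sketch of that lookup follows. The `CodeAssignment` record and the `codes_valid_in` function are hypothetical, and the three rows and their years are illustrative placeholders that should be checked against the published standard before any real use.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CodeAssignment:
    code: str
    name: str
    assigned: int               # first year the code is treated as valid
    withdrawn: Optional[int]    # first year it is no longer valid; None if current

# Illustrative rows only; the years are placeholders, not authoritative values.
ASSIGNMENTS = [
    CodeAssignment("CS", "Czechoslovakia", 1974, 1993),
    CodeAssignment("CZ", "Czech Republic", 1993, None),
    CodeAssignment("SK", "Slovakia", 1993, None),
]

def codes_valid_in(year, assignments=ASSIGNMENTS):
    """Return the codes whose validity interval contains the given year."""
    return [
        a.code for a in assignments
        if a.assigned <= year and (a.withdrawn is None or year < a.withdrawn)
    ]

before = codes_valid_in(1990)
after = codes_valid_in(2000)
```

With a table like this, a row dated 1990 and labelled “Czechoslovakia” can be matched to a withdrawn code instead of being reported as an unknown country.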
    • 15. Spatial Identifier Reference Framework CSIRO has been working with stakeholders including UN, National agencies and others on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references. This is being presented in more detail in: 6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633) Paul Box 1, Robert Atkinson 1, Laura Kostanski 2 S6-D - SDI Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room: Conference Level - C1
    • 16. One real world feature: a bus station. Currently systems are disconnected and difficult to integrate: the feature is represented in multiple systems using different names, and classified and represented in different ways. [Diagram: the station appears as “Merak, Stasiun Bis” in the BIG National Gazetteer of Indonesia and as “Merak” in the Department of Transport Bus Terminals dataset; each entry carries an identifier, a feature type and a footprint (point or polygon). The gazetteer entries are used by a navigation application, an online public transport map and a passenger travel stats application, and are joined by a “same as” link.] The Spatial Identifier Reference Framework links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
    • 17. Identifiers This is the “tricky part”. Let’s start with the practical implication… Table 1 (Catchment ExtractionRate Storage): 1123343 730 300. Table 2 (Catchment Boundary Area Geometry): 1123343 33535.4 151.3344,35.330…….
    • 18. “Distributed” references. How to ask for this entity; how to deliver this entity (over the Internet). Table 1 (Catchment ExtractionRate Storage): 1123343 730 300. Table 2 (Catchment Boundary Area Geometry): 1123343 33535.4 151.3344,-35.330…….
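The practical implication on these two slides is that the shared catchment identifier is what lets two separately held tables be combined. A toy sketch, using the values from the slides: the dictionary layout, the key name "Area" (one reading of the second table's header) and the `describe` function are assumptions for illustration, and the truncated geometry is left out.

```python
# Two tables held by different systems, both keyed on the catchment identifier.
observations = {"1123343": {"ExtractionRate": 730, "Storage": 300}}
boundaries = {"1123343": {"Area": 33535.4}}

def describe(catchment_id):
    """Merge what each system knows about one catchment identifier."""
    merged = {"Catchment": catchment_id}
    merged.update(observations.get(catchment_id, {}))
    merged.update(boundaries.get(catchment_id, {}))
    return merged

record = describe("1123343")
```

The join only works because both systems use the same identifier. If one of them stored a name or a locally invented key instead, a cross-reference between the two would be needed first.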
    • 19. SDI resource access. One real world feature: a bus station. [Diagram: the bus station diagram from slide 16, with “Merak, Stasiun Bis” (BIG National Gazetteer of Indonesia) and “Merak” (Department of Transport Bus Terminals) joined by a “same as” link, annotated with the labels Provenance, URI, Describe, Discover and Link.] The Spatial Identifier Reference Framework links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
    • 20. Thank you. For more information: Rob.atkinson@csiro.au
