For this presentation we will talk about the role of data in the effective use of GIS. We will also cover the proper data structures and schemas for use in GIS and review the notion of metadata. Lastly, we’ll review some important issues concerning linking data and discuss issues of confidentiality.
To review, you will remember that GIS combines software, hardware, procedures, people and data. Each element is important, but use of GIS is easier when the data is well formed and ready to go into GIS.
There is a rule of thumb with GIS work known as the 90% rule. It states that for any GIS activity, 90% of the cost will be devoted to data preparation, and 10% to actually producing maps.
This means that many tasks need to be completed before any map can be produced. For instance, it is necessary to collect, clean, validate, and format the data to make sure it is accurate. The data may then need to be linked with other data before it can be used, which can mean additional work. Mapping itself also takes work, but comparatively much less.
As you can see, data is important in GIS. In fact, GIS analysis is only as strong as the data used.
Data, whether in a GIS or not, should of course be accurate. This means that it reflects reality as much as possible. In GIS there are two types of accuracy to be concerned with: spatial accuracy, which refers to whether items are located correctly, and attribute accuracy, which refers to whether the attributes are correct and properly linked to geography.
Here is a representation of spatial accuracy. Let’s say you found online a file with latitude and longitude coordinates of hotels in India. You decide you want to create a shapefile with these coordinates. When you then overlay them on images in Google Earth, you see that the points aren’t accurate. Here’s the scene in Google Earth <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> And here is the location of the Hotel Suryaa <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. This location is inaccurate because the real location of the hotel Suryaa is here. <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. The point is off by 50 meters or more.
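If you want to put a number on an error like this, one option is to compute the distance between the recorded coordinate and a trusted reference coordinate. The sketch below uses the haversine formula on a spherical earth. The two coordinates are hypothetical and are not the real location of any hotel.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters (spherical approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical coordinates, for illustration only.
recorded = (28.5600, 77.2700)   # point taken from the downloaded file
reference = (28.5605, 77.2700)  # point read off trusted imagery

error_m = haversine_m(*recorded, *reference)
print(f"Positional error: about {error_m:.0f} m")
```

Whether an error of this size matters depends on the scale at which the data will be used.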
Spatial accuracy can be affected by scale. For instance, here is the same point when viewed at a different scale. <CLICK TO DISPLAY ANIMATED ELEMENT> At this scale the point location is still inaccurate, in that it isn’t the exact latitude and longitude for the hotel. However, because the map now covers far more ground than the size of the error, the effect is less obvious. In fact, if a location was derived using a map at one scale, its accuracy can be assessed by using a map at a larger scale (a map that is more “zoomed in”).
To illustrate, here is a screen shot from Google Maps. Even though it isn’t a GIS, it does rely on a spatial database in that it has locational information and attributes about the locations. If you zoom into the location of the hotel, you see that <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> instead of saying the building is the Hotel Suryaa, it has the building listed as “Hotel Crowne Plaza”
The example from Google Maps illustrates another consideration for strong data: timeliness. Their database is old and doesn’t reflect that this hotel is now the Suryaa and no longer the Crowne Plaza. The world changes, which means that spatial databases, or any data set, can quickly become out of date, so it is important to be aware of the timeliness of the data. The data doesn’t necessarily have to be the most recent. Sometimes there may be value in having older files, for instance if you want to track changes over time. However, you as the data user need to be aware of the time frame of the data you use and include information about the time frame of the data you create.
Software, whether it’s a GIS program or not, must know how to read and interpret data files. This means that the data needs to be stored in a standard way that the software expects. The way that the data is stored is known as its structure or, more commonly, its schema.
A standard schema for data has evolved over the years, and it is considered best practice to use it whether the data will be used in a GIS or not. This standard schema is as follows: one record per geographic unit, variables are stored in columns, and there are no blank cells unless data is missing.
Here’s an example of poor data schema. The variables are listed as rows and the columns are the geographic units. It is still a valid way to display data for a table in a publication or presentation, but you would not want to store data using this schema if you wanted to use it in a GIS.
Here is another example of poor data schema. There are several things wrong with this table. First, there are blank cells that don’t represent missing data. The blank cells are supposed to indicate that the value from the last filled cell is to be repeated. <CLICK TO ADVANCE ANIMATION> Second, there are duplications of district names. In this made-up country, there are districts with the same name in different provinces. <CLICK TO ADVANCE ANIMATION> We’ll come back to this problem in a little bit.
Here is a proper data schema for a GIS program. As you can see, there is one record per geographic unit, in this case Region. Regions don’t repeat. Columns contain variables. Each cell contains well formed data.
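A table laid out this way can usually be turned around without retyping it. Here is a minimal sketch in plain Python, assuming the table has been read into a nested dictionary keyed by variable and then by country. The function name `to_records` and the field name `Country` are made up for this illustration; the figures are the ones on the slide.

```python
# Rows are variables and columns are geographic units (the layout to avoid).
wide = {
    "Total": {"China": 1339724852, "India": 1210193422,
              "United States": 312417000, "Indonesia": 237556363},
    "Density per km2": {"China": 140, "India": 368,
                        "United States": 32, "Indonesia": 121},
}

def to_records(table):
    """Turn {variable: {unit: value}} into one record per geographic unit."""
    records = {}
    for variable, by_unit in table.items():
        for unit, value in by_unit.items():
            # Create the record for this unit the first time it is seen.
            records.setdefault(unit, {"Country": unit})[variable] = value
    return list(records.values())

records = to_records(wide)
for record in records:
    print(record)
```

Each printed record is now one geographic unit with its variables as named fields, which is the layout a GIS expects.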
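Both problems can be handled before the data goes into a GIS. The sketch below fills blank cells with the value above them and then checks that province and district together identify each record. The rows and field names are invented for this illustration.

```python
# Hypothetical rows: a blank province means "same as the row above".
rows = [
    {"Province": "Coast", "District": "North", "Population": 24015},
    {"Province": "", "District": "West", "Population": 31154},
    {"Province": "Mountain", "District": "North", "Population": 62442},
]

def fill_down(rows, field):
    """Return copies of rows with blanks in `field` replaced by the value above."""
    filled, last = [], None
    for row in rows:
        value = row[field] or last  # keep the last non-blank value seen
        last = value
        filled.append({**row, field: value})
    return filled

clean = fill_down(rows, "Province")

# District names repeat across provinces, so the unique unit is the pair.
keys = [(r["Province"], r["District"]) for r in clean]
print(len(keys) == len(set(keys)))  # True when no geographic unit repeats
```

Note that "North" appears twice as a district name, yet the combined key is still unique.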
As I mentioned, proper documentation is a key component of strong data. Including metadata is the best way to document data. Simply put, metadata is data about data. It provides the data user with information about the data such as: <READ SLIDE>
Metadata is especially important with spatial data because of issues of:
Spatial accuracy: it’s important for data users to know how the data was collected and whether there are scale issues to consider.
Coordinate systems and datums: sometimes it is clear what coordinate system was used, but other times it isn’t. Without metadata the user may not know what coordinate system or datum the data is in, and this may make it difficult to use the data.
Confidentiality: spatial data can raise issues concerning confidentiality and privacy. The metadata can make sure data users are aware of these issues and of what restrictions may exist on sharing the data or even presenting maps.
Timeliness: this one should be obvious. It tells the user when the data was collected.
Because metadata is so important, the International Organization for Standardization (ISO) has produced an international standard for geographic metadata. The ISO 19115 standard mandates that certain elements be included in the metadata. It also defines the schema, or structure, for metadata. For more information about the ISO 19115 standard, you can visit the ISO web site. It’s important to note that many countries have developed their own national standards for spatial metadata; these national standards should be compatible with the ISO standard. It is important to research any national metadata standards you may want to conform to.
Here is an excerpt from metadata for a file obtained from the UN’s Second Administrative Level Boundary (SALB) site. The actual metadata file is much longer and contains many more elements, but this will give you an example of the type of information that is contained in a metadata file.
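To make this concrete, here is a small sketch of a metadata record and a check for items that were left out. The field names are invented for this illustration and are not taken from any metadata standard; a real project would follow whatever standard it has adopted.

```python
# Illustrative field names only; not the element names of any standard.
REQUIRED = ["source", "creator", "created", "coordinate_system",
            "datum", "restrictions"]

def missing_metadata(metadata):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED if not metadata.get(field)]

# Hypothetical record for a boundary file.
metadata = {
    "source": "National Bureau of Statistics",
    "creator": "GIS unit",
    "created": "2011-06-30",
    "coordinate_system": "Geographic (latitude/longitude)",
    "datum": "",  # left blank: the user cannot tell which datum was used
    "restrictions": "Internal use only",
}

print(missing_metadata(metadata))
```

A check like this only confirms that something was written down. It cannot tell you whether what was written is correct.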
Most data programs differentiate between data types and will assign each variable to one type or another. The way the field is assigned can affect the way the program handles the data.
For fields that are defined as text, arithmetic operations such as addition and subtraction are not allowed. For fields that are defined as numeric, arithmetic is permitted. One issue, however, is that many programs may require the user to declare the number of decimal places before entering data. This is an important consideration when storing coordinates in a field: if too few decimal places are declared, the full coordinate cannot be stored, which can have an impact on accuracy.
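To get a feel for how much the number of decimal places matters, the sketch below rounds a latitude to different precisions and converts the loss into meters. It assumes one degree of latitude is roughly 111 km, which is an approximation, and it uses the latitude shown on the deductive disclosure slide.

```python
METERS_PER_DEGREE = 111_000  # rough length of one degree of latitude (assumption)

def latitude_error_m(lat, decimals):
    """Approximate north-south error in meters caused by rounding a latitude."""
    return abs(lat - round(lat, decimals)) * METERS_PER_DEGREE

lat = 28.67171
for decimals in (1, 3, 5):
    stored = round(lat, decimals)
    print(decimals, stored, round(latitude_error_m(lat, decimals), 1))
```

With one decimal place the stored point is kilometers away from the original. With three it is tens of meters away, and with five nothing is lost for this value.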
One of the key tasks that a GIS needs to be able to do is linking tables. GIS uses a key field to make the link between tables. A key field is the field that contains information common between the tables. It is important to remember that it is not possible to link tables using key fields that are two different types. In the next slides I’ll illustrate this.
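The type problem is easy to reproduce outside a GIS. In this minimal sketch the area table is keyed by numeric district codes, and a code stored as text fails to match until it is converted. The codes and areas are the ones used on the slides.

```python
# Area in sq km keyed by a numeric district code.
area_by_code = {100: 243, 200: 310, 300: 602}

text_code = "100"  # the same code, stored as text in another table

matches_as_text = text_code in area_by_code          # text vs. number: no match
matches_as_number = int(text_code) in area_by_code   # types agree: match

print(matches_as_text, matches_as_number)
```

The usual fix is to convert one of the key fields so both tables use the same type before linking.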
Here are two tables. <ASK GROUP> What is the field that will be the key field? <ANSWER: DISTRICT> <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> It is possible to link these two tables using the common field, District. Just a note to point out that it is no coincidence that a geographic unit is the key field. As we’ve mentioned, geography is the common link between human activity. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see there is now a link between the two tables.
One important thing to point out is linking using text fields can be problematic because of variations in spelling.
Here are two tables. Notice that they each have a different spelling for the district North Kinley. <ADVANCE SLIDE TO DISPLAY ANIMATION> <ASK GROUP> What do you think will happen? Will it be possible to join these tables? <ADVANCE SLIDE TO DISPLAY ANIMATION> The answer depends on the software and the settings you select. In many GIS programs, a link will be made only for those records that do match. It’s easy to see that the linked table doesn’t have the complete number of records in this example, but if you had many records, it would be easy to miss this fact. So a good practice is to check the record count after the join to make sure it is correct.
As you can imagine, there are many different ways text fields can be problematic. Linking using numeric fields is often more reliable since they are less vulnerable to variations. For this reason, countries often use numeric codes to identify administrative units. Often the national bureau of statistics or census bureau manages such codes. If there are standardized national codes, it is a good idea to include them in databases.
So here are two tables with a field for district codes, which were assigned by the national bureau of statistics. If district code is used as the key field <ADVANCE SLIDE TO DISPLAY ANIMATION> then spelling variation in the district field doesn’t matter and the tables can be joined successfully. <ADVANCE SLIDE TO DISPLAY ANIMATION>
Another advantage of numeric codes associated with geography is that they can manage geographic hierarchy effectively. So let’s say this is North District. North District is divided into three provinces. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> Coast province, Mountain province and Savanna province. North District has a code of 100. Most countries set up their national codes so that hierarchy is included. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see from the table, all of the provinces are numbered in the 100s since they are in North District.
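Here is a small sketch of that record count check, using the values from the slide. The join keeps only district names that appear in both tables, which is an assumption about how the software behaves; as noted, actual behavior depends on the program and its settings.

```python
population = {"North Kinley": 24015, "West": 31154, "South": 62442}
area = {"N. Kinley": 243, "West": 310, "South": 602}

def inner_join(left, right):
    """Keep only the keys that are present in both tables."""
    return {key: (left[key], right[key]) for key in left if key in right}

joined = inner_join(population, area)

# Compare record counts and list anything that dropped out of the join.
unmatched = sorted(set(population) - set(joined))
print(len(population), len(joined), unmatched)
```

Listing the unmatched keys, rather than only counting, tells you which spellings need to be reconciled.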
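The same join can be sketched with the district code as the key. The values come from the slide; the field names are made up for this illustration.

```python
population = [
    {"district": "North Kinley", "code": 100, "population": 24015},
    {"district": "West", "code": 200, "population": 31154},
    {"district": "South", "code": 300, "population": 62442},
]
area = [
    {"district": "N. Kinley", "code": 100, "area_km2": 243},
    {"district": "West", "code": 200, "area_km2": 310},
    {"district": "South", "code": 300, "area_km2": 602},
]

# Index the area table by code, then attach the area to each population record.
area_by_code = {row["code"]: row["area_km2"] for row in area}
joined = [
    {**row, "area_km2": area_by_code[row["code"]]}
    for row in population
    if row["code"] in area_by_code
]

print(len(joined) == len(population))  # every record found a match
```

The two spellings of North Kinley never enter the comparison, so all three records link.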
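When codes are built this way, the parent unit can be worked out from the code itself. This sketch assumes the scheme in the example, where everything above the last two digits identifies the parent. Real national coding schemes differ, so check how yours is constructed.

```python
# Province codes from the example table.
provinces = {"Coast": 101, "Mountain": 103, "Savanna": 105}

def parent_code(code):
    """Parent unit code, assuming the last two digits identify the child unit."""
    return (code // 100) * 100

for name, code in provinces.items():
    print(name, code, parent_code(code))
```

All three provinces report 100 as their parent, which is the code for North District.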
To review, here are the key points from the discussion on linking data <READ SLIDE>
Now to switch topics slightly. Confidentiality is an important consideration when working with spatial data. During this part of the lecture, we’ll discuss issues of confidentiality and spatial tools as well as present strategies for protecting confidentiality.
So let’s start by talking about confidentiality and what I’m referring to. Put simply, confidentiality is the idea that it is important to protect the identity of individuals. This is a requirement of many informed consent agreements that people sign when we collect data. It’s also a pillar of ethical research.
There are two threats to confidentiality. One is overt disclosure. Overt disclosure is the act of explicitly making data available that breaches confidentiality commitments, such as releasing data files that contain an individual’s name and/or data.
The second way that confidentiality can be breached is through deductive disclosure. That is the process of piecing together multiple pieces of the puzzle until a picture emerges. So for instance, let’s say there was a survey conducted. If you knew that a person was a 45 year old female, that narrows down the list somewhat. [ADVANCE SLIDE] Then if you knew that she has 5 children, that narrows it down even more. [ADVANCE SLIDE] If you know that she works for General Electric in Delhi, that makes it a little easier to potentially identify the person. [ADVANCE SLIDE] If you add a geographic coordinate of where she lives, [ADVANCE SLIDE] it’s almost the same as listing a name.
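The narrowing effect is easy to demonstrate by counting how many records match as each fact is added. The records below are fabricated for this illustration and the employer labels are placeholders.

```python
# Fabricated survey records, for illustration only.
respondents = [
    {"age": 45, "sex": "F", "children": 5, "employer": "A"},
    {"age": 45, "sex": "F", "children": 5, "employer": "B"},
    {"age": 45, "sex": "F", "children": 2, "employer": "A"},
    {"age": 45, "sex": "M", "children": 5, "employer": "A"},
    {"age": 30, "sex": "F", "children": 1, "employer": "B"},
]

def matches(records, **known):
    """Records consistent with everything the outsider already knows."""
    return [r for r in records if all(r[k] == v for k, v in known.items())]

# Each step adds one more known fact about the person.
steps = [
    {"age": 45, "sex": "F"},
    {"age": 45, "sex": "F", "children": 5},
    {"age": 45, "sex": "F", "children": 5, "employer": "A"},
]
counts = [len(matches(respondents, **known)) for known in steps]
print(counts)
```

Once the count reaches one, the person is effectively identified even though no name was ever released.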
When you add a spatial component to data it can be an overt disclosure of identifying information. At the very least it makes deductive disclosure easier. So what’s the answer? Should the spatial element be dropped?
There is an emerging recognition that there is a need to explicitly define issues of geoprivacy. Geoprivacy is a term coined to refer to “an individual’s right to prevent disclosure of the location of one’s home, workplace, daily activities or trips”
As people have thought about this issue of geoprivacy, there are 4 principles that have been laid out to guide people: [READ SLIDE] I’ll talk about each of them
The first principle is the basic protection of confidentiality. This protection is fundamental to ethical research. It covers information that might lead to physical, emotional, financial or other harm. It’s important to protect information that discloses identity.
The second key principle that informs the discussion on confidentiality is the importance of preserving the social-spatial linkage. As we’ve mentioned, all human activity takes place on earth. Understanding that adds context and perspective. It’s also a key to the advancement of science.
The third principle is the notion of data sharing. Data sharing means sharing data with other researchers or other important stakeholders. It’s essential on both scientific and financial grounds. It allows the data to have maximal use by letting other researchers use the data. Lastly, there’s a growing trend among funders of data collection efforts to require that the data be shared, either publicly or within the research community.
The last principle is the notion that data should be preserved and be available for future use. This raises the question, how long should the data be deemed “sensitive”? When if ever, can it be released? These are things that should be considered at the beginning of any data collection effort or establishment of a data system. It should be spelled out in advance to respondents or individuals who are providing information/data.
What are the strategies that can be employed to protect data?
The first strategy we’ll talk about is simply shifting the locations randomly. The advantage is that it is relatively easy to do. There are plugins for QGIS that will do this. The disadvantage is that you lose the original location and it introduces error.
The second strategy is what’s known as an affine transformation. This is a systematic change to the data: changing the scale, rotating, or shifting a set distance. This is easy to do, but it’s also easy to undo if people know the parameters of the transformation. In some cases, even if the exact parameters aren’t known, it’s still possible to deduce the types of transformation done if a set of points doesn’t match the geography on the ground (say points end up in the ocean or a lake because of the transformation).
Another strategy is to aggregate the data. Say, for instance, you have individual patient data: you can aggregate it to mask individual records. This too is easy to do, but it does require a sufficient number of data points, and finer variations in the data will be lost.
Another strategy is to despatialize the data: simply remove the coordinate system that ties the data to the earth, so that it uses Euclidean space instead of geographic space. This is simple, and it keeps relative position and placement. On the downside, though, you lose contextual data, so it won’t be possible to bring in other data that might be helpful to look at (such as road networks or the surrounding landscape).
Lastly, there’s always “do nothing”. You could make the decision not to collect or release the data. Another option would be to set up a cold room or allow on-site analysis only. This maintains all of the original spatial data. The disadvantage is that schemes like cold-room or on-site analysis reduce accessibility to the data, which can limit the social-spatial link, and they can be complicated to implement.
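As a rough sketch of what random shifting does, the code below moves each point a random distance in a random direction, up to a maximum. It assumes projected coordinates in meters, and the points and the 500 m limit are hypothetical. This is not how any particular QGIS plugin works.

```python
import math
import random

def perturb(x, y, max_shift, rng):
    """Move a point a random distance (up to max_shift) in a random direction."""
    angle = rng.uniform(0, 2 * math.pi)
    distance = rng.uniform(0, max_shift)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)

rng = random.Random(42)  # seeded only so this example is repeatable
points = [(1000.0, 2000.0), (1500.0, 2500.0)]  # hypothetical coordinates in meters
masked = [perturb(x, y, 500.0, rng) for x, y in points]

for original, moved in zip(points, masked):
    print(original, moved)
```

In real use you would not publish the seed or keep the shifts alongside the released file, since either would let someone reverse the masking.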
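The “easy to undo” point can be shown directly. This sketch applies a scale, a rotation about the origin and a shift, then reverses them using the same parameters. The point and the parameter values are hypothetical, and the function names are made up for this illustration.

```python
import math

def affine(point, scale, angle_deg, dx, dy):
    """Scale and rotate a point about the origin, then shift it by (dx, dy)."""
    x, y = point
    a = math.radians(angle_deg)
    xr = scale * (x * math.cos(a) - y * math.sin(a))
    yr = scale * (x * math.sin(a) + y * math.cos(a))
    return xr + dx, yr + dy

def undo_affine(point, scale, angle_deg, dx, dy):
    """Reverse the transformation: remove the shift, rotate back, undo the scale."""
    x, y = point[0] - dx, point[1] - dy
    a = math.radians(-angle_deg)
    return ((x * math.cos(a) - y * math.sin(a)) / scale,
            (x * math.sin(a) + y * math.cos(a)) / scale)

params = dict(scale=2.0, angle_deg=30.0, dx=1000.0, dy=-500.0)
original = (100.0, 250.0)
moved = affine(original, **params)
restored = undo_affine(moved, **params)
print(moved, restored)
```

Anyone who learns or guesses the parameters gets the original locations back, apart from floating point rounding.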
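A minimal sketch of aggregation is to keep only the district for each patient and release counts. The records are made up. The minimum count used to withhold small cells is an assumption added for this illustration and was not part of the presentation, though it reflects the point about needing enough data points.

```python
from collections import Counter

# Hypothetical patient records reduced to the district each patient lives in.
patients = ["North", "North", "West", "North", "South", "West", "North"]

MIN_COUNT = 3  # illustrative threshold below which a count is withheld

counts = Counter(patients)
released = {district: (n if n >= MIN_COUNT else None)
            for district, n in counts.items()}
print(released)
```

Districts with only one or two patients are withheld here, because a very small count can point to an individual almost as well as a coordinate can.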
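One way to picture despatializing is to re-express the points relative to an arbitrary local origin and drop any record of where that origin sits on the earth. The sketch below assumes projected coordinates in meters and hypothetical points. It is only an illustration of the idea, since a plain shift can be reversed by anyone who learns the origin.

```python
import math

def despatialize(points):
    """Re-express points relative to the first point, used as a local origin."""
    x0, y0 = points[0]
    return [(x - x0, y - y0) for x, y in points]

# Hypothetical projected coordinates in meters.
points = [(500100.0, 3170200.0), (500400.0, 3170600.0)]
local = despatialize(points)

def separation(a, b):
    """Straight-line distance between two points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

before = separation(points[0], points[1])
after = separation(local[0], local[1])
print(local, before, after)
```

Distances between points are unchanged, which is the relative position the strategy keeps, but the points can no longer be laid over roads or imagery.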
There is no magic answer. It’s a matter of finding the technique that best suits your needs and the commitments made to respondents. It’s possible to think about the issue in terms of spatial integrity and disclosure risks. Making the decision on what approach to take is dependent on where on this spectrum you want to land. You can preserve spatial integrity or you can minimize risk of confidentiality breaches, but you can’t have both.
The article by VanWey that I mentioned earlier states that ignoring the issue is unacceptable. Something that often gets lost in the excitement over GIS is the set of issues around confidentiality. However, those who collect data must think about the confidentiality issues and make sure their informed consent agreements adequately describe the way data will and won’t be used. Data users also have a responsibility to ensure that any extra contextual analysis they do doesn’t increase the risk of deductive disclosure.
Key points are [READ SLIDE] Any questions?
Data 101: Fundamentals of Data in GIS
Overview
• Role of data
• Data structures and schemas
• Metadata
• Linking data
• Issues of confidentiality
Data accuracy
• Data should accurately reflect reality
• In GIS there are two types of accuracy to be concerned with:
  • Spatial accuracy: items located correctly
  • Attribute accuracy: attributes are correct and properly linked to geography
Timeliness
• Is the data for the time period of interest?
• Boundaries change
• New features created
• Features change
Data Structure
• Proper data structure is necessary in order to effectively use data
• Software must know how to read the data, and query it
• The structure of the data is also known as data schema
Data Schema
• For most programs, data will need to be stored in a row and column format
• GIS programs expect well formed data in the following schema:
  • One record per geographic unit
  • Geographic units don’t repeat in records
  • Variables are stored in columns
  • No blank cells unless data is missing
Data Schema
Population                      China        India        United States   Indonesia
Total                           1339724852   1210193422   312417000       237556363
Percent of World’s Population   19.23%       17.37%       4.48%           3.41%
Population Density              140/km2      368/km2      32/km2          121/km2
Poor data schema
• Columns are geographic units
• Variables are rows
Proper Data Schema
• Columns are variables
• One record per geographic unit
Metadata
• Data about data
• Provides information on:
  • Source of data
  • Who created it
  • When it was created
  • Coordinate system and datum
  • Usage and sharing restrictions
Metadata
• Metadata is especially important with spatial data because of issues of:
  • Spatial accuracy
  • Coordinate systems and datums
  • Confidentiality
  • Timeliness
Metadata formats
• International standard ISO 19115
  • Mandatory elements
  • Schema for metadata
• Countries may have their own national standards that are compatible with the ISO standard but provide extra elements
Data Types
• Text
• Numeric
• Coordinates
• Programs assign variables to be a specific type which can affect the way the program handles data
Data Types
• Text
  • Arithmetic cannot be conducted on values in text fields
• Numeric
  • Arithmetic permitted
  • May require user to declare number of decimal places before entering data
  • This can be important when storing coordinates
Linking data
• Key field
  • The field that contains information common between tables
  • Tables are linked using the key field
  • Can’t link using key fields that are two different types
District   Population   Male Pop   Female Pop
North      24015        14409      9606
West       31154        16202      14952
South      62442        29972      32470

District   Area (sq km)
North      243
West       310
South      602

District is the key field

District   Population   Male Pop   Female Pop   Area (sq km)
North      24015        14409      9606         243
West       31154        16202      14952        310
South      62442        29972      32470        602
Linking data
• Linking using text fields can be problematic
  • Variations in spelling
District       Population   Male Pop   Female Pop
North Kinley   24015        14409      9606
West           31154        16202      14952
South          62442        29972      32470

District    Area (sq km)
N. Kinley   243
West        310
South       602

The two tables have different spellings for the district North Kinley

District   Population   Male Pop   Female Pop   Area (sq km)
West       31154        16202      14952        310
South      62442        29972      32470        602
Linking data
• Linking using numeric fields is often more reliable and less vulnerable to variations and other issues
• Countries often use numeric codes for administrative units to get around problems with spelling variations
• If standardized national codes exist, it is a good idea to include them in data
• National Bureau of Statistics or Census often manage such codes
District       Dist code   Population   Male Pop   Female Pop
North Kinley   100         24015        14409      9606
West           200         31154        16202      14952
South          300         62442        29972      32470

District    Dist code   Area (sq km)
N. Kinley   100         243
West        200         310
South       300         602

Dist code is the key field

District   Dist Code   Population   Male Pop   Female Pop   Area (sq km)
North      100         24015        14409      9606         243
West       200         31154        16202      14952        310
South      300         62442        29972      32470        602
Advantage of numeric codes
Can manage hierarchy effectively
District   Province   Code
North      Coast      101
North      Mountain   103
North      Savanna    105
[Map: North District (code 100), divided into Coast, Mountain and Savanna]
Linking data key points
• Key fields must be of the same type
• Text fields can be problematic due to spelling variations
• Numeric fields are often a more reliable key field
• Unique geography codes, if available in a country, are often the best option for making linkages
Data and confidentiality issues
• Important issue when working with spatial data
• Discuss issues of confidentiality and spatial tools
• Present strategies for protecting confidentiality
Confidentiality
• Protecting identity of individuals
• Requirement
  • Informed consent agreements
  • Ethical research
Overt disclosure
The act of explicitly making data available that breaches confidentiality commitments.
Deductive Disclosure
• 45 year old female
• Has 5 children
• Works for General Electric in Delhi
• 28.67171, 77.21211
Spatial Data
• Overt disclosure
• Makes deductive disclosure easier
Geoprivacy
“[an] individual’s right to prevent disclosure of the location of one’s home, workplace, daily activities or trips.”
Protection of geoprivacy and accuracy of Spatial Information: How Effective are Geographical Masks? Kwan, Casas, Schmitz. Cartographica, Vol 39, #2
Four Principles
• Protection of Confidentiality
• Social-Spatial Linkage
• Data Sharing
• Data Preservation
Confidentiality and spatially explicit data: Concerns and challenges. VanWey, Rindfuss, Gutmann, Entwisle, Balk. PNAS, vol. 102, no. 43
1. Protection of Confidentiality
• Fundamental to ethical research
• Information that might lead to physical, emotional, financial or other harm
• Protection of information that discloses identity
2. Social-Spatial Linkage
• All human activity takes place on earth
• Understanding that adds context and perspective
• Key to advancement of science
• Essential for understanding the diffusion of behaviors
3. Data Sharing
• Essential on both scientific and financial grounds
• Provide access to data for other researchers
• Condition of funders
4. Data Preservation
• Data available in the future
• How long should data be deemed “sensitive”?
• When, if ever, can it be released?
Random Perturbations
• Random shifting of point locations
• Pros: Easy (relatively) to do
• Cons: Lose original location, introduces error
Affine Transformation
• Change scale
• Rotate
• Shift a set distance
• Combination
• Pros: Easy to do
• Cons: Easy to undo, can impact some types of analysis
Aggregate
• Point locations are aggregated to higher unit of analysis
• Pros: Easy to do
• Cons: Requires sufficient data points, finer data variations will be lost
Despatialize
• Remove coordinate system
• Use Euclidean space
• Pros: Simple, keeps relative position and placement
• Cons: Loses contextual data
Nothing
• Do not collect or release data
• Cold room or on-site analysis only
• Pros: Maintains all of the original spatial data
• Cons: Complicated, limits data sharing, limits social-spatial link
[Figure: two scales labeled Spatial Integrity and Disclosure Risk, each running between Maximum and Minimum]
“Ignoring is unacceptable”
• Can get lost in the excitement about GIS
• Those who collect data must think about the confidentiality issues
• Data users must also think about how their analysis may increase the risk of deductive disclosure
Key points
• Confidentiality issues arise when spatial context is included in data
• It’s important to protect confidentiality. People have an expectation that their identities are protected
• There are strategies that can preserve confidentiality, but there is no “one-size-fits-all” solution