Integrating SAS® and Geographic Information Systems for Regional Land Use Planning
Bill Bass, Houston-Galveston Area Council, Houston, TX

ABSTRACT

The Houston-Galveston Area Council (H-GAC) provides regional socio-economic and land-use forecasting analysis for the 13 counties surrounding the Houston metropolitan area. Forecasting efforts require the integration of geographic data and large amounts of tabular data from various sources, such as parcel boundary datasets and county appraisal records. H-GAC uses SAS® in conjunction with ESRI® ArcGIS® geographic information systems (GIS) software to produce a comprehensive land-use database for the 13-county region. The integrated process involves millions of appraisal data records as well as large volumes of geographic data. Through the combined use of SAS® and GIS, H-GAC is able to streamline the data development process compared with using other SQL and desktop database technologies.

INTRODUCTION

H-GAC is the region-wide voluntary association of local governments in the 13-county Gulf Coast Planning region of Texas. It is one of several Council of Government organizations (COGs) in the State of Texas, and serves 12,500 square miles with more than 5.7 million people. H-GAC is governed by a Board of Directors composed of local elected officials who serve on the governing bodies of member local governments. There are 35 members on the H-GAC Board. H-GAC provides many tools, information, region-wide plans, and services to support municipalities, districts, and non-profit organizations. H-GAC's mission is to serve as the instrument of local government cooperation, promoting the region's orderly development and the safety and welfare of its citizens (H-GAC 2008).

One of H-GAC’s programs is regional socio-economic modeling.
The Socioeconomic Modeling group is an information and research hub in the Community and Environmental Planning department that gathers, processes, generates, analyzes, and disseminates information on the past, present, and future land use, economy, and population of our region in order to support comprehensive regional operations and planning (H-GAC 2008). The primary purpose of forecasting efforts within the socio-economic group is to support Travel Demand Modeling, which is used in Regional Transportation Planning (RTP). However, H-GAC also uses socio-economic products for other long-range planning purposes that involve environmental conservation, water quality, and urban planning.

Due to the large amount and complexity of the data obtained for use in socio-economic modeling, SAS® is used for a variety of functions including data development, data organization, statistical analysis, and integration of data across multiple databases. This paper explains how H-GAC’s Socio-Economic Modeling group uses SAS® in conjunction with GIS to develop regional land use data, which is one component of the overall regional modeling framework employed at H-GAC.

PROCESSING OF COUNTY PARCEL BOUNDARY AND APPRAISAL DISTRICT DATA

H-GAC obtains appraisal data from each of the 13 County Appraisal District (CAD) offices where data is electronically available. Appraisal data are typically very large datasets that cover a wide variety of attributes regarding parcels (real property) within each county. Some of the data attributes included in appraisal roll datasets are:

  • Valuation of land and improvements (e.g. buildings)
  • Land usage through the State Classification Coding framework
  • Ownership and legal descriptions of property
  • Taxing entities and exemptions
  • Square footage and structural amenities

In addition, H-GAC obtains parcel boundary datasets from the counties that complement the appraisal roll data. Parcel boundaries are typically provided in industry-standard shapefile formats that can be viewed in GIS software, such as ESRI® ArcView® or ArcInfo® products. In many cases, the parcel boundary data and appraisal roll data are not related in a manner that allows for usage as a relational database system, although both datasets do have common fields, such as Account Number. Furthermore, data schemas are not standardized across county appraisal systems, so the source data arrive in a variety of layouts and structures with a variety of field naming conventions.

ISSUES AND CHALLENGES IN WORKING WITH APPRAISAL DATA

Through H-GAC’s efforts in working with appraisal roll data, a number of challenges have been identified and overcome in order to develop a comprehensive regional appraisal database. These challenges exist both in the appraisal roll datasets that contain the property attribute data and in the GIS parcel boundary datasets. Challenges in working with appraisal roll data include:

  • Multiple datasets stored within a single text file, each with its own unique data schema
  • The need to convert data imported as character format to numeric, and numeric data to character
  • Cleanup of data entry errors such as leading and trailing spaces in primary key fields
  • Replacing zero values with NULL values to prevent errors when analyzing data

There are also challenges in working with the appraisal parcel boundary data due to the way data is stored within the county GIS systems. For instance, it is typical for a parcel to have more than one account number affiliated with it, either because the parcel has multiple owners or because it holds multiple accounts, as with a high-rise condominium complex.
Instances such as these are typically stored by “stacking” identical parcels on top of one another within the GIS, but giving each parcel feature a different Account Number. Although this may provide an effective end product for viewing ownership at the parcel level using a single table format, it does not support the establishment of a topologically integrated geographic database, where a single parcel of land can have one or more owners, which is typically represented through a more relational database structure. In the following sections, these issues and challenges are explained in detail, along with how GIS and SAS® are used together to develop standardized data for the region.

APPRAISAL ROLL DATA DEVELOPMENT

Writing SAS® code to import data files can be lengthy, and appraisal data is no exception. It is common for an appraisal roll dataset to contain more than 100 fields that each need to be listed in the DATA step that reads the file. Therefore an Excel® spreadsheet is used to help generate SAS® code that can be copied into the SAS® editor. Through the use of Excel® formulas and hard-coded text strings, a list of field names can be loaded into an Excel® spreadsheet, and from there used to generate the INFILE, LENGTH, and INPUT portions of the DATA step. This method saves time and reduces data entry errors, as field names are copied, not typed.

Once data is imported into SAS® dataset format, additional SAS® code is written to clean up and standardize the datasets into a common data structure for all counties in the region. Standardized dataset names, field names, and formats greatly simplify data development and allow the data to be used more efficiently in analysis of appraisal data. The following are some examples of how SAS® is used to clean up and standardize the appraisal attribute data.

Attribute data is typically provided in either one or several flat-file layouts. These are typically delimited text files using either comma or tab delimiters. In some cases multiple types of dataset are stored in a single text file, and thus SAS® is used to determine which records to read. For example, it is common for one file to contain not only the appraisal data that includes ownership, valuation, and land use, but also summary data that aggregates valuations by subdivision. Through the use of SAS®, a statement such as the one illustrated below can process the file, importing only the records that represent the appraisal roll data. In many cases, multiple import statements are used, so that each type of data can be loaded into a separate SAS® table. The following is an illustration of a conditional import statement that only loads records that have a Record Type of ‘4’ in the source file.

    Data Appraisal_Data Other_Data;
        Infile 'Input_file.txt' MISSOVER lrecl=5000; *Name of flat file to load;
        *Following code specifies field attributes used in conditional processing;
        Length Record_Type $ 1; *Initializes record type field;
        *Defines location of record type value in flat file. The trailing @ holds the record in the input buffer so the condition can be evaluated without skipping records;
        Input Record_Type $ 61-61 @;
        *Following code is conditional processing to only load certain record types;
        If Record_Type = '4' Then Do; *Only loads record types with a value of 4;
            *Initializes and defines other fields in flat file to import;
            Length First_Field $ 10 Second_Field $ 50;
            *Notice that the Record_Type variable is not read again here;
            Input First_Field $ 1-10 Second_Field $ 11-60;
            Output Appraisal_Data; *Name of dataset to write data;
        End;
        Else Output Other_Data; *Puts all other records into a scratch dataset, not used;
    Run;

Once data is loaded into SAS®, additional SAS® statements are used to assist in further cleaning the data. For instance, it is common for some fields to be initially imported as text, when in fact they should be defined as numeric. The same holds true for some attributes that are imported as numeric when they should be text (e.g. numbers that have leading zeros).

The following are two examples of code that are used in SAS® DATA step statements to handle these conversion scenarios.

    *Code for converting values from Numeric (N) to Character (C);
    C = Strip(Put(N,10.)); *Where 10. is the desired character length;

    *Code for converting values from Character (C) to Numeric (N);
    N = Input(C,8.0); *Where 8.0 is a numeric informat;

Another example of data cleanup that is performed on SAS® datasets is that of replacing zero values with NULL values. For appraisal data, it is typically not sufficient to note some values as being zero. Consider the value of land and improvements.
These items, if they exist, have a value associated with them. If the value is not known, then it should not be zero, but rather NULL so as to not skew statistical analysis. Land values of ‘0’ in the appraisal roll data are changed to NULL, as all land has a value. The same holds true for improvement values: if an improvement exists, it should have a value greater than zero, so any values of zero are changed to NULL. These changes are performed using a simple IF THEN statement in SAS® that looks for a zero value and changes it to NULL.

    *Replaces zero values with NULL values;
    If Land_Value = 0 Then Land_Value = .;
    If Improvement_Value = 0 Then Improvement_Value = .;

In some instances, data such as zip codes are provided either as aggregated values (e.g. 77027-1234) or as separated values in their own fields (e.g. 77027, 1234). H-GAC chooses to store zip code data as two separate fields, so for counties where the data is only provided in an aggregated format, the SAS® SUBSTR function is used. The following is an example of how two separate zip code fields are created from a single aggregate zip code field.

    *Code separates 5-digit zip prefix from 4-digit suffix;
    Zip_Code = Substr(Orig_Zip,1,5); *Reads and stores values of positions 1 thru 5;
    Zip_Code_Plus4 = Substr(Orig_Zip,7,4); *Reads and stores values of positions 7 thru 10;

Finally, in some instances primary key fields and fields with formatted codes are missing characters or are preceded by spaces. This can cause issues when trying to join data in multiple tables, as SQL typically views spaces as valid characters. Thus a value of ‘R1234’ in one table is not the same as a value preceded by a space such as ‘ R1234’ in another table, with the latter value being a data entry error. To resolve these issues, a DATA step statement is used to remove spaces from fields. The following is an example of such a statement.

    *Removes leading and trailing spaces from account number field;
    Acct_Num = Strip(Acct_Num);

The end result of using SAS® to process appraisal roll data is a standardized set of SAS® datasets for each county that have common fields and naming conventions for attributes such as owners, legal descriptions, land value, improvement value, and state classification code. Although each set of appraisal data from the county includes far more than just the standardized fields used by H-GAC, these additional fields are not dropped. Instead they are appended to the end of the common variables. From this point, analysis can be run against the SAS® appraisal roll datasets and reports generated and, if needed, exported to other formats such as Excel®, DBF, or delimited files.

GIS PARCEL BOUNDARY DATA DEVELOPMENT

In addition to performing quality review on attribute data, SAS® is also used to assist in the cleanup of geographic parcel boundary data. Depending upon the type of parcel (residential, commercial, mixed use, etc.), parcel features in the GIS dataset may involve multiple features ‘stacked’ on top of one another, with each feature having a corresponding account number. For instance, if there were two owners of a single parcel of land, each with their own account number for a single-family residential property, there may be two spatially and geometrically identical polygon features, each with the account number of the owner it represents. Therefore H-GAC uses ESRI® ArcGIS® in conjunction with SAS® to create a single polygon for these features, but retain the multiple account number assignments. In effect, the flat file structure of the appraisal GIS dataset is transformed into a more extensive relational database system, capable of supporting complex analysis.

First, GIS is used to calculate a centroid value for each polygon in the original parcel dataset, which is expressed as an X/Y coordinate. Think of an X/Y value as being latitude and longitude values. If two geometrically identical parcels are stacked on top of one another in the same geographic space, both will have the same X/Y coordinate value, or centroid location.

Next, the parcel dataset is processed using a method called Dissolving, where polygons are grouped and simplified based on some common value, in this case the X/Y coordinate. The result of the dissolve process is a new dataset that contains only one parcel boundary for a defined space, where before there may have been multiple parcels stacked on top of one another. This new dataset also retains the X/Y coordinate value of the final aggregated polygons. If a parcel is not stacked on top of another parcel to begin with, then the dissolve process merely takes the single parcel and places it into the new dataset.

What exists at this point are two GIS datasets:

  • The original parcel boundaries, which contain stacked and non-stacked parcels, each with their respective account numbers and X/Y coordinate; and
  • The dissolved parcel boundaries, which contain only one parcel for an area of land and an X/Y coordinate for each parcel

For the newly created dissolved parcel boundaries dataset, each parcel is given a unique parcel identification code, or Parcel ID. The Parcel ID field serves as the primary key for this dataset. Then both parcel datasets are exported to a shapefile format, which stores attribute data such as X/Y coordinate, Account Number, and Parcel ID in a DBF data table. This is the point at which SAS® assists in the integration of the two datasets into a relational database structure. SAS® is very efficient at handling the large amount of data to be processed for each county, sometimes upwards of 1 million parcels.

Using SAS® IMPORT statements as illustrated below, both DBF tables are loaded into SAS®.

    *Loads original parcels data table containing the Account Number and X/Y coordinate of each parcel;
    Proc Import Out=Original_Parcels Replace
        Datafile= 'c:Original_Parcels.dbf';
    Run;

    *Loads dissolved parcels data table containing the Parcel ID and X/Y coordinate of each parcel;
    Proc Import Out=Dissolved_Parcels Replace
        Datafile= 'c:Dissolved_Parcels.dbf';
    Run;

Next, the two datasets are joined using a PROC SQL LEFT JOIN statement as illustrated below.

    *Joins dissolved parcels dataset to original parcels dataset to obtain account numbers affiliated with each dissolved parcel;
    Proc SQL;
        Create Table Parcel_ID_to_Account_Number as
        SELECT X.Parcel_ID, X.XY_Coord, Y.Account_Number
        From Dissolved_Parcels AS X
        Left Join Original_Parcels AS Y
        On X.XY_Coord = Y.XY_Coord;
    Quit;

The above SAS® statement creates a dataset that contains all Parcel IDs from the dissolved parcels dataset, and their affiliated Account Numbers from the original dataset. The Parcel ID to Account Number table becomes a critical link between the parcel boundary GIS data and the Appraisal Roll property attribute data. Specifically, it allows a single parcel of land to be related to one or more accounts affiliated with that parcel, and then each account to its corresponding record of detail in the Appraisal Roll dataset. The following section illustrates how having such a table allows H-GAC to produce parcel-level land use data for the region.

Determination of Land Use from Appraisal Roll Databases

H-GAC uses appraisal data as a basis for determining land use in the 13-county region surrounding the Houston metropolitan area. To process large amounts of appraisal data, H-GAC organizes appraisal records by parcel, which can number upwards of 1 million records for a county, and over 3 million for the region. However, appraisal data is not the only input to the land use determination process, as H-GAC also acquires a variety of other data related to land use, such as locations of schools, government buildings, infrastructure, and environmental conservation and park areas. This additional information is used in conjunction with the appraisal roll data to obtain a more accurate land use determination, or to supply one where none may exist.

The first step in the process is to assign each appraisal roll record a Parcel ID. As discussed in the prior section, SAS® was used to process data from the H-GAC GIS to determine parcel assignments for each appraisal account. Using a PROC SQL LEFT JOIN statement illustrated below, each appraisal roll record is assigned to a parcel.

    *Joins appraisal roll to Parcel ID based on Account Number assigned to parcels;
    Proc SQL;
        Create Table Appraisal_Roll_Parcel_ID as
        SELECT X.Account_Number, X.Owner_Name, X.Legal, X.State_Class_Code, Y.Parcel_ID
        From Harris_Appraisal_Roll AS X
        Left Join Parcel_ID_to_Account_Number AS Y
        On X.Account_Number = Y.Account_Number;
    Quit;

The result of the query is a table that can be used as the basis for the land use model to determine land use and ownership by parcel. Since the process is primarily focused on land use, only a few of the many fields available in the Appraisal Roll dataset are retained for further processing. In order to determine land use, the State_Class_Code field is the field of focus, as this field contains two-character codes that denote the type of property (e.g. single-family residential, commercial, industrial, etc.).

The next step in the process is to determine land use of each parcel based on the State Class Code attribute retained in the prior query. Each record in the Appraisal Roll dataset is aggregated by the combined values of the Parcel_ID and State_Class_Code fields. This prevents two different accounts with the same Parcel ID and State Class Code from being listed more than once. For instance, if account ‘R12345’ had a State Class Code of ‘A1’, and account ‘R45678’ had a State Class Code of ‘A1’, and both were assigned to Parcel ID ‘HR890’, then all that is needed is a record that lists parcel HR890 as having a State Class Code of ‘A1’. Alternatively, if one of the State Class Codes for the above two accounts was different, say ‘A2’ for account R45678, then two records would be produced for parcel HR890, one with a State Class Code value of ‘A1’, and another with a value of ‘A2’. The following is an illustration of the PROC SQL code used for this step in the process.

    *Keeps only unique Parcel ID and State Class Code combinations;
    Proc SQL;
        Create Table Unique_Parcels_SC AS
        SELECT Distinct(Parcel_ID) AS Unique_Parcel_ID, State_Class_Code, Count(State_Class_Code) AS NumberOfDups
        From Appraisal_Roll_Parcel_ID
        GROUP BY Parcel_ID, State_Class_Code
        Having NumberOfDups >= 0;
    Quit;

As the next step, a DATA step and a SAS® procedure are used to transpose the vertical records for each parcel, whether it has a single State Class Code or multiple, into columns. Those columns are then merged to create a single State Class Code field, or SSC.

    *Creates counter to identify first Parcel ID record;
    Data Unique_Parcels_SC_N (Rename = (Unique_Parcel_ID = Parcel_ID));
        Retain Counter;
        Set Unique_Parcels_SC (Drop = NumberOfDups);
        By Unique_Parcel_ID;
        If First.Unique_Parcel_ID Then Counter = 1;
        Else Counter = Counter + 1;
    Run;

The result of the above statement is a dataset that numbers each Parcel ID observation in order, starting with a value of ‘1’ for the first instance, and then ‘2’, ‘3’, etc. if there are additional observations for that Parcel ID. This dataset is then used as input to the PROC TRANSPOSE statement below.

    *Transposes based on Parcel ID for each State Class Code value;
    Proc Transpose Data = Unique_Parcels_SC_N
        Out = Parcels_SC_Horiz (Drop = _Name_);
        By Parcel_ID;
        Var State_Class_Code;
        ID Counter;
    Run;

The result of the above statement is a table that lists each Parcel ID as a record with one or more values in horizontal attribute columns. Some parcels may have only one State Class Code value, whereas others may have several, and thus the dataset may have anywhere from one to seven attribute fields, one for each transposed value. Those multiple values are then merged into a single State Class Code field as illustrated below.

    *Creates final transposed parcel to state class code dataset;
    Data Parcel_SSC (Keep = Parcel_ID State_Class_Code);
        Set Parcels_SC_Horiz;
        Length State_Class_Code $10; *Set field size to be sum of all variables being merged;
        State_Class_Code = Strip(Strip(_1)||' '||Strip(_2)); *Merges multiple values;
    Run;

The above statement creates a two-column table that contains a field for Parcel ID and the merged State Class Code value stored as SSC. The Strip function is used to remove any leading or trailing spaces that result from merging fields that may be empty.

Next, the Parcel_SSC table is joined with a Land Use to State Class Code lookup table to assign a land use code to each parcel. H-GAC has defined approximately 70 land use types and has grouped them into 8 Land Use Categories. The Land Use to State Class Code lookup table includes the following fields: Land Use Code, Land Use Category, and State Class Code. Using a PROC SQL LEFT JOIN statement, the Parcel_SSC table is joined to the Land Use to State Class Code lookup table to obtain the corresponding Land Use Code and Land Use Category information for each parcel based on its State Class Code value.

At this point, a baseline land use determination is established for each parcel. However, as previously mentioned, H-GAC has additional information that can supplement the appraisal data to determine a more accurate land use classification. This supplemental information is helpful, as many appraisal roll records have Exempt status for their State Class Code values. Exempt properties are typically schools, religious entities, government property, public infrastructure, and natural areas that are not typically taxed in the same way as non-exempt properties. As a separate initiative, H-GAC uses GIS to overlay source data representing these types of properties on top of the parcel boundary framework, in order to obtain Parcel IDs for each of these entities. That information for each geographic dataset is then aggregated and placed into a single Land Use Overrides table that contains fields for the Parcel ID and the Land Use Code determined by the nature of the source geographic data (e.g. school, religious, government owned, park, etc.).
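
The baseline lookup join described above is not shown in code in this paper. A minimal sketch, following the pattern of the earlier PROC SQL joins, might look like the following. The table names Land_Use_Lookup and Parcel_Land_Use and the field names Land_Use_Code and Land_Use_Category are hypothetical names chosen for illustration, and the sketch assumes the lookup table holds one row for each State Class Code value that appears in Parcel_SSC, including merged values.

```sas
*Hypothetical sketch: table and field names other than Parcel_SSC, Parcel_ID, and State_Class_Code are assumptions;
*Joins each parcel to the lookup table to obtain its baseline land use;
Proc SQL;
    Create Table Parcel_Land_Use as
    SELECT X.Parcel_ID, X.State_Class_Code, Y.Land_Use_Code, Y.Land_Use_Category
    From Parcel_SSC AS X
    Left Join Land_Use_Lookup AS Y
    On X.State_Class_Code = Y.State_Class_Code;
Quit;
```

Because this is a left join, a parcel whose State Class Code has no match in the lookup table is still written to the output, with missing values for the land use fields, which makes unmatched codes easy to list for review.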

As a final step in creating a regional land use dataset, the baseline land use data developed in SAS® is joined with the Land Use Overrides table using a series of SAS® statements. This series of statements evaluates each parcel’s override table value to determine whether it is the same as the parcel’s baseline value. If it is, the override value is ignored and the existing land use value determined from the appraisal roll data is retained. This allows for more accurate tracking of how land use was determined, and helps to gauge the accuracy of appraisal data over time. Furthermore, if there are any conflicting values in the override table for a parcel, such as a parcel being listed as both a commercial facility and an industrial facility, those override records are ignored as well, and an error report table is produced so that the override values can be investigated further and corrected. What remains following the override audit steps is a final list of land use codes that should replace the existing baseline land use determination values. The override values are then joined to the baseline land use table and a final land use code is determined for each parcel: parcels with a valid override value take that value, and parcels that have no match in the override table retain their baseline value.

As a final output of the land use model, SAS® is used to create land use datasets that can be joined with GIS datasets using the Parcel ID value. This provides a simplified method for producing regional land use maps. Furthermore, SAS® is used to summarize the land use table by land use type to determine the amount of acreage in the region for each land use type. This is accomplished by joining the land use table to a table that lists each parcel and its acreage.

Conclusion

As discussed in this paper, H-GAC uses SAS® as a critical component in determining land use for the region. The regional land use effort is not a process that can be accomplished through the use of a single technology or software platform, but rather by integrating two separate software products. By using the best capabilities of two different systems, ESRI® ArcGIS® and SAS®, an integrated process has been developed. This process assists in overcoming challenges such as large-volume datasets, quality review and control of variables, and relating multiple datasets from different sources to create a comprehensive regional database. Furthermore, it allows H-GAC to conduct regional analysis by standardizing data across all county geographies.

References

H-GAC (Houston-Galveston Area Council). 2008. www.h-gac.com.

Contact Information

Bill Bass, GISP
Houston-Galveston Area Council
Socio-Economic Modeling
3555 Timmons Lane Suite 120
Houston, Texas 77027
(713) 499-6687
William.Bass@h-gac.com
