1. Colorado State Address Dataset
Data Quality
Nathan Lowry, GIS Outreach Coordinator
State of Colorado
September 24, 2014
2. Data Quality
Two Tracks:
1. Develop criteria and measure quality
•Develop quality measures in relation to ISO standards
•Draw from measures in standards and practice
2. Compare for potential corrective actions
•Master Street Address Guide (MSAG) and ALI
•US Postal Service Address Quality Improvement (CASS)
•Statewide Voter Registration System (SCORE)
•Motorist Insurance Identification Database (MIIDB)
3. ISO 19157 Geographic information - Data quality
Provides comprehensive definitions and testing guidance for measuring data quality:
completeness: presence and absence of features, their attributes and relationships
•commission: excess data present in a dataset
•omission: data absent from a dataset
logical consistency: degree of adherence to logical rules of data structure, attribution and relationships (data structure can be conceptual, logical or physical)
•conceptual consistency: adherence to rules of the conceptual schema
•domain consistency: adherence of values to the value domains
•format consistency: degree to which data is stored in accordance with the physical structure of the dataset
•topological consistency: correctness of the explicitly encoded topological characteristics of a dataset
positional accuracy: accuracy of the position of features
•absolute (or external) accuracy: closeness of reported coordinate values to values accepted as or being true
•relative (or internal) accuracy: closeness of the relative positions of features in a dataset to their respective relative positions accepted as or being true
•gridded data position accuracy: closeness of gridded data position values to values accepted as or being true.
temporal quality: accuracy of the temporal attributes and temporal relationships of features
•accuracy of a time measurement: correctness of the temporal references of an item (reporting of error in time measurement)
•temporal consistency: correctness of ordered events or sequences, if reported
•temporal validity: validity of data with respect to time
thematic accuracy: accuracy of quantitative attributes and the correctness of non-quantitative attributes and of the classifications of features and their relationships.
•classification correctness: comparison of the classes assigned to features or their attributes to a universe of discourse (e.g. ground truth or reference dataset)
•non-quantitative attribute correctness: correctness of non-quantitative attributes
•quantitative attribute accuracy: accuracy of quantitative attributes
Governor's Office of Information Technology ~ Executive Leadership Team
4. Determining Sampling Size
Sample Size and Confidence Interval Tutorial
● The confidence interval (commonly referred to as the margin of error or error rate) is the plus-or-minus figure you hear mentioned relative to surveys or opinion polls. For example, if you use a confidence interval of 4 and 47% of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer. Most researchers prefer a confidence interval of less than 4 percentage points.
● The confidence level tells you how sure you can be. Expressed as a percentage, it represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.
● When you put the confidence level and the confidence interval together, you can say (for example) that you are 95% sure that the true percentage of the population is between 43% and 51%.
● The wider the confidence interval (higher margin of error) you are willing to accept, the more certain you can be that the whole population's answers would be within that range. For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be very certain that between 40% and 80% (a margin of ±20 points) of all the people in the city actually do prefer that brand. However, you cannot be so sure that between 59% and 61% (a margin of ±1 point) of the people in the city prefer the brand.
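The relationship between confidence level, confidence interval and sample size can be sketched in code. The slides do not say which formula was used, so this is a minimal sketch that assumes the common Cochran formula for a proportion, with an optional finite population correction. The function name `sample_size` and its parameters are illustrative, not taken from the presentation.

```python
import math
from statistics import NormalDist


def sample_size(margin, confidence=0.95, proportion=0.5, population=None):
    """Estimate the sample size needed for a given margin of error.

    margin      -- confidence interval half-width as a fraction (0.04 = 4 points)
    confidence  -- confidence level as a fraction (0.95 = 95%)
    proportion  -- expected share picking an answer; 0.5 is the worst case
    population  -- total population size, or None to treat it as very large
    """
    # Two-sided z-score for the requested confidence level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for a large population (assumed, not from the slides).
    n = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction shrinks the sample for small populations.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)
```

Calling `sample_size(0.04)` gives the sample needed for a confidence interval of 4 at the 95% confidence level; passing `population=` applies the correction when the address universe is small.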
12. Data Quality - Sampling Method
4. Select the address points associated with the selected road segments
13. Data Quality - Sampling Method
5. Repeat steps 3 & 4 until sample size is exceeded
14. Data Quality - Sampling Method
5. Repeat steps 3 & 4 until sample size is exceeded
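The repeat-until-exceeded loop in steps 4 and 5 can be sketched as follows. Step 3 is not shown in this section, so the sketch assumes it is a random draw of one road segment. The function name `sample_addresses` and the `points_by_segment` mapping are hypothetical stand-ins for the GIS layers.

```python
import random


def sample_addresses(points_by_segment, target_size, seed=None):
    """Draw road segments until the sampled address points exceed target_size.

    points_by_segment -- hypothetical dict: segment id -> list of address point ids
    target_size       -- sample size that must be exceeded
    seed              -- optional seed so a draw can be reproduced
    """
    rng = random.Random(seed)
    segments = list(points_by_segment)
    rng.shuffle(segments)  # assumed step 3: pick road segments at random

    sampled_segments, sampled_points = [], []
    for segment_id in segments:
        if len(sampled_points) > target_size:
            break  # step 5: stop once the sample size is exceeded
        sampled_segments.append(segment_id)
        # Step 4: take every address point tied to the selected segment.
        sampled_points.extend(points_by_segment[segment_id])
    return sampled_segments, sampled_points
```

If the target is larger than the whole dataset, the function simply returns every segment and point.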
15. Data Quality – Sampling Method
16. Data Quality: The DPS-1 Universe
17. Data Quality: DPS-1 Sample Sites
18. Data Quality: DPS-1 Sample Sites 1 & 2
19. Data Quality: DPS-1 Sample Sites 3, 4, and 5
20. Data Quality - Completeness
•Omissions – Correct location which is missed (a point present in OIT data but missing in DPS data)
•Commission – A location point created in error (a point present in DPS data which does not exist in OIT data)
•Omissions and Commissions are defined based on the assumption that OIT data is correct
Results
•We weight the omissions and commissions equally using this formula:
0.5(OmissionPct) + 0.5(CommissionPct) = Overall Percent Score
•Apartments Only = 70.91%
•Houses and Commercial = 89.59%
•All = 75.68%
•Not reflective of all DPS addresses
•Apartment inaccuracies sway aggregate percentage heavily
•Apartments greatest area of concern
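The equal-weight score can be sketched in code. The slides do not define how OmissionPct and CommissionPct are computed. Because higher overall scores read as better, this sketch assumes each is the percentage of points free of that error type, with OIT treated as the reference. The function names and the ID sets are hypothetical.

```python
def completeness_percentages(oit_ids, dps_ids):
    """Compare DPS address IDs against OIT IDs (OIT assumed correct).

    Returns (omission_pct, commission_pct), each taken here to be the
    percentage of points that are free of that error type.
    """
    oit_ids, dps_ids = set(oit_ids), set(dps_ids)
    omissions = oit_ids - dps_ids    # present in OIT, missing in DPS
    commissions = dps_ids - oit_ids  # present in DPS, absent from OIT
    omission_pct = 100.0 * (1 - len(omissions) / len(oit_ids))
    commission_pct = 100.0 * (1 - len(commissions) / len(dps_ids))
    return omission_pct, commission_pct


def overall_percent_score(omission_pct, commission_pct):
    """Equal weighting from the slide: 0.5(OmissionPct) + 0.5(CommissionPct)."""
    return 0.5 * omission_pct + 0.5 * commission_pct
```

For example, `overall_percent_score(*completeness_percentages(oit, dps))` scores one sample site, where `oit` and `dps` are collections of matched address identifiers.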
21. Data Quality - Positional Accuracy
•OIT locations and DPS locations compared spatially
•Line segments are created to link DPS location to its correlating OIT point
•Severe errors primarily present in apartment locations.
•Few true house errors; most are inconsistent with OIT points because of the use of laser range finders
Issues with Apartments
•Stacking – Many apartments stacked on top of each other in one location
•Consequently, lack of spatial differentiation is present
•Spatial inaccuracy is significant
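Horizontal accuracy at 95% confidence = 1.7308 * SQRT( (Δ1^2 + Δ2^2 + Δ3^2 + … + Δn^2) / n )
Where –
● 1.7308 = Multiplier applied to the horizontal root-mean-square error (RMSE) to report accuracy at the 95% confidence level
● Δi = Distance (Feet) between a DPS location and its correlating OIT point
● n = Number of Distances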
Houses and Commercial = 38 feet horizontal accuracy at 95% confidence
Apartments = 125 feet horizontal accuracy at 95% confidence
All = 105 feet horizontal accuracy at 95% confidence
•We again see the apartments swaying the overall results, while houses feature far less error
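The horizontal accuracy formula on this slide can be computed directly from the lengths of the link lines. This is a minimal sketch of that calculation. The function name `horizontal_accuracy_95` is illustrative, and the input distances would come from the DPS-to-OIT link segments.

```python
import math

# Multiplier from the slide: scales horizontal RMSE to 95% confidence.
ACCURACY_FACTOR = 1.7308


def horizontal_accuracy_95(distances_ft):
    """Return horizontal accuracy (feet) at 95% confidence.

    distances_ft -- distances in feet between each DPS point and its
                    correlating OIT point
    """
    if not distances_ft:
        raise ValueError("at least one distance is required")
    # Root-mean-square of the distances: SQRT(sum of squares / n).
    rmse = math.sqrt(sum(d * d for d in distances_ft) / len(distances_ft))
    return ACCURACY_FACTOR * rmse
```

Because the distances are squared before averaging, a few large apartment errors raise the result much more than many small house errors, which is consistent with the pattern reported above.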
22. Data Quality - Logical Consistency
•DPS points are geocoded to Denver Public Road data.
•Lines are used to link geocoded points to respective actual DPS points
•The goal is to identify logical consistency errors in DPS data; however, because points are geocoded to Denver Public Roads data, errors could arise from either dataset
•Sequential Error – These are errors in the sequence/order of address numbers
•Parity Error – These are errors in odd/even positioning of address locations and address numbers
•Out-Of-Range – These are errors in the placement of the address point well beyond the range allowed in the road centerline data for the same section of road sampled
•Anomaly – An inconsistency in the sequence, parity, or range of an address point, but which is not inconsistent with verified field values
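The first three error types above lend themselves to simple rule checks. The sketch below is one possible way to flag them, not the method used in the study. The function names, the address-range fields, and the assumption that numbers ascend along a segment are all hypothetical.

```python
def range_and_parity_issues(number, range_low, range_high, side_parity):
    """Flag parity and out-of-range problems for one address number.

    range_low, range_high -- address range of the road centerline segment
    side_parity           -- 0 if that side of the street is even, 1 if odd
    """
    issues = []
    if number % 2 != side_parity:
        issues.append("parity")
    # Flags any number outside the range; a tolerance could be added to
    # catch only points "well beyond" it, as the slide describes.
    if not range_low <= number <= range_high:
        issues.append("out-of-range")
    return issues


def sequence_errors(numbers_along_segment):
    """Return positions where an address number is lower than the one before.

    Assumes numbers should ascend in the direction the points are listed.
    """
    return [
        i
        for i in range(1, len(numbers_along_segment))
        if numbers_along_segment[i] < numbers_along_segment[i - 1]
    ]
```

A flagged point would still need review against field values, since the definition of an anomaly covers inconsistencies that turn out to be correct on the ground.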
23. Data Quality - Temporal Quality
•Temporal quality assesses the frequency and types of modifications and updates made to the data set
•11-16-2010 to 10-24-2013
•Improvements can still be made
•It is important to begin tracking timing difference between data collection and data updating
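Tracking the timing difference between collection and updating could be as simple as storing two dates per record and computing the gap. This is a minimal sketch. The function name `update_lag_days` and the idea of per-record collection and update dates are assumptions, not fields confirmed by the slides.

```python
from datetime import date


def update_lag_days(collected, updated):
    """Return the number of days between data collection and data update.

    collected, updated -- datetime.date values for one address record
    """
    if updated < collected:
        raise ValueError("update date precedes collection date")
    return (updated - collected).days
```

Summarizing these lags across records (for example, the median and the maximum) would give one measurable indicator of temporal quality.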
24. Data Quality - Thematic Accuracy
•Thematic errors are errors present in the attribution of each point
•Example – 310 Blake Street is in fact 312 Blake Street
•9 errors were found in private home neighborhoods of which there are 436
•Selected sample of 25 errors from apartment locations
•15 Duplicates
•6 Incomplete Addresses
•4 Does Not Exist
•Apartments again the biggest culprit
•Further investigation into the level of thematic error in apartment complexes may be necessary to sufficiently characterize the quality, but may also be significantly ambiguous
•Best guess in determining whether it is a thematic error (wrong address number) or another type of error (positional accuracy error, logical consistency anomaly) is suspect, esp. w/apartments
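The breakdown of the 25 sampled apartment errors can be expressed as shares of the sample. The counts below are the ones reported on this slide; the variable names are illustrative.

```python
# Error counts from the apartment sample reported on the slide.
apartment_errors = {
    "Duplicates": 15,
    "Incomplete Addresses": 6,
    "Does Not Exist": 4,
}

total_errors = sum(apartment_errors.values())

# Share of the sample accounted for by each error type, in percent.
error_shares = {
    kind: 100 * count / total_errors
    for kind, count in apartment_errors.items()
}

for kind, share in error_shares.items():
    print(f"{kind}: {share:.0f}% of sampled errors")
```

Duplicates account for the majority of the sampled apartment errors, which is consistent with the stacking issue noted under positional accuracy.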
25. Data Quality - Thematic Accuracy