Colorado State Address Dataset 
Data Quality 
Nathan Lowry, GIS Outreach Coordinator 
State of Colorado 
September 24, 2014
Data Quality 
Two Tracks: 
1. Develop criteria and measure quality 
•Develop quality measures in relation to ISO standards 
•Draw from measures in standards and practice 
2. Compare for potential corrective actions 
•Master Street Address Guide (MSAG) and ALI 
•US Postal Service Address Quality Improvement (CASS) 
•Statewide Voter Registration System (SCORE) 
•Motorist Insurance Identification Database (MIIDB)
ISO 19157 Geographic information - Data quality 
Provides comprehensive definitions and testing guidance for measuring data quality: 
completeness: presence and absence of features, their attributes and relationships 
•commission: excess data present in a dataset 
•omission: data absent from a dataset 
logical consistency: degree of adherence to logical rules of data structure, attribution and relationships (data structure can be conceptual, 
logical or physical) 
•conceptual consistency: adherence to rules of the conceptual schema 
•domain consistency: adherence of values to the value domains 
•format consistency: degree to which data is stored in accordance with the physical structure of the dataset 
•topological consistency: correctness of the explicitly encoded topological characteristics of a dataset 
positional accuracy: accuracy of the position of features 
•absolute (or external) accuracy: closeness of reported coordinate values to values accepted as or being true 
•relative (or internal) accuracy: closeness of the relative positions of features in a dataset to their respective relative positions 
accepted as or being true 
•gridded data position accuracy: closeness of gridded data position values to values accepted as or being true. 
temporal quality: accuracy of the temporal attributes and temporal relationships of features 
•accuracy of a time measurement: correctness of the temporal references of an item (reporting of error in time measurement) 
•temporal consistency: correctness of ordered events or sequences, if reported 
•temporal validity: validity of data with respect to time 
thematic accuracy: accuracy of quantitative attributes and the correctness of non-quantitative attributes and of the classifications of 
features and their relationships. 
•classification correctness: comparison of the classes assigned to features or their attributes to a universe of discourse (e.g. 
ground truth or reference dataset) 
•non-quantitative attribute correctness: correctness of non-quantitative attributes 
•quantitative attribute accuracy: accuracy of quantitative attributes 
Governor's Office of Information Technology ~ Executive Leadership Team
Determining Sampling Size 
Sample Size and Confidence Interval Tutorial 
● The confidence interval (commonly referred to as the margin of error or error rate) is the plus-or-minus 
figure you hear mentioned relative to surveys or opinion polls. For example, if you use a confidence 
interval of 4 and 47% of your sample picks an answer, you can be "sure" that if you had asked the 
question of the entire relevant population, between 43% (47-4) and 51% (47+4) would have picked that 
answer. Most researchers prefer a confidence interval of less than 4 percentage points. 
● The confidence level tells you how sure you can be. Expressed as a percentage, it represents how often 
the true percentage of the population who would pick an answer lies within the confidence interval. The 
95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% 
certain. Most researchers use the 95% confidence level. 
● When you put the confidence level and the confidence interval together, you can say (for example) that 
you are 95% sure that the true percentage of the population is between 43% and 51%. 
● The wider the confidence interval (higher margin of error) you are willing to accept, the more certain 
you can be that the whole population answers would be within that range. For example, if you asked a 
sample of 1000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be 
very certain that between 40 and 80% (a margin of ±20 percentage points) of all the people in the city 
actually do prefer that brand. However, you cannot be so sure that between 59 and 61% (a margin of ±1 
percentage point) of the people in the city prefer the brand. 
Data Quality - Sampling Size 
With a confidence interval of 3 percentage points and a 95% confidence level:
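The sample size implied by a chosen margin and confidence level can be estimated with the standard formula for a proportion. The sketch below is an illustration only, not the calculator used for the slides: the function name `sample_size`, the default proportion of 0.5 (the most conservative choice), and the optional finite population correction are all assumptions.

```python
import math

# z-scores for common two-sided confidence levels (standard normal quantiles).
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}


def sample_size(margin, confidence=0.95, population=None, proportion=0.5):
    """Return the number of samples needed for a given margin of error.

    margin      -- half-width of the interval as a fraction (0.03 = 3 points)
    confidence  -- confidence level; must be a key of Z_SCORES
    population  -- total number of records, or None for a very large population
    proportion  -- expected proportion; 0.5 gives the largest sample
    """
    z = Z_SCORES[confidence]
    n = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction: smaller datasets need fewer samples.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(0.03))                      # very large population
    print(sample_size(0.03, population=10_000))   # hypothetical 10,000 points
```

With a 3-point margin at 95% confidence, the uncorrected formula asks for a little over a thousand samples, and the correction lowers that figure as the dataset gets smaller.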
Data Quality - Sampling Method 
1. Randomly select 5 address points 
2. Select road segments associated with address points 
3. Select adjacent connected road segments 
4. Select the address points associated with the selected 
road segments 
5. Repeat steps 3 & 4 until sample size is exceeded
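The five steps can be sketched as a selection that grows outward along the road network. This is a minimal sketch under stated assumptions: the lookups `point_segment` (address point to road segment) and `adjacency` (segment to connected segments) are hypothetical inputs, and the sample size is treated as a single total across all seed points, which the slides do not specify.

```python
import random


def cluster_sample(point_segment, adjacency, target_size, n_seeds=5, rng=None):
    """Grow a sample of address points outward from random seed points.

    point_segment -- dict: address point id -> road segment id
    adjacency     -- dict: road segment id -> iterable of connected segment ids
    target_size   -- stop once more than this many points are selected
    """
    rng = rng or random.Random()

    # Invert the lookup so each segment lists its address points.
    segment_points = {}
    for point, segment in point_segment.items():
        segment_points.setdefault(segment, set()).add(point)

    # Step 1: randomly select the seed address points.
    seeds = rng.sample(sorted(point_segment), min(n_seeds, len(point_segment)))
    # Step 2: select the road segments associated with the seeds.
    segments = {point_segment[p] for p in seeds}
    points = set(seeds)

    # Step 5: repeat steps 3 and 4 until the sample size is exceeded.
    while len(points) <= target_size:
        # Step 3: add road segments connected to the current selection.
        grown = set(segments)
        for seg in segments:
            grown.update(adjacency.get(seg, ()))
        # Step 4: add the address points on every selected segment.
        new_points = set(points)
        for seg in grown:
            new_points.update(segment_points.get(seg, ()))
        if grown == segments and new_points == points:
            break  # network exhausted before reaching the target
        segments, points = grown, new_points

    return points, segments
```

Because whole road segments are added at a time, the result usually overshoots the target, which matches the "until sample size is exceeded" wording of step 5.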
Data Quality: The DPS-1 Universe 
Data Quality: DPS-1 Sample Sites 
Data Quality: DPS-1 Sample Sites 1 & 2 
Data Quality: DPS-1 Sample Sites 3, 4, and 5 
Data Quality - Completeness 
•Omission – A correct location that is missed (a point present in OIT data but missing from DPS data) 
•Commission – A location point created in error (a point present in DPS data that does not exist in OIT data) 
•Omissions and commissions are defined on the assumption that the OIT data is correct 
Results 
•We weight the omissions and commissions equally using this formula – 
0.5(OmissionPct) + 0.5(CommissionPct) = Overall Percent Score 
•Apartments Only = 70.91% 
•Houses and Commercial = 89.59% 
•All = 75.68% 
•Not reflective of all DPS addresses 
•Apartment inaccuracies sway aggregate percentage heavily 
•Apartments greatest area of concern 
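The completeness check and the equal weighting can be sketched as follows. This is an illustration under assumptions: points are matched between the two datasets by a shared identifier, which the slides do not state, and the function names are hypothetical. How OmissionPct and CommissionPct are derived from the counts is not given in the source, so the scoring function takes them as inputs.

```python
def completeness_errors(reference_ids, test_ids):
    """Return (omissions, commissions) of the tested data against the reference.

    reference_ids -- ids of points in the dataset assumed correct (OIT)
    test_ids      -- ids of points in the dataset being assessed (DPS)
    """
    reference, test = set(reference_ids), set(test_ids)
    omissions = reference - test      # in the reference, missing from the test data
    commissions = test - reference    # in the test data, absent from the reference
    return omissions, commissions


def overall_score(omission_pct, commission_pct, omission_weight=0.5):
    """Weighted combination; the default 0.5 weights both terms equally."""
    commission_weight = 1.0 - omission_weight
    return omission_weight * omission_pct + commission_weight * commission_pct


if __name__ == "__main__":
    # Hypothetical ids and percentages, for illustration only.
    print(completeness_errors({1, 2, 3}, {2, 3, 4}))
    print(overall_score(80.0, 90.0))
```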
Data Quality - Positional Accuracy 
•OIT locations and DPS locations compared spatially 
•Line segments are created to link each DPS location to its corresponding OIT point 
•Severe errors primarily present in apartment locations. 
•Few true house errors; most are inconsistencies with OIT points due to the use of laser range finders 
Issues with Apartments 
•Stacking – Many apartments stacked on top of each other in one location 
•Consequently, lack of spatial differentiation is present 
•Spatial inaccuracy is significant 
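$1.7308 \times \sqrt{(\Delta_1^2 + \Delta_2^2 + \Delta_3^2 + \dots + \Delta_n^2)/n}$ 
Where – 
● 1.7308 = multiplier that converts the horizontal root-mean-square error to accuracy at 95% confidence 
● $\Delta_i$ = Distance (Feet) between the i-th DPS location and its corresponding OIT point 
● n = Number of Distances 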
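The statistic above is the root-mean-square of the link distances scaled by 1.7308, the multiplier used in the National Standard for Spatial Data Accuracy (NSSDA) to report horizontal accuracy at the 95% confidence level. A minimal sketch, with a hypothetical function name and made-up distances:

```python
import math

# NSSDA multiplier from radial RMSE to horizontal accuracy at 95% confidence.
NSSDA_HORIZONTAL_FACTOR = 1.7308


def horizontal_accuracy_95(distances_ft):
    """Return horizontal accuracy (feet) at 95% confidence.

    distances_ft -- distances in feet between each tested point and its
                    reference point (the link-line lengths)
    """
    if not distances_ft:
        raise ValueError("at least one distance is required")
    rmse = math.sqrt(sum(d * d for d in distances_ft) / len(distances_ft))
    return NSSDA_HORIZONTAL_FACTOR * rmse


if __name__ == "__main__":
    # Hypothetical distances; one large offset dominates the result.
    print(horizontal_accuracy_95([5.0, 8.0, 12.0, 150.0]))
```

Because the distances are squared before averaging, a few large offsets (such as stacked apartment points) raise the result far more than many small ones.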
Houses and Commercial = 38 feet horizontal accuracy at 95% confidence 
Apartments = 125 feet horizontal accuracy at 95% confidence 
All = 105 feet horizontal accuracy at 95% confidence 
•We again see the apartments swaying the overall results, while houses feature far less error
Data Quality - Logical Consistency 
•DPS points are geocoded to Denver Public Road data. 
•Lines are used to link geocoded points to respective actual DPS points 
•Goal is to identify logical consistency errors in DPS data, however – 
•Points are geocoded to Denver Public Roads data, thus errors could arise from either side 
•Sequential Error – These are errors in the sequence/order of address numbers 
•Parity Error – These are errors in odd/even positioning of address locations and address numbers 
•Out-Of-Range – These are errors in the placement of the address point well beyond the range allowed in the road centerline data for the same section of road sampled 
•Anomaly – An inconsistency in the sequence, parity, or range of an address point, but which is not inconsistent with verified field values 
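The first three error types lend themselves to simple rule checks against the road centerline address ranges. The sketch below assumes a hypothetical schema (`SegmentSide` with a low number, a high number, and a parity) and applies the range strictly, whereas the slide flags only points "well beyond" the range. Anomalies need field verification and are not modeled.

```python
from dataclasses import dataclass


@dataclass
class SegmentSide:
    """Address range for one side of a road segment (hypothetical schema)."""
    low: int
    high: int
    parity: str  # "odd" or "even"


def check_address(number, side):
    """Return the range and parity errors for an address number."""
    errors = []
    if not side.low <= number <= side.high:
        errors.append("out-of-range")
    expected_remainder = 1 if side.parity == "odd" else 0
    if number % 2 != expected_remainder:
        errors.append("parity")
    return errors


def sequence_errors(numbers_along_road):
    """Return positions where an address number breaks ascending order.

    numbers_along_road lists address numbers in the order the points
    occur along the segment, from the low end to the high end.
    """
    return [
        i for i in range(1, len(numbers_along_road))
        if numbers_along_road[i] < numbers_along_road[i - 1]
    ]
```

As the slide notes, a flagged point may reflect an error in the road data as much as in the address point, so these checks identify candidates for review, not confirmed errors.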
Data Quality - Temporal Quality 
•Temporal quality assesses the frequency and types of modifications and updates made to the 
data set 
•11-16-2010 to 10-24-2013 
•Improvements can still be made 
•It is important to begin tracking timing difference between data collection and data updating 
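Tracking the timing difference comes down to storing two dates per record and measuring the gap. A minimal sketch, assuming each record carries a collection date and an update date (both field names are hypothetical) and reading the slide's dates as month-day-year:

```python
from datetime import date
from statistics import mean


def days_between(start, end):
    """Number of days from start to end."""
    return (end - start).days


def mean_update_lag(records):
    """Average lag in days between collection and update.

    records -- iterable of (collected_date, updated_date) pairs
    """
    return mean(days_between(collected, updated) for collected, updated in records)


if __name__ == "__main__":
    # Length of the edit history reported above (11-16-2010 to 10-24-2013).
    print(days_between(date(2010, 11, 16), date(2013, 10, 24)))
    # Hypothetical records, for illustration only.
    print(mean_update_lag([
        (date(2013, 1, 1), date(2013, 1, 11)),
        (date(2013, 2, 1), date(2013, 2, 21)),
    ]))
```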
Data Quality - Thematic Accuracy 
•Thematic errors are errors present in the attribution of each point 
•Example – 310 Blake Street is in fact 312 Blake Street 
•9 errors were found in private home neighborhoods of which there are 436 
•Selected sample of 25 errors from apartment locations 
•15 Duplicates 
•6 Incomplete Addresses 
•4 Does Not Exist 
•Apartments again the biggest culprit 
•Further investigation into the level of thematic error in apartment complexes may be 
necessary to sufficiently characterize the quality, but may also be significantly ambiguous 
•Best guess in determining whether it is a thematic error (wrong address number) or another 
type of error (positional accuracy error, logical consistency anomaly) is suspect, especially with apartments
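The counts above can be turned into rates with simple arithmetic. The sketch assumes that 436 is the number of private-home addresses checked, which the slide leaves ambiguous. The apartment figures describe the makeup of the 25 sampled errors, not an error rate for apartments.

```python
def rate(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Assumption: 436 is the number of private-home addresses checked.
house_error_rate = rate(9, 436)

# Breakdown of the 25 sampled apartment errors by type.
apartment_errors = {
    "Duplicates": 15,
    "Incomplete Addresses": 6,
    "Does Not Exist": 4,
}
apartment_shares = {
    kind: rate(count, sum(apartment_errors.values()))
    for kind, count in apartment_errors.items()
}

if __name__ == "__main__":
    print(f"Private homes: {house_error_rate:.2f}% thematic error")
    for kind, share in apartment_shares.items():
        print(f"{kind}: {share:.0f}% of sampled apartment errors")
```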
Questions? 
Thank You!
