Data Cleaning and Data Publishing Workshop
18-20 February 2013, Nairobi, Kenya
Javier Otegui
@jotegui
PRIMARY DATA PRECISION
 Primary Biodiversity Data
 Information directly collected in the field
 What has been collected where
 When is also important for multiple uses
 Types of PBD - occurrences
 Museum specimens
 Field observations
 Fossils, literature, germplasm…
DEFINITIONS - PBD
 Precision
  Closeness of repeated measurements to one another
 Accuracy
 Closeness of a measurement to the true value
DEFINITIONS – PRECISION AND
ACCURACY
 Accuracy depends on knowing the true value of the variable
 Precision is intrinsic to the measurements: it can be assessed without knowing the true value
DEFINITIONS – PRECISION AND
ACCURACY
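The two definitions can be illustrated with a minimal Python sketch. The function name and the sample readings are hypothetical, and using the standard deviation for precision and the mean error for accuracy is one common choice, not the only one.

```python
from statistics import mean, stdev

def precision_and_accuracy(measurements, true_value=None):
    """Return (spread, bias) for a set of repeated measurements.

    spread: sample standard deviation (smaller = more precise).
    bias:   mean error against true_value (closer to 0 = more accurate),
            or None when the true value is unknown.
    """
    spread = stdev(measurements)
    bias = None if true_value is None else mean(measurements) - true_value
    return spread, bias

# Hypothetical repeated latitude readings of one point (true latitude -1.2219)
tight_but_off = [-1.3001, -1.3002, -1.3000, -1.3001]   # precise, inaccurate
loose_but_centred = [-1.20, -1.25, -1.21, -1.23]       # imprecise, accurate

print(precision_and_accuracy(tight_but_off, -1.2219))
print(precision_and_accuracy(loose_but_centred, -1.2219))
print(precision_and_accuracy(tight_but_off))  # bias is None: no true value
```

The spread can always be computed, while the bias needs a reference value, which mirrors the point that precision is intrinsic and accuracy is not.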
 Geospatial data best representation: coordinates
 Standard way of reporting coordinates: decimal degrees
 Precision in geospatial data ~ precision in coordinates
 Geospatial information in several formats
GEOSPATIAL DATA - PRECISION
-1.2, 36.8
-1.2219, 36.8967
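As a rough sketch of what the number of decimals means on the ground, the Python snippet below estimates the size of one unit in the last reported decimal place. The constant of about 111,320 m per degree of latitude is an approximation, and the function names are made up for illustration. Coordinates are passed as strings because a float would lose trailing zeros.

```python
import math

METERS_PER_DEGREE_LAT = 111_320  # approximate value, assumed for illustration

def decimal_places(text):
    """Count the digits after the decimal point in a coordinate string."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)

def coordinate_resolution_m(lat_text, lon_text):
    """Approximate ground size (north-south, east-west), in metres,
    of one unit in the last reported decimal place."""
    lat = float(lat_text)
    lat_step = 10 ** -decimal_places(lat_text)
    lon_step = 10 ** -decimal_places(lon_text)
    north_south = lat_step * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = lon_step * METERS_PER_DEGREE_LAT * math.cos(math.radians(lat))
    return north_south, east_west

print(coordinate_resolution_m("-1.2", "36.8"))        # roughly 11 km each way
print(coordinate_resolution_m("-1.2219", "36.8967"))  # roughly 11 m each way
```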
GEOSPATIAL DATA - PRECISION
55.932576, 13.132359
Anahuac NWR (UTC 049)
Grandville
POINT(-1.3223333 53.44958)
Marine Nature Study Area
78º 47’ 52” S; 35º 50’ 31” E
Stewart Park
POINT(-1.1735004 53.358746)
Backyard
My Habitat
55.932576, 13.132359
Wilderness Park, north of 14th St.
28054
Delaney Conservation Area
57.3, 11.9
…
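Of the formats listed above, degrees-minutes-seconds (DMS) is the one that converts mechanically to decimal degrees (DD). A minimal Python sketch, with hypothetical function names and a deliberately loose parser that only looks for three numbers and a hemisphere letter:

```python
import re

def dms_to_dd(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal degrees
    return -value if hemisphere.upper() in ("S", "W") else value

def parse_dms(text):
    """Parse one DMS coordinate such as 78º 47’ 52” S, whatever symbols separate the numbers."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    hemisphere = re.search(r"[NSEW]", text.upper())
    if len(numbers) != 3 or hemisphere is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    d, m, s = (float(n) for n in numbers)
    return dms_to_dd(d, m, s, hemisphere.group())

lat_text, lon_text = "78º 47’ 52” S; 35º 50’ 31” E".split(";")
lat, lon = parse_dms(lat_text), parse_dms(lon_text)
print(round(lat, 6), round(lon, 6))
```

Place names such as "Backyard" or "Stewart Park" cannot be handled this way; they need georeferencing against a gazetteer or the original collector's notes.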
 Low precision in the original data cannot be fixed afterwards
GEOSPATIAL DATA - PRECISION
 Sometimes, low precision data is encouraged
 Endangered species
 Commercially interesting species
 …
 Sensitive data should not be directly available in high-resolution (precise) format
  Good practice: provide low-precision information but keep the original high-precision data
 Level of imprecision depends on level of threat
GEOSPATIAL DATA - PRECISION
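The good practice above can be sketched in Python as follows. The record type, the threat levels and the number of decimals published per level are all hypothetical; a real policy would set them per species and per region.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Occurrence:
    species: str
    lat: float
    lon: float

# Hypothetical policy: the higher the threat, the fewer decimals are published
DECIMALS_BY_THREAT = {"low": 2, "medium": 1, "high": 0}

def public_view(record, threat_level):
    """Return a generalized copy for publication; the original record is untouched."""
    decimals = DECIMALS_BY_THREAT[threat_level]
    return Occurrence(
        record.species,
        round(record.lat, decimals),
        round(record.lon, decimals),
    )

original = Occurrence("Species A", -1.2219, 36.8967)  # kept in the private store
published = public_view(original, "medium")           # shared with the public
print(original)
print(published)
```

Because the published record is a separate copy, the high-precision original remains available to curators and authorized users.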
 Low precision = reduced usability of the data
 Low accuracy = wrong conclusions if used without caution
 Causes:
 Malfunction of devices
 Wrong interpretation in transformations between systems
 Issues in digitization
GEOSPATIAL DATA - ACCURACY
 Transformations are prone to errors if not handled carefully
 Wrong formula when converting DMS to DD
 Wrong datum when converting UTM to DD
 Issues at the time of digitization
  Transposed coordinates: 45.34, -9.16 => -9.16, 45.34
  Missing minus sign: 45.34, -9.16 => 45.34, 9.16
  Decimal comma instead of period: 45.34, -9.16 => 45,34, -9,16
  Coordinates replaced by zeros: 45.34, -9.16 => 0, 0
 …
 Some methods could reduce precision to gain accuracy
GEOSPATIAL DATA - ACCURACY
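Some of the digitization issues above can be flagged automatically. The Python sketch below is one possible approach, not a standard tool: the function names and the bounding box of the expected study area are hypothetical. A suggested variant is only a hint for a curator, not a correction.

```python
def parse_decimal(text):
    """Parse a number that may use a decimal comma, e.g. '45,34'."""
    return float(text.strip().replace(",", "."))

def coordinate_flags(lat, lon):
    """Return a list of warnings for a (lat, lon) pair in decimal degrees."""
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero coordinates")
    if not -90 <= lat <= 90:
        flags.append("latitude out of range")
    if not -180 <= lon <= 180:
        flags.append("longitude out of range")
    return flags

def inside(lat, lon, bbox):
    """bbox is (min_lat, max_lat, min_lon, max_lon)."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def candidate_fixes(lat, lon, bbox):
    """Try a swap and sign flips; return the variants that fall inside bbox."""
    variants = {
        "as given": (lat, lon),
        "swapped": (lon, lat),
        "latitude sign flipped": (-lat, lon),
        "longitude sign flipped": (lat, -lon),
    }
    return {name: pair for name, pair in variants.items() if inside(*pair, bbox)}

STUDY_AREA = (40, 50, -15, -5)  # hypothetical area where records are expected

print(coordinate_flags(0, 0))
print(candidate_fixes(-9.16, 45.34, STUDY_AREA))  # transposed record
print(candidate_fixes(45.34, 9.16, STUDY_AREA))   # record missing its minus sign
```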
 Precision as the completeness of the taxonomic hierarchy
  Depends on the lowest taxonomic rank that has information
  Lowest level = genus: fairly precise, broader usability
  Lowest level = class: imprecise, narrow usability
 Threshold depends on several factors:
 Scope of the analyses
 Taxonomic group
TAXONOMIC DATA - PRECISION
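A minimal Python sketch of taxonomic precision as "lowest rank with information". The rank list, the dictionary layout and the default threshold of genus are assumptions for illustration; the right threshold depends on the analysis and the taxonomic group.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def lowest_rank(record):
    """Return the most specific rank that has a non-empty value, or None."""
    for rank in reversed(RANKS):
        if (record.get(rank) or "").strip():
            return rank
    return None

def precise_enough(record, threshold="genus"):
    """True when the record is identified at least down to the threshold rank."""
    rank = lowest_rank(record)
    return rank is not None and RANKS.index(rank) >= RANKS.index(threshold)

bird = {"kingdom": "Animalia", "class": "Aves", "genus": ""}
cat = {"kingdom": "Animalia", "class": "Mammalia", "genus": "Panthera"}

print(lowest_rank(bird), precise_enough(bird))
print(lowest_rank(cat), precise_enough(cat))
```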
 Inaccuracy is mainly due to one of these two:
 Use of a wrong taxonomy
 Inaccuracies in the identification
 Taxonomic hierarchies are subjective and different taxonomies exist
 With poor or incomplete data, how far can the identification be trusted?
  See the “Taxonomic assessments” section
TAXONOMIC DATA - ACCURACY
 Wrong identification of organism, due to:
 Lack of expertise
 Bad identification environment
 Expert curation needed to improve reliability of identification
 Annotations, flags, reviews… do not prevent the issue, but help resolve it
 Different PBD types = different reliability
  Museum specimens – reviewable, higher reliability
  Field observations – reliability depends on expertise, non-reviewable
TAXONOMIC DATA - ACCURACY
 Precision refers to the degree of completion of the date elements
 The Darwin Core standard recommends ISO 8601
 Wide range of date formats
 Canonical: YYYY-MM-DD
 Reduced formats: YYYY-MM, YYYY, CC
 Problems:
 Usability of low-precision dates
 Ambiguity of some formats: 19 = 1919 / XIX / 2019?
 Solution relies on solid date parsers or human interpretation
  Parsers: hard to build
  Humans: easily overwhelmed
TEMPORAL DATA - PRECISION
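A simple way to report the precision of a date string is to check which of the formats listed above it matches. The Python sketch below is a shape check only: it does not validate the values, and it reads a two-digit string as a century (CC), which is just one of the ambiguous readings mentioned above. The names are hypothetical.

```python
import re

# Ordered from most to least precise; the first full match wins
PATTERNS = [
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("year", re.compile(r"\d{4}")),
    ("century", re.compile(r"\d{2}")),
]

def date_precision(text):
    """Return 'day', 'month', 'year' or 'century', or None for other formats."""
    candidate = text.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(candidate):
            return name
    return None

for value in ["2013-02-18", "2013-02", "2013", "19", "18/02/2013"]:
    print(value, "->", date_precision(value))
```

Anything that returns None still needs a more capable parser or a human.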
 Element swapping
 Information in the wrong field
 Sometimes self-detectable – 2012-19-02
 Best solution: go back to the original data
 Misspellings
  Date shrink – 1996 => 196
  Date change – 1996 => 1986
 Again, best solution: go back to the original data
TEMPORAL DATA - ACCURACY
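The self-detectable case of element swapping can be sketched in Python: a date that is invalid as YYYY-MM-DD but valid as YYYY-DD-MM is probably swapped. The function name and status labels are hypothetical. A swap such as 2012-03-02 for 2012-02-03 is valid both ways and cannot be caught like this, which is why going back to the original data remains the best solution.

```python
from datetime import date

def check_iso_date(text):
    """Return (status, value) for a YYYY-MM-DD string.

    status is 'ok', 'possibly swapped' or 'invalid'.
    """
    try:
        year, month, day = (int(part) for part in text.split("-"))
    except ValueError:
        return "invalid", None
    try:
        return "ok", date(year, month, day)
    except ValueError:
        pass
    try:
        # Invalid as given: see whether swapping month and day yields a real date
        return "possibly swapped", date(year, day, month)
    except ValueError:
        return "invalid", None

print(check_iso_date("2012-02-19"))
print(check_iso_date("2012-19-02"))
print(check_iso_date("2012-19-40"))
```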
 Low precision – reduces the range of possible uses
  Still usable for many applications
 Geospatial – regional or national checklists
 Taxonomic – large group assessments
 Temporal – large-scale assessments, such as climate change
 Still, a minimum precision is required
 Low accuracy – can lead to wrong conclusions
 Reducing precision can mask inaccuracies
 Inaccuracies might be hard to spot
  Collating accurate and inaccurate data helps error detection
GENERAL IMPLICATIONS
GENERAL IMPLICATIONS
Year: 1996, 1997, 1996, 1998, 1998, 1998, 1986, 1997, 1995, 1996, 1998, 1997, 1998, 1996, …
1986 is the only record outside 1995-1998
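This kind of check can be automated. The Python sketch below flags years that lie far from the median of the series, using the median absolute deviation (MAD); the method and the cut-off `k=3` are assumptions chosen for illustration, and only the years visible on the slide are used.

```python
from statistics import median

def flag_outlier_years(years, k=3):
    """Return the years farther than k * MAD from the median of the series."""
    centre = median(years)
    mad = median(abs(y - centre) for y in years)
    # Note: when mad is 0, every year different from the median is flagged
    return [y for y in years if abs(y - centre) > k * mad]

years = [1996, 1997, 1996, 1998, 1998, 1998, 1986,
         1997, 1995, 1996, 1998, 1997, 1998, 1996]

print(flag_outlier_years(years))
```

A flagged year is only a candidate error (here, possibly 1996 typed as 1986); it still has to be checked against the original data.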
A little is better than nothing
Absence of data could be seen as better than wrong data
Vague and/or wrong data, together with good data, can help in detecting issues
CONCLUSION
