Data Cleaning and Data Publishing Workshop
18-22 February 2013, Nairobi, Kenya
Javier Otegui
@jotegui
ERROR FLAGGING
INTRODUCTION – ERROR FLAGGING
• What is flagging?
  - Adding a piece of information to a record or PBD (primary biodiversity data)
  - Gives extra information about the record
  - Used especially to highlight records, to inform the collector or the user
• Aims of error flagging:
  - Provide a simple way of filtering records that might be problematic
  - Very useful for automated error processing
  - Reporting issues to the owner
• Difference between flagging and resolving:
  - Ownership
INTRODUCTION – OWNERSHIP
DATA IS OURS
• We are directly responsible for the quality
• We may share the master copy of the data
• We can directly improve the quality of the data and serve it
DATA IS NOT OURS
• We are not directly responsible for the quality
• We point to the original source
• We cannot directly improve the quality of the data and serve it
INTRODUCTION – ERROR FLAGGING
• Why flag and not resolve? Attribution and persistence
INTRODUCTION – ATTRIBUTION
• Data from an aggregator: certain restrictions or conditions
• Acknowledge the original source of the data
• Each collection might have additional rules
INTRODUCTION – PERSISTENCE
• Persistence of the correction
• Local work = no permanence of corrections
• Next researcher must repeat the cleaning process
• Error flagging is an excellent tool for reporting issues
• Once reported, owners can clean the data
• Example of flagging: annotations
INTRODUCTION – MECHANISMS
• Data manipulation: add a piece of information to the original record
• New fields, populated if an issue is detected
• Recommendation: use (and document) a codification
Figure: free-text descriptions of the same issue, each shown next to the code 1
  Coordinates swapped       1
  Swapped coordinates       1
  Coordinates transposed    1
  Coordnates transposed     1
  …                         1
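The recommendation above can be sketched in Python. This is only an illustration under stated assumptions: the code table `Flag`, its numeric values, the helper `add_flag` and the field name `flag_codes` are all hypothetical and do not come from the workshop material. The point is that one documented code replaces the many free-text spellings of the same issue, and that the flag lives in a new field while the original fields stay untouched.

```python
from enum import IntEnum


class Flag(IntEnum):
    """Hypothetical, documented issue codes (illustrative values only)."""
    COORDINATES_SWAPPED = 1
    COORDINATES_OUT_OF_RANGE = 2


def add_flag(record: dict, flag: Flag) -> dict:
    """Return a copy of the record with the flag code stored in a new field.

    The original fields are not modified; only 'flag_codes' is added.
    """
    flagged = dict(record)
    codes = list(flagged.get("flag_codes", []))
    if int(flag) not in codes:
        codes.append(int(flag))
    flagged["flag_codes"] = codes
    return flagged


# Example record (made-up values) whose latitude and longitude look swapped.
record = {"id": "rec-1", "latitude": 36.8, "longitude": -1.3}
flagged = add_flag(record, Flag.COORDINATES_SWAPPED)
print(flagged["flag_codes"])
```

Because the code is an integer from a documented table, filtering on it does not depend on how a particular person spelled the problem.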
INTRODUCTION – EXAMPLE: GBIF
• Data Usage Terms
  - Accepted when using the portal
  - Among others, the need to cite the data
• Data Sharing Agreement
  - “GBIF Secretariat may cache a copy and serve full or partial data further to other users together with the terms and conditions for use set by the Data Publisher”
  - “Partial” is based on detected quality issues
• How does GBIF detect issues?
  - Processing routines search for the most common issues
  - Errors are flagged; GBIF cannot alter the data
  - Flags are used to alert users and are reported back to owners
INTRODUCTION – EXAMPLE: GBIF
Coordinates fall outside specified country, territory or island
INTRODUCTION – EXAMPLE: GBIF
138,458 records with coordinates
138,312 records in map
146 records with wrong coordinates
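The three counts are related by 138,458 − 138,312 = 146. A check of the "coordinates fall outside the country" kind can be sketched as follows. This is a simplified illustration, not GBIF's processing routine: it assumes a rectangular bounding box instead of real country boundaries, and the box values, the sample records and the field name `coordinates_outside_country` are all made up for the example.

```python
def outside_bounding_box(lat, lon, box):
    """True if (lat, lon) lies outside box = (min_lat, max_lat, min_lon, max_lon)."""
    min_lat, max_lat, min_lon, max_lon = box
    return not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon)


# Rough, illustrative bounding box; NOT an authoritative boundary.
EXAMPLE_BOX = (-5.0, 5.5, 33.5, 42.0)

# Made-up sample records.
records = [
    {"id": "a", "lat": -1.29, "lon": 36.82},  # inside the box
    {"id": "b", "lat": 36.82, "lon": -1.29},  # swapped values, falls outside
    {"id": "c", "lat": 0.5, "lon": 37.0},     # inside the box
]

# Flag each record in a new field; the coordinates themselves are not changed.
for rec in records:
    rec["coordinates_outside_country"] = outside_bounding_box(
        rec["lat"], rec["lon"], EXAMPLE_BOX
    )

with_coords = len(records)
flagged = sum(rec["coordinates_outside_country"] for rec in records)
mappable = with_coords - flagged
print(with_coords, mappable, flagged)
```

The same subtraction applies here as on the slide: records with coordinates, minus records flagged, gives the records that can be placed on the map.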
INTRODUCTION – RESOLUTION PATH
• What happens when errors are flagged?
• Flags or annotations should reach the owner
• Owner is the only one who can solve issues at the source
• Corrected data is then deployed and re-indexed
• This has happened often…
INTRODUCTION – RESOLUTION PATH
[Figure with two panels: Before, After]
INTRODUCTION – RESOLUTION PATH
• Key factor: awareness and involvement of data owners
  - Some owners correct their data
  - Some owners don’t
• Without this step, the process of error flagging loses part of its purpose
ERROR FLAGGING
• Error flagging can be applied to several data storage formats
• Each format has its own requirements
• Formats:
  - Text files: tab-delimited, CSV files…
  - Spreadsheets: LibreOffice Calc, Google Spreadsheets, Microsoft Office…
  - Database tables
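For the text-file case, flagging can be sketched with Python's standard `csv` module. This is a minimal illustration under assumptions: the column names (`latitude`, `flag`), the flag values and the helper `flag_csv` are hypothetical, and the data is made up. The rows are copied to a new file with one extra column, so the source file is never edited.

```python
import csv
import io


def flag_csv(source, target):
    """Copy rows from source to target, adding a 'flag' column.

    Values in the existing columns are written back exactly as read.
    """
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["flag"]
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        try:
            lat = float(row["latitude"])
            row["flag"] = "" if -90 <= lat <= 90 else "LAT_OUT_OF_RANGE"
        except (TypeError, ValueError):
            row["flag"] = "LAT_NOT_NUMERIC"
        writer.writerow(row)


# In-memory stand-ins for the original file and the flagged copy.
original = io.StringIO(
    "id,latitude,longitude\n"
    "1,-1.29,36.82\n"
    "2,136.82,-1.29\n"
    "3,n/a,36.80\n"
)
flagged_copy = io.StringIO()
flag_csv(original, flagged_copy)
print(flagged_copy.getvalue())
```

With real files, `source` and `target` would be two different file objects opened with `newline=""`, as the `csv` documentation recommends.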
ERROR FLAGGING – SPREADSHEETS
• In some respects, the most comfortable way of managing data
• Semi-structured, visual management of information
  - Rows, columns and cells
  - Cells are not restricted to any specific type of data
  - Records can be plotted in several ways
• Calculations with cells
• Some of the most common operations:
  - Sorting
  - Filtering
  - Conditional formatting
  - Controlled vocabulary
  - Visualizations
  - Formulae and advanced scripting
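Three of these operations (controlled vocabulary, filtering and sorting) can be sketched outside a spreadsheet as well. The Python example below is only an illustration: the vocabulary `ALLOWED_BASIS`, the column name `vocabulary_flag` and the sample rows are assumptions made for the example, not part of the workshop material.

```python
# Hypothetical controlled vocabulary for one column.
ALLOWED_BASIS = {"PreservedSpecimen", "HumanObservation"}

# Made-up rows, as they might appear in a sheet.
rows = [
    {"id": 3, "basisOfRecord": "HumanObservation"},
    {"id": 1, "basisOfRecord": "preserved specimen"},
    {"id": 2, "basisOfRecord": "PreservedSpecimen"},
]

# "Formula" step: a new column that is True when the value is not in the vocabulary.
for row in rows:
    row["vocabulary_flag"] = row["basisOfRecord"] not in ALLOWED_BASIS

# Filtering: keep only the flagged rows.
flagged = [row for row in rows if row["vocabulary_flag"]]

# Sorting: flagged rows first, then by id.
ordered = sorted(rows, key=lambda r: (not r["vocabulary_flag"], r["id"]))

print([r["id"] for r in flagged])
print([r["id"] for r in ordered])
```

The original `basisOfRecord` values are left as they are; the flag column is the only thing added, which mirrors adding a helper column in a spreadsheet.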
CONCLUSION
• Error flagging: the process of reporting issues without modifying the original data
• Useful when working with shared data
• In spreadsheets:
  - Simple, yet powerful
  - Adaptable levels of difficulty
  - Several possibilities to filter and flag records
