Creating Clean Data for Publishing (Phase 2)
Background
Clean data for analysis is often not equivalent to clean data for archiving. An archive-ready dataset is structured to support future use and revision. Furthermore, the concept of "clean" varies with data type (e.g., table, image, vector, code).
Objectives
Discuss best practices for formatting a tabular dataset to make it ready to archive.
Identify activities associated with QA/QC.
Tabular data for archiving
The goal is to store the data so that they can be used in automated ways, with minimal human intervention:
● Create a meaningful data structure (tidy data)
○ Easy to maintain, analyze, and reuse
○ Each column = a variable; each row = an observation
● Compile error-free data
○ QA/QC
○ Check for inconsistencies in format or accuracy, impossible values, and sensor drift
Tabular data for archiving
… are often in a different form than data for analysis and presentation. For example, spreadsheets are frequently organized in complex forms that are comprehensible to the eye, or data are prepared as input for specialized software. Archival formats require long-term readability by computers (a simple, consistent format).
Human-readable vs. archive-ready: Precipitation in Four Watersheds by Date, shown in a "wide" format (easy for the eye) and a "long" format (ready to archive).
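As a minimal sketch of that wide-to-long restructuring, here is how it might look in Python with pandas (the deck shows no code; the column names and values below are hypothetical):

import pandas as pd

# Wide format: one row per date, one precipitation column per watershed (human-readable)
wide = pd.DataFrame({
    "date": ["2020-05-01", "2020-05-02"],
    "WS1": [0.0, 5.2],
    "WS2": [0.1, 4.8],
    "WS3": [0.0, 6.0],
    "WS4": [0.2, 5.5],
})

# Long format: each row is one observation (date, watershed, precipitation) -- archive-ready
long_format = wide.melt(id_vars="date", var_name="watershed", value_name="precipitation_mm")
print(long_format)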
What's wrong with this spreadsheet?
● The data are split into mini-tables, and the mini-tables are inconsistent with one another
● Date formats differ
● Codes are inconsistent: plants have flowers, fruit, both, or just leaves, yet extra codes appear. Does Fr+Flwr mean the same as FF?
● Summary information is mixed with raw data
● Text data are mixed with numeric data
● Blank cells rely on "fill down": sort on Species and observations lose their dates
Tidy Phenology Data
● Each row = an observation
● Each column = a variable
● This file is easy to maintain and use
● Dates are computer-readable
● Structure is easy to describe in metadata
How would you make these data tidy? Add a Site column!
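A minimal sketch of that fix, assuming Python/pandas and hypothetical mini-tables, one per site:

import pandas as pd

# Hypothetical mini-tables, one per site, each lacking an explicit Site variable
site_a = pd.DataFrame({"species": ["LATR", "PRGL"], "cover_percent": [12, 30]})
site_b = pd.DataFrame({"species": ["LATR", "OPUN"], "cover_percent": [8, 5]})

# Add a site column to each table, then stack them into one tidy table
tidy = pd.concat(
    [site_a.assign(site="A"), site_b.assign(site="B")],
    ignore_index=True,
)
print(tidy)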
Best practices for tabular data
Some best practices for formatting tabular data:
● File names
● Column names
● Date and time formats
● One value per cell
● Missing value codes
● Flag columns
● Quality Assurance/Quality Control (QA/QC)
Best practice: File names
Use descriptive file names (what, where, when):
● Bad file name: PlotData.xlsx
● Good file name: FCE_SawgrassNPP_2019.xlsx
Store data in a non-proprietary format:
● Excel -> .csv
● Word -> .pdf
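A minimal sketch of the Excel-to-CSV conversion in Python/pandas (the file name comes from the slide; reading .xlsx files typically requires the openpyxl package to be installed):

import pandas as pd

# Read the first sheet of the proprietary Excel workbook
data = pd.read_excel("FCE_SawgrassNPP_2019.xlsx", sheet_name=0)

# Write a plain-text CSV copy for archiving; drop the pandas row index
data.to_csv("FCE_SawgrassNPP_2019.csv", index=False)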
Best practice: Column names
● Use a single header row with column names
● Column names should start with a letter and not include spaces or symbols other than the underscore (e.g., soil_temperature)
● +,-,*,&,^ are often treated as operators and so should not be used in column names
● Don't include units or the definition of the variable in the column name
Bad column name -> Good column name
● DOC Concentration (mg/ml) -> DOC_Concentration
● Fruit/Flower -> FruitFlower or Fruit_Flower
● Fine earth subsample mass, after oven-drying (g) -> FineEarthSubMass
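A minimal sketch of renaming problem columns in Python/pandas (the example table is hypothetical):

import pandas as pd

# Hypothetical table with problematic column names
df = pd.DataFrame({"DOC Concentration (mg/ml)": [0.8, 1.2], "Fruit/Flower": ["Fl", "Fr"]})

# Rename to archive-friendly names: start with a letter; no spaces, symbols, or units
df = df.rename(columns={
    "DOC Concentration (mg/ml)": "DOC_Concentration",
    "Fruit/Flower": "Fruit_Flower",
})
print(df.columns.tolist())  # ['DOC_Concentration', 'Fruit_Flower']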
Best practice: Date and time formats
● 02-03-04 means February 3, 2004 in the US, but the order of month, day, and year is ambiguous to others.
● 02-03-04 might be read as March 4, 2002 in other countries.
Use the ISO 8601 standard:
● YYYY-MM-DD 2020-05-28
● YYYY-MM-DD hh:mm:ss 2020-05-28 15:52:38
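A minimal sketch of normalizing dates to ISO 8601 in Python/pandas (the input strings are hypothetical; an explicit format string avoids the month/day ambiguity described above):

import pandas as pd

# Hypothetical column of US-style dates (month/day/year)
dates = pd.Series(["02/03/2004", "05/28/2020"])

# Parse with an explicit format so the month/day order is never guessed,
# then write the values back out in the ISO 8601 form YYYY-MM-DD
iso_dates = pd.to_datetime(dates, format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
print(iso_dates.tolist())  # ['2004-02-03', '2020-05-28']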
Best practice: One value per cell
Example: an experiment is replicated at three sites, with six plots per site. Record the site and the plot in separate columns rather than packing them into a single identifier.
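A minimal sketch of splitting a combined identifier into one value per cell with Python/pandas (the Location_ID values are hypothetical):

import pandas as pd

# Hypothetical compound identifiers that pack two values into one cell
df = pd.DataFrame({"Location_ID": ["SiteA_Plot1", "SiteA_Plot2", "SiteB_Plot1"]})

# Split into separate Site and Plot columns so each cell holds a single value
df[["Site", "Plot"]] = df["Location_ID"].str.split("_", expand=True)
print(df)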
Best practice: Missing value codes
● Differentiate between "0" and "no observation" (no empty cells)
● Possible values: -9999, NA, NULL, NaN, and others
● Explain the missing value code in the metadata
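A minimal sketch of applying one explicit missing-value code with Python/pandas (the file names are hypothetical, and whichever code is chosen should be documented in the metadata):

import pandas as pd

# Treat the documented code -9999 as missing when reading the archived CSV
df = pd.read_csv("FCE_SawgrassNPP_2019.csv", na_values=[-9999])

# When writing, fill every missing cell with the same documented code (no empty cells)
df.to_csv("FCE_SawgrassNPP_2019_clean.csv", index=False, na_rep="-9999")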
Best practice: Flag columns
Use a flag column paired with a data column to supply additional information about individual values (the slide's example dataset contains nitrate and ammonium concentrations in stream water).
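A minimal sketch of a value column paired with a flag column, in Python/pandas (the data, flag codes, and their meanings are hypothetical and would be defined in the metadata):

import pandas as pd

# Each measured variable gets a paired flag column that qualifies individual values
df = pd.DataFrame({
    "date": ["2020-05-28", "2020-05-29"],
    "nitrate_mg_per_L": [0.42, -9999],
    "nitrate_flag": ["", "M"],  # hypothetical codes: "" = no qualifier, "M" = missing sample
})
print(df)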
Best practice: QA/QC
Quality assurance: process-oriented
● Well-designed data sheet
● Training field technicians
Quality control: product-oriented (tests of data for quality)
● Consistent codes
● Consistent date formats
● more...
Best practice: QA/QC
● Range checks
● Sanity checks
● Duplicate observations
● Sensor drift
● Data spikes
● Comparison with nearby stations
● Graphing
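A minimal sketch of two of these quality-control checks in Python/pandas (the column names and the plausible range are hypothetical):

import pandas as pd

# Hypothetical observations to check
df = pd.DataFrame({
    "date": ["2020-05-28", "2020-05-28", "2020-05-29"],
    "site": ["A", "A", "A"],
    "air_temp_c": [21.5, 21.5, 99.0],
})

# Range check: flag values outside a plausible range for the variable
out_of_range = df[(df["air_temp_c"] < -40) | (df["air_temp_c"] > 50)]

# Duplicate observations: identical date/site rows that may be double entries
duplicates = df[df.duplicated(subset=["date", "site"], keep=False)]

print(out_of_range)
print(duplicates)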
Summary
● One header row with variable names.
● Descriptive and consistent names for variables (start with a letter; no spaces, symbols, or mathematical operators +,-,*,&,^; use underscores).
● Each variable one column; each cell one value.
● Each column should include values for a single variable.
● Each cell should include one value for one variable.
● Each column should include only a single type of data (character, numeric).
● Lines or rows of data should be complete, without empty cells.
● Flags or comments to qualify or describe data when needed to give meaning.
References
Cook et al. (2001) Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America 82(2): 138-141.
Broman, K. W. & Woo, K. H. (2018) Data Organization in Spreadsheets. The American Statistician 72(1): 2-10. DOI: 10.1080/00031305.2017.1375989.
Wickham, H. (2014) Tidy Data. Journal of Statistical Software 59: 1-23.


Editor's Notes

  1. Colin talked about how to organize data within a data package. Now I will talk about organizing and cleaning data in a dataset for the purpose of archiving.
  2. Our goal in structuring data for archiving is to store the data so that they can be used in automated ways, with minimal human intervention. We do this with attention to two qualities of the data: First, we want to create a meaningful data structure, and second, we want to compile error free data. With respect to the structure of the dataset, we are after what has been referred to in recent years as “tidy data”, a term used by the R community. Tidy data are structured to be easy to maintain and are also amenable to many different kinds of analyses. The definition of tidy data is simple: each column represents a variable, and each row represents an observation. Beyond tidying the structure of the dataset, it needs to be made as error-free as possible. This is where quality control comes in to play. QC involves examining the data to find inconsistencies in format or accuracy, and to identify unusual, out-of-range values, or detect sensor drift that has to be corrected for.
  3. I want to emphasize that the structure of data to be archived may differ from the way that you organize the data to understand it yourself, or for doing an analysis or generating graphs for a presentation. Spreadsheets, for instance, are frequently organized in a complex way, comprehensible by the “eye”, meaning they are structured to help the viewer understand the data. Archival formats, on the other hand, are optimized for machine-readability.
  4. Here’s an example of human-readable vs. archival-ready data. The dataset on the left contains precipitation data measured at 4 watersheds on every day of the year. The numbers in the table represent precipitation. This dataset is constructed in this way because it is easy to make a graph in Excel showing precipitation in each watershed plotted against the day of the year. This format is nice for humans to be able to read to make comparisons between watershed. But it’s not how we would archive the data. This table that is a “long” format is appropriate for archiving. Each variable occupies a separate column, and each observation is in a single row. This is a tidy format. This is also the format that a lot of software need the data to be in so that it can be readily analyzed.
  5. A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event. What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Beyond that you can see a lot of inconsistencies in these data.
  6. A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event. The first sampling was done…. What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Data structured like this are impossible for a computer to parse. Data structured like this cannot be imported into a program like R to analyze, either. Beyond that you can see a lot of inconsistencies in these data.
  7. These three mini-tables need to be combined into one data table for analysis and archiving. To do that, all dates will need to be in the same format so they can be easily machine-readable. They are all formatted differently.
  8. Codes are also applied inconsistently in these data. In the phenophase column, the technician is supposed to record the phenological stage of plants encountered in a plot. Plants can be scored as being in one of four conditions. Plants can have flowers, fruit, both, or just leaves. There should only be four codes used in the Phenophase column. Yet in this first minitable, there are six codes. This begs the question, are the codes Fr +Flwr and FF the same thing? A human can make interpretations, but a computer cannot. Codes should be used consistently. Here, you can see inconsistencies in codes used between mini-tables, also. FLWR is in uppercase letters in the second table on the right, while it is a mixture of upper and lower case letters in the first table. The computer will not know these are the same thing.
  9. You may receive a data set that contains both data and also some statistics calculated by the data provider. Statistics don’t belong in the table with the data. They are two different things.
  10. Similarly, there should only be one type of data entered into each column. A column should contain only text, numbers, or datetime-formatted data. In the first table, the cover column, which is a percent, should only contain numeric data, yet here it also contains a T. T may refer to trace, but a better practice would be to enter a very small percentage in this column, like 1 or 0.5. Similarly, symbols should not be entered into a numeric column. In the second table on the right, a "<5" has been entered in the numeric cover column. Excel won't know what to do with this text in a numeric column when doing calculations, and neither will other analytical programs. A better choice is to enter a small numeric value.
  11. Here I am starting to format the data to be tidy. I’ve combined the three mini-tables, but I’ve left some open cells because I think it’s understood that dates should fill down. The human understands, but the computer does not. If I were to sort the data on Species, take a look at what happens to observation 22. So, it is best when using Excel to fill every cell to avoid problems like this.
  12. To summarize, here is what the tidy phenology data should look like.
  13. There are other best practices for formatting data that I’ll talk about without reference to Excel.
  14. It is recommended to Use descriptive file names to help you and future users of the data quickly ascertain what is in the file. A bad file name ... Who knows if Excel and Word software will still be around to read their proprietary formats 100 years from now.
  15. Another best practice is to use a standard date format to avoid ambiguity about what a date or time refers to. For instance, 02-03-04 means …. So it is recommended to use a standard date format such as the ISO 8601 standard. This standard looks like this … YYYY-MM-DD …. This format is used commonly across data environments and data repositories. Data become easier to integrate if all sources are using the same date standard.
  16. Another best practice is that each cell of a dataset should contain only one piece of information. This avoids adding complexity when subsetting the data, analyzing it, or joining it with other data. Let's consider an example. Suppose you are doing a study on the effects of temperature and precipitation on plant growth in a desert. One might be tempted to create a complex identifier as shown here for Location_ID.
  17. So, suppose that you have blank cells in your spreadsheet. Data are missing. What should you do? Should you fill the cells in with zeros? No. Zero is different than no observation. Zero means something was looked for, and it wasn’t there. We recommend filling empty cells so it is clear that they aren’t a mistake, so a secondary user later on doesn’t wonder why those cells are empty.
  18. If you need to supply additional information about a data point, you can do so using flags, as shown here. This dataset contains Nitrate and ammonium concentrations in stream water.
  19. Once you have wrestled your data into a tidy form, there are other ways to improve the quality of the data through QC. What is the difference between QA and QC? Quality assurance is process-oriented. Quality assurance has been done by the PI who designed the datasheets for ease of data collection, who trained their technicians in species identification, and other process-oriented things that were done to ensure quality data were collected. Quality control refers to tests of the data for quality. We’ve already talked about some of the kinds of tests you can do, such as filtering for inconsistent codes, making sure dates are all in the same format, and removing other inconsistencies. There are lots of other quality control tests you may want to do.
  20. A tree this year has a diameter of 100 cm, but last year it had a diameter of 20 cm. This might be a data entry error or a measurement error.
  21. As you develop a plan for how you’re going to clean your datasets, you may want to refer to these characteristics.