Creating Clean Data for Publishing
(Phase 2)
Background
Clean data for analysis is often not equivalent to clean data for archiving. An
archive-ready dataset has features that support future use and revision.
Furthermore, the concept of clean varies with data type (e.g., table, image, vector,
code).
Objectives
Discuss best practices for formatting a tabular dataset to make it ready to archive.
Identify activities associated with QA/QC.
Tabular data for archiving
The goal is to store the data so that they can be used in automated ways, with minimal
human intervention:
● Create a meaningful data structure (tidy data)
○ Easy to maintain, analyze, and reuse
○ Each column = a variable; each row = an observation
● Compile error-free data
○ QA/QC
○ Check for inconsistencies in format or accuracy, impossible values, and sensor
drift
Tabular data for archiving
Data for archiving are often in a different form than data for analysis and presentation.
For example, spreadsheets are frequently organized in complex forms that are
comprehensible by eye, or data are prepared as input to specialized software.
Archival formats require long-term readability by computers (a simple, consistent
format).
Human-readable vs. archive-ready
[Figure: Precipitation in Four Watersheds by Date, shown in “wide” format and in “long” format]
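The difference between the two layouts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' workflow: the watershed names (`WS1` to `WS4`), the dates, the precipitation values, and the function name `wide_to_long` are all hypothetical.

```python
# Hypothetical "wide" table: one row per date, one precipitation column per watershed.
wide_rows = [
    {"date": "2020-05-27", "WS1": 0.0, "WS2": 1.2, "WS3": 0.4, "WS4": 0.0},
    {"date": "2020-05-28", "WS1": 3.1, "WS2": 2.8, "WS3": 3.5, "WS4": 2.9},
]


def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row, so each row is one observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows


long_rows = wide_to_long(wide_rows, "date", "watershed", "precipitation")
for row in long_rows:
    print(row)
```

Each of the two wide rows becomes four long rows, one per watershed, so the variables date, watershed, and precipitation each occupy a single column.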
What’s wrong with this spreadsheet?
● Mini-tables and data have inconsistencies
● Date formats differ
● Codes are inconsistent: plants have flowers, fruit, both, or just leaves. Does
Fr+Flwr mean the same as FF?
● Summary information is mixed with raw data
● Text data is mixed with numeric data
● Sort on Species
Tidy Phenology Data
● Each row = an observation
● Each column = a variable
● This file is easy to maintain and use
● Dates are computer-readable
● Structure is easy to describe in metadata
How would you make these data tidy?
Add a Site column!
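As a rough sketch of what adding a Site column can look like in practice, the snippet below stacks separate per-site tables into one table. The site names, the `plot` and `biomass` columns, and the function name are hypothetical; the slide's actual data are not available here.

```python
# Hypothetical per-site tables, as they might appear in separate sheets or mini-tables.
site_tables = {
    "SiteA": [{"plot": 1, "biomass": 12.5}, {"plot": 2, "biomass": 9.8}],
    "SiteB": [{"plot": 1, "biomass": 15.1}],
}


def combine_with_site_column(tables):
    """Stack per-site tables into one table, recording the site in every row."""
    combined = []
    for site, rows in tables.items():
        for row in rows:
            combined.append({"site": site, **row})
    return combined


tidy_rows = combine_with_site_column(site_tables)
for row in tidy_rows:
    print(row)
```

The site moves from the table's title or sheet name into the data itself, so a single file can hold all sites.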
Best practices for tabular data
Some best practices for formatting tabular data:
● File names
● Column names
● Date and time formats
● One value per cell
● Missing value codes
● Flag columns
● Quality Assurance/Quality Control (QA/QC)
Best practice: File names
Use descriptive file names (what, where, when)
● Bad file name: PlotData.xlsx
● Good file name: FCE_SawgrassNPP_2019.xlsx
Store data in a non-proprietary format:
● Excel -> .csv
● Word -> .pdf
Best practice: Column names
● Single header row with column names
● Column names should start with a letter and not include spaces or symbols
other than the underscore (e.g., soil_temperature)
● +, -, *, &, ^ are often treated as operators and so should not be used in column
names
● Don’t include units or the definition of the variable in the column name

Bad Column Name | Good Column Name
DOC Concentration (mg/ml) | DOC_Concentration
Fruit/Flower | FruitFlower or Fruit_Flower
Fine earth subsample mass, after oven-drying (g) | FineEarthSubMass
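The naming rules above can be approximated with a small helper. This is a sketch under assumptions, not a standard tool: the function name `clean_column_name` and the `x_` prefix for names that do not start with a letter are choices made for illustration. It does not abbreviate long names the way the hand-written FineEarthSubMass example does.

```python
import re


def clean_column_name(raw_name):
    """Return a name that starts with a letter and uses only letters, digits, underscores."""
    # Drop a trailing parenthetical such as "(mg/ml)"; units belong in the metadata.
    name = re.sub(r"\s*\([^)]*\)\s*$", "", raw_name)
    # Replace each run of other characters (spaces, /, +, -, commas) with one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")
    # Assumed convention: prefix names that are empty or do not start with a letter.
    if not name or not name[0].isalpha():
        name = "x_" + name
    return name


print(clean_column_name("DOC Concentration (mg/ml)"))  # DOC_Concentration
print(clean_column_name("Fruit/Flower"))               # Fruit_Flower
```

The units removed from the name still need to be recorded in the metadata.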
Best practice: Date and time formats
● 02-03-04 means February 3, 2004 in the US, but the order of month, day, and year
is ambiguous to readers elsewhere.
● Elsewhere, 02-03-04 could be read as 2 March 2004 (day-month-year) or as
March 4, 2002 (year-month-day).
ISO 8601 Standard:
● YYYY-MM-DD, e.g., 2020-05-28
● YYYY-MM-DD hh:mm:ss, e.g., 2020-05-28 15:52:38
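A minimal sketch of converting ambiguous dates to the ISO 8601 style shown above, using Python's standard `datetime` module. The input formats are assumptions: the first function only works if the source is known to be US-style month-day-year, and the `MM/DD/YYYY hh:mm:ss` timestamp layout is hypothetical.

```python
from datetime import datetime


def us_date_to_iso(text):
    """Parse a US-style MM-DD-YY date and return it as YYYY-MM-DD."""
    return datetime.strptime(text, "%m-%d-%y").date().isoformat()


def timestamp_to_iso(text):
    """Parse a hypothetical 'MM/DD/YYYY hh:mm:ss' timestamp; return 'YYYY-MM-DD hh:mm:ss'."""
    parsed = datetime.strptime(text, "%m/%d/%Y %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(us_date_to_iso("02-03-04"))               # 2004-02-03
print(timestamp_to_iso("05/28/2020 15:52:38"))  # 2020-05-28 15:52:38
```

The conversion is only as good as the knowledge of the original format, so the source convention is worth confirming with whoever recorded the data.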
Best practice: One value per cell
An experiment is replicated at three sites, with six plots per site
Best practice: Missing value codes
● Differentiate between “0” and “no observation” (no empty cells)
● Possible values: -9999, NA, NULL, NaN and others
● Explain the missing value code in metadata
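The snippet below sketches why the distinction matters, assuming `-9999` is the documented missing value code. The CSV content, the `stem_count` column, and the function name are hypothetical.

```python
import csv
import io

MISSING_CODE = "-9999"  # assumed code; whichever code is used must be explained in metadata

# Hypothetical CSV text: plot 2 was not observed; plot 3 was observed and the count was 0.
csv_text = "plot,stem_count\n1,14\n2,-9999\n3,0\n"


def read_counts(text, missing_code=MISSING_CODE):
    """Return {plot: count}, mapping the missing value code to None and keeping real zeros."""
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        value = row["stem_count"]
        counts[row["plot"]] = None if value == missing_code else int(value)
    return counts


counts = read_counts(csv_text)
observed = [value for value in counts.values() if value is not None]
mean_count = sum(observed) / len(observed)
print(counts)
print(mean_count)
```

The true zero for plot 3 contributes to the mean, while the unobserved plot 2 is left out rather than being treated as 0 or as -9999.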
Best practice: Flag columns
Best practice: QA/QC
Quality assurance: process-oriented
● Well-designed data sheet
● Training field technicians
Quality control: product-oriented (tests of data for quality)
● Consistent codes
● Consistent date formats
● Range checks
● Sanity checks
● Duplicate observations
● Sensor drift
● Data spikes
● Comparison with nearby stations
● Graphing
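Three of the checks listed above (range checks, duplicate observations, and data spikes) can be sketched as follows. The records, the `air_temp` column, and the thresholds are hypothetical; suitable limits depend on the variable and the site.

```python
# Hypothetical daily air temperature records in degrees Celsius.
records = [
    {"date": "2020-05-26", "air_temp": 18.2},
    {"date": "2020-05-27", "air_temp": 19.0},
    {"date": "2020-05-27", "air_temp": 19.0},   # duplicate observation
    {"date": "2020-05-28", "air_temp": 45.5},   # sudden jump
    {"date": "2020-05-29", "air_temp": 120.0},  # impossible value
]


def range_check(rows, column, low, high):
    """Return rows whose value falls outside the closed interval [low, high]."""
    return [row for row in rows if not (low <= row[column] <= high)]


def find_duplicates(rows, key_columns):
    """Return rows whose key columns repeat an earlier row."""
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            duplicates.append(row)
        seen.add(key)
    return duplicates


def find_spikes(rows, column, max_step):
    """Return rows whose value differs from the previous row by more than max_step."""
    return [cur for prev, cur in zip(rows, rows[1:])
            if abs(cur[column] - prev[column]) > max_step]


out_of_range = range_check(records, "air_temp", -50.0, 60.0)
duplicates = find_duplicates(records, ["date"])
spikes = find_spikes(records, "air_temp", 15.0)
print(out_of_range, duplicates, spikes, sep="\n")
```

Rows returned by these checks are candidates for review or flagging, not automatic deletion. Sensor drift and comparison with nearby stations need reference data and are not covered by this sketch.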
Summary
● One header row with variable names.
● Descriptive and consistent variable names: start with a letter, use underscores
instead of spaces or symbols, and avoid operators such as +, -, *, &, ^.
● Each column should include values for a single variable.
● Each cell should include one value for one variable.
● Each column should include only a single type of data (character or numeric).
● Rows of data should be complete, without empty cells.
● Use flags or comments to qualify or describe data when needed to give meaning.
References
Cook et al. (2001) Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America, 82(2): 138-141.
Broman, Karl W. & Woo, Kara H. (2018) Data Organization in Spreadsheets. The American Statistician, 72(1): 2-10. DOI: 10.1080/00031305.2017.1375989.
Wickham, Hadley (2014) Tidy Data. Journal of Statistical Software, 59: 1-23.

EDI Training Module 5: Creating Clean Data for Publishing


Editor's Notes

  1. Colin talked about how to organize data within a data package. Now I will talk about organizing and cleaning data in a dataset for the purpose of archiving.
  2. Our goal in structuring data for archiving is to store the data so that they can be used in automated ways, with minimal human intervention. We do this with attention to two qualities of the data: first, we want to create a meaningful data structure, and second, we want to compile error-free data. With respect to the structure of the dataset, we are after what has been referred to in recent years as “tidy data”, a term used by the R community. Tidy data are structured to be easy to maintain and are also amenable to many different kinds of analyses. The definition of tidy data is simple: each column represents a variable, and each row represents an observation. Beyond tidying the structure of the dataset, the data need to be made as error-free as possible. This is where quality control comes into play. QC involves examining the data to find inconsistencies in format or accuracy, to identify unusual or out-of-range values, and to detect sensor drift that has to be corrected.
  3. I want to emphasize that the structure of data to be archived may differ from the way that you organize the data to understand it yourself, or for doing an analysis or generating graphs for a presentation. Spreadsheets, for instance, are frequently organized in a complex way, comprehensible by the “eye”, meaning they are structured to help the viewer understand the data. Archival formats, on the other hand, are optimized for machine-readability.
  4. Here’s an example of human-readable vs. archive-ready data. The dataset on the left contains precipitation data measured at 4 watersheds on every day of the year. The numbers in the table represent precipitation. This dataset is constructed in this way because it is easy to make a graph in Excel showing precipitation in each watershed plotted against the day of the year. This format is nice for humans to read and to make comparisons between watersheds. But it’s not how we would archive the data. The table in “long” format is appropriate for archiving. Each variable occupies a separate column, and each observation is in a single row. This is a tidy format. This is also the format that a lot of software needs the data to be in so that they can be readily analyzed.
  5. A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event. What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Beyond that you can see a lot of inconsistencies in these data.
  6. A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event. The first sampling was done…. What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Data structured like this are impossible for a computer to parse. Data structured like this cannot be imported into a program like R to analyze, either. Beyond that you can see a lot of inconsistencies in these data.
  7. These three mini-tables need to be combined into one data table for analysis and archiving. To do that, all dates will need to be in the same format so that they are easily machine-readable. Right now, the dates in each mini-table are formatted differently.
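One way to bring mixed date formats into a single format is to try a short list of known patterns and write each date out as YYYY-MM-DD. The sketch below assumes the input formats listed in `KNOWN_FORMATS`; the formats actually present in the spreadsheet may differ, and the list is only a guess.

```python
from datetime import datetime

# Assumed input formats, tried in this order. Month names ("%b", "%B")
# are matched using the current locale (English by default).
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y"]


def to_iso_date(text):
    """Return the date as a YYYY-MM-DD string, or raise ValueError."""
    text = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # this pattern did not match; try the next one
    raise ValueError(f"Unrecognized date format: {text!r}")


print(to_iso_date("5/14/2019"))
print(to_iso_date("14-May-2019"))
```

Raising an error on unrecognized input, rather than guessing, leaves the decision to a person. Note that the code cannot tell day-first from month-first numeric dates; that has to be known from the data provider.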
  8. Codes are also applied inconsistently in these data. In the Phenophase column, the technician is supposed to record the phenological stage of plants encountered in a plot. Plants can be scored as being in one of four conditions: they can have flowers, fruit, both, or just leaves. There should only be four codes used in the Phenophase column. Yet in this first mini-table, there are six codes. This raises the question: are the codes Fr+Flwr and FF the same thing? A human can make interpretations, but a computer cannot. Codes should be used consistently. Here, you can also see inconsistencies in the codes used between mini-tables. FLWR is in uppercase letters in the second table on the right, while it is a mixture of uppercase and lowercase letters in the first table. The computer will not know these are the same thing.
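Once a person has decided which spellings mean the same thing, the decision can be recorded in a lookup table and applied uniformly. The sketch below is only an illustration: the canonical codes, the `LVS` code for leaves, and above all the choice to treat `FF` as equal to `Fr+Flwr` are assumptions that would need to be confirmed with the data provider.

```python
# Hypothetical lookup table. Whether "FF" really means "Fr+Flwr" is an
# assumption that the data provider must confirm.
CANONICAL_CODES = {
    "flwr": "FLWR",
    "fr": "FR",
    "fr+flwr": "FR+FLWR",
    "ff": "FR+FLWR",
    "lvs": "LVS",
}


def normalize_code(raw):
    """Map a raw phenophase entry to its canonical code."""
    key = raw.replace(" ", "").lower()  # ignore spacing and letter case
    if key not in CANONICAL_CODES:
        raise KeyError(f"Unknown phenophase code: {raw!r}")
    return CANONICAL_CODES[key]


print(normalize_code("Fr +Flwr"))
print(normalize_code("Flwr"))
```

Unknown codes raise an error instead of passing through silently, so new spellings are noticed rather than archived.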
  9. You may receive a data set that contains both data and also some statistics calculated by the data provider. Statistics don’t belong in the table with the data. They are two different things.
  10. Similarly, there should only be one type of data entered into each column. A column should contain only text, only numbers, or only datetime-formatted data. In the first table, the cover column, which is a percent, should only contain numeric data, yet here it also contains a T. T may refer to trace, but a better practice would be to enter a very small percentage in this column, like 1 or 0.5. Likewise, symbols should not be entered into a numeric column. In the second table on the right, “<5” (less than 5) has been entered in the numeric cover column. Excel won’t know what to do with this text in a numeric column when doing calculations, and neither will other analytical programs. A better choice is to enter a small numeric value.
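Before choosing replacement values, it helps to find every entry in the cover column that is not a plain number. The following sketch separates each raw entry into a numeric value and a leftover text flag; the function name and the idea of returning a pair are illustrative choices, not part of the original workflow.

```python
def parse_cover(raw):
    """Return (value, flag). value is None when the entry is not a
    plain number; flag then holds the original text for review."""
    text = raw.strip()
    try:
        return float(text), ""
    except ValueError:
        return None, text


for entry in ["12", "T", "<5"]:
    print(entry, "->", parse_cover(entry))
```

Entries that come back with a value of `None` are the ones a person needs to resolve, for example by agreeing on a small numeric value for trace cover.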
  11. Here I am starting to format the data to be tidy. I’ve combined the three mini-tables, but I’ve left some open cells because I think it’s understood that dates should fill down. The human understands, but the computer does not. If I were to sort the data on Species, take a look at what happens to observation 22. So, it is best when using Excel to fill every cell to avoid problems like this.
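Filling blank cells downward can be automated, as long as it is done before any sorting and only where a blank truly means "same as the row above". A minimal sketch, assuming each row is a dictionary and blank cells are empty strings or `None`:

```python
def fill_down(rows, column):
    """Copy the last seen value of `column` into blank cells below it."""
    filled = []
    last_seen = None
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            if last_seen is None:
                raise ValueError(f"First row has no value for {column!r}")
            value = last_seen
        last_seen = value
        filled.append({**row, column: value})  # leave the input rows untouched
    return filled


# Hypothetical rows, for illustration only.
rows = [
    {"date": "2019-05-14", "species": "A"},
    {"date": "", "species": "B"},
    {"date": "2019-06-02", "species": "A"},
]
filled_rows = fill_down(rows, "date")
for record in filled_rows:
    print(record)
```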
  12. To summarize, here is what the tidy phenology data should look like.
  13. There are other best practices for formatting data that I’ll talk about without reference to Excel.
  14. It is recommended to use descriptive file names to help you and future users of the data quickly ascertain what is in the file. A bad file name ... Who knows if Excel and Word software will still be around to read their proprietary formats 100 years from now.
  15. Another best practice is to use a standard date format to avoid ambiguity about what a date or time refers to. For instance, 02-03-04 means …. So it is recommended to use a standard date format such as the ISO 8601 standard. This standard looks like this … YYYY-MM-DD …. This format is used commonly across data environments and data repositories. Data become easier to integrate if all sources are using the same date standard.
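A simple QC check can confirm that every date string in a column follows the YYYY-MM-DD pattern and is a real calendar date. This is a minimal sketch covering only the date part of ISO 8601, not times or time zones:

```python
import re
from datetime import datetime

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
ISO_DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(text):
    """True only for zero-padded YYYY-MM-DD strings that are valid dates."""
    if not ISO_DATE_PATTERN.fullmatch(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")  # rejects e.g. February 30
    except ValueError:
        return False
    return True


for candidate in ["2004-02-03", "02-03-04", "2004-02-30"]:
    print(candidate, is_iso_date(candidate))
```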
  16. Another best practice is that each cell of a dataset should contain only one piece of information. This is to avoid adding complexity when subsetting the data, analyzing it or joining it with other data. Let’s consider this with an example. Suppose you are doing a study on effects of temperature and precipitation on plant growth in a desert. One might be tempted to create a complex identifier as shown here for Location_ID.
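To show why a compound identifier adds work, here is a sketch that splits one back into separate columns. The identifier layout used here, `<temperature treatment>_<precipitation treatment>_<plot>`, is entirely hypothetical and is not the Location_ID shown on the slide.

```python
def split_location_id(location_id):
    """Split a hypothetical 'temperature_precipitation_plot' identifier
    into one value per column."""
    temperature, precipitation, plot = location_id.split("_")
    return {
        "temperature_treatment": temperature,
        "precipitation_treatment": precipitation,
        "plot": plot,
    }


print(split_location_id("warm_dry_03"))
```

If the three pieces are stored in their own columns from the start, subsetting by treatment needs no parsing step at all.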
  17. So, suppose that you have blank cells in your spreadsheet. Data are missing. What should you do? Should you fill the cells in with zeros? No. Zero is different than no observation. Zero means something was looked for, and it wasn’t there. We recommend filling empty cells so it is clear that they aren’t a mistake, so a secondary user later on doesn’t wonder why those cells are empty.
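The difference between zero and a missing observation shows up as soon as a summary is computed. The sketch below assumes `NA` as the missing-value code; that code is an assumption for illustration, and the counts are made up.

```python
MISSING = "NA"  # assumed missing-value code, not prescribed by the source


def mean_ignoring_missing(values):
    """Average the observed values, skipping missing-value codes."""
    observed = [float(v) for v in values if v != MISSING]
    if not observed:
        return None  # nothing was observed, so there is no mean
    return sum(observed) / len(observed)


with_missing = ["4", "0", "NA", "2"]  # third visit: no observation made
with_zero = ["4", "0", "0", "2"]      # third visit: looked, found nothing

print(mean_ignoring_missing(with_missing))
print(mean_ignoring_missing(with_zero))
```

The two lists give different means, which is why a missing observation should never be recorded as zero.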
  18. If you need to supply additional information about a data point, you can do so using flags, as shown here. This dataset contains nitrate and ammonium concentrations in stream water.
  19. Once you have wrestled your data into a tidy form, there are other ways to improve the quality of the data through QC. What is the difference between QA and QC? Quality assurance is process-oriented. Quality assurance has been done by the PI who designed the datasheets for ease of data collection, who trained their technicians in species identification, and other process-oriented things that were done to ensure quality data were collected. Quality control refers to tests of the data for quality. We’ve already talked about some of the kinds of tests you can do, such as filtering for inconsistent codes, making sure dates are all in the same format, and removing other inconsistencies. There are lots of other quality control tests you may want to do.
  20. A tree this year has a diameter of 100 cm, but last year it had a diameter of 20 cm. This might be a data entry error or a measurement error.
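A check like this can be written as a small test that flags implausible changes between two measurements of the same tree. The threshold `max_ratio=1.5` is a hypothetical value chosen for illustration; a sensible limit depends on the species and the measurement interval.

```python
def flag_suspect_growth(previous_cm, current_cm, max_ratio=1.5):
    """Return True when the change in diameter looks implausible.

    max_ratio is a hypothetical threshold: the diameter may grow or
    shrink by at most this factor before the pair is flagged.
    """
    if previous_cm <= 0:
        raise ValueError("previous diameter must be positive")
    ratio = current_cm / previous_cm
    return ratio > max_ratio or ratio < 1 / max_ratio


print(flag_suspect_growth(20, 100))  # the example above
print(flag_suspect_growth(20, 21))
```

A flagged value is a prompt to go back to the field sheet, not proof of an error.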
  21. As you develop a plan for how you’re going to clean your datasets, you may want to refer to these characteristics.