This is module 5 in the EDI Data Publishing training course. In this module, you will learn how to properly format a data file for publishing in the EDI Repository.
2. Background
Clean data for analysis is often not equivalent to clean data for archiving. An
archive-ready dataset anticipates future use and revision. Furthermore, the
concept of “clean” varies with data type (e.g., table, image, vector, code).
3. Objectives
Discuss best practices for formatting a tabular dataset to make it ready to archive.
Identify activities associated with QA/QC.
4. Tabular data for archiving
The goal is to store the data so that they can be used in automated ways, with
minimal human intervention:
● Create meaningful data structure (tidy data)
○ Easy to maintain, analyze and reuse
○ Each column = a variable; each row = an observation
● Compile error free data
○ QA/QC
○ Check data for consistency of format and accuracy, impossible values, and
sensor drift
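The tidy-data rule above (each column = a variable; each row = an observation) can be sketched with pandas. The column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical tidy table: one observation per row, one variable per column.
tidy = pd.DataFrame({
    "date": ["2020-05-28", "2020-05-28", "2020-05-29"],
    "watershed": ["WS1", "WS2", "WS1"],
    "precip_mm": [4.2, 3.1, 0.0],
})
# Every row answers: what was measured (precip_mm), where (watershed), when (date).
print(tidy.shape)  # 3 observations x 3 variables
```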
5. Tabular data for archiving
… are often in a different form than data for analysis and presentation.
For example, spreadsheets are frequently organized in complex form,
comprehensible by the “eye”, or data are prepared as input to specialized software.
Archival formats require long-term readability by computers (a simple,
consistent format).
6. Precipitation in Four Watersheds by Date
Human-readable vs. archive-ready
[Figure: the same precipitation data shown side by side in “wide” (human-readable) and “long” (archive-ready) formats]
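The wide-to-long reshape illustrated on this slide can be sketched with pandas; the watershed names and precipitation values here are hypothetical:

```python
import pandas as pd

# Hypothetical "wide" precipitation table: one column per watershed,
# easy for a human to scan and graph in Excel.
wide = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02"],
    "WS1": [5.0, 0.0],
    "WS2": [4.1, 0.2],
    "WS3": [3.8, 0.0],
    "WS4": [6.2, 0.1],
})

# Reshape to the archive-ready "long" form: one row per (date, watershed) observation.
long = wide.melt(id_vars="date", var_name="watershed", value_name="precip_mm")
print(long.shape)  # 8 rows: 2 dates x 4 watersheds
```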
10. What’s wrong with this spreadsheet?
Codes are inconsistent! Plants have flowers, fruit, both, or just leaves, yet
more than four codes appear. Does Fr+Flwr mean the same as FF?
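Inconsistent codes like these can be surfaced programmatically. A minimal sketch with pandas, using hypothetical phenophase codes modeled on the slide:

```python
import pandas as pd

# Hypothetical phenophase column with inconsistent codes, as in the slide.
phenophase = pd.Series(["Flwr", "FLWR", "Fr", "Fr+Flwr", "FF", "Lf"])

# Listing distinct codes surfaces variants a human would merge but a computer cannot.
print(sorted(phenophase.unique()))

# Normalizing case catches some inconsistencies automatically...
print(sorted(phenophase.str.upper().unique()))
# ...but semantic duplicates like "FR+FLWR" vs "FF" still need a documented mapping.
```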
11. What’s wrong with this spreadsheet?
Summary information is mixed with raw data!
12. What’s wrong with this spreadsheet?
Text data is mixed
with numeric data
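Mixed text-and-numeric columns like this can be detected by attempting a numeric conversion. A sketch with pandas, using hypothetical cover values modeled on the slide:

```python
import pandas as pd

# Hypothetical cover column mixing numbers with text ("T" for trace, "<5"),
# as in the ugly spreadsheet.
cover = pd.Series(["12", "T", "3", "<5", "0.5"])

# Coercing to numeric turns non-numeric entries into NaN, flagging them for cleanup.
numeric = pd.to_numeric(cover, errors="coerce")
bad = cover[numeric.isna()]
print(list(bad))  # ['T', '<5'] -- the entries that block numeric analysis
```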
14. Tidy Phenology Data
● Each row = an observation
● Each column = a variable
● This file is easy to maintain and use
● Dates are computer-readable
● Structure is easy to describe in metadata
17. Best practices for tabular data
Some best practices for formatting tabular data:
● File names
● Column names
● Date and time formats
● One value per cell
● Missing value codes
● Flag columns
● Quality Assurance/Quality Control (QA/QC)
18. Best practice: File names
Use descriptive file names (what, where, when)
● Bad file name: PlotData.xlsx
● Good file name: FCE_SawgrassNPP_2019.xlsx
Store data in a non-proprietary format:
● Excel -> .csv
● Word -> .pdf
19. Best practice: Column names
● Single header row with column names
● Column names should start with a letter and should not include spaces or
symbols other than the underscore (e.g., soil_temperature)
● +,-,*,&,^ are often treated as operators and so should not be used in column
names
● Don’t include units or definition of the variable
Bad Column Name → Good Column Name
DOC Concentration (mg/ml) → DOC_Concentration
Fruit/Flower → FruitFlower or Fruit_Flower
Fine earth subsample mass, after oven-drying (g) → FineEarthSubMass
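The renaming pattern in this table can be automated. A minimal sketch (the function name and the exact cleanup rules are my own, not from the course):

```python
import re

def clean_column_name(name: str) -> str:
    """Make a column name machine-friendly: start with a letter,
    use underscores, drop parenthesized units and symbols."""
    name = re.sub(r"\(.*?\)", "", name)          # drop units like (mg/ml)
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)   # non-alphanumerics -> underscore
    name = name.strip("_")
    if name and not name[0].isalpha():           # must start with a letter
        name = "col_" + name
    return name

print(clean_column_name("DOC Concentration (mg/ml)"))  # DOC_Concentration
print(clean_column_name("Fruit/Flower"))               # Fruit_Flower
```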
20. Best practice: Date and time formats
● 02-03-04 means February 3, 2004 in the US, but the order of month, day, year
is ambiguous to others.
● 02-03-04 might look like March 4, 2002 in other countries.
ISO 8601 Standard:
● YYYY-MM-DD 2020-05-28
● YYYY-MM-DD hh:mm:ss 2020-05-28 15:52:38
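The ambiguity above disappears once dates are emitted in ISO 8601 form, as this small Python sketch shows (assuming the US month-day-year reading of the input):

```python
from datetime import datetime

# "02-03-04" is ambiguous; here we assume the US order (Feb 3, 2004) when parsing.
d = datetime.strptime("02-03-04", "%m-%d-%y")

# Emitting ISO 8601 removes the ambiguity for every future reader.
print(d.strftime("%Y-%m-%d"))            # 2004-02-03
print(d.strftime("%Y-%m-%d %H:%M:%S"))   # 2004-02-03 00:00:00
```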
21. Best practice: One value per cell
An experiment is replicated at three sites, with six plots per site
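When a cell holds a compound identifier (e.g., a site and a plot packed into one code), it can be split into one-value-per-cell columns. A sketch with pandas; the `Location_ID` values here are hypothetical:

```python
import pandas as pd

# Hypothetical compound identifier combining site and plot in one cell.
df = pd.DataFrame({"Location_ID": ["SiteA_1", "SiteA_2", "SiteB_1"]})

# One value per cell: split the compound code into its own columns,
# so the data can be subset or joined on site and plot independently.
df[["site", "plot"]] = df["Location_ID"].str.split("_", expand=True)
print(df[["site", "plot"]].iloc[0].tolist())  # ['SiteA', '1']
```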
22. Best practice: Missing value codes
● Differentiate between “0” and “no observation” (no empty cells)
● Possible values: -9999, NA, NULL, NaN and others
● Explain the missing value code in metadata
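A documented missing-value code can be declared when reading the file, so it becomes a proper NaN instead of a fake number. A sketch with pandas, using a hypothetical two-column CSV:

```python
import io
import pandas as pd

# Hypothetical CSV using -9999 as the documented missing-value code.
csv = io.StringIO("date,temp_c\n2020-05-28,21.4\n2020-05-29,-9999\n")

# Declaring the code keeps -9999 out of means, minima, and plots.
df = pd.read_csv(csv, na_values=["-9999"])
print(df["temp_c"].isna().sum())  # 1
```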
24. Best practice: QA/QC
Quality assurance: process-oriented
● Well-designed data sheet
● Training field technicians
Quality control: product-oriented (tests of data for quality)
● Consistent codes
● Consistent date formats
● more...
25. Best practice: QA/QC
● Range checks
● Sanity checks
● Duplicate observations
● Sensor drift
● Data spikes
● Comparison with nearby stations
● Graphing
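Two of these checks (range checks and data spikes) can be sketched in a few lines of pandas. The temperature values and thresholds here are hypothetical:

```python
import pandas as pd

# Hypothetical temperature series with a spike and an impossible value.
temp_c = pd.Series([21.1, 21.3, 85.0, 21.2, -200.0, 21.0])

# Range check: flag values outside plausible physical limits.
out_of_range = (temp_c < -60) | (temp_c > 60)

# Spike check: flag jumps larger than a chosen threshold between readings
# (note this also flags the reading right after a spike).
spikes = temp_c.diff().abs() > 20

print(list(temp_c[out_of_range | spikes]))
```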
26. Summary
● One header row with variable names.
● Descriptive and consistent names for variables (start with a letter; no
spaces, symbols, or mathematical operators +,-,*,&,^; use underscores).
● Each column should include values for a single variable.
● Each cell should include one value for one variable.
● Each column should include only a single type of data (character or numeric).
● Rows of data should be complete, without empty cells.
● Add flags or comments to qualify or describe data when needed to give meaning.
27. References
Cook et al. (2001). Best Practices for Preparing Ecological Data Sets to Share
and Archive. Bulletin of the Ecological Society of America, 82(2), 138–141.
Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The
American Statistician, 72(1), 2–10. DOI: 10.1080/00031305.2017.1375989.
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59, 1–23.
Editor's Notes
Colin talked about how to organize data within a data package. Now I will talk about organizing and cleaning data in a dataset for the purpose of archiving.
Our goal in structuring data for archiving is to store the data so that they can be used in automated ways, with minimal human intervention. We do this with attention to two qualities of the data: First, we want to create a meaningful data structure, and second, we want to compile error free data. With respect to the structure of the dataset, we are after what has been referred to in recent years as “tidy data”, a term used by the R community. Tidy data are structured to be easy to maintain and are also amenable to many different kinds of analyses. The definition of tidy data is simple: each column represents a variable, and each row represents an observation.
Beyond tidying the structure of the dataset, it needs to be made as error-free as possible. This is where quality control comes into play. QC involves examining the data to find inconsistencies in format or accuracy, to identify unusual, out-of-range values, and to detect sensor drift that has to be corrected for.
I want to emphasize that the structure of data to be archived may differ from the way that you organize the data to understand it yourself, or for doing an analysis or generating graphs for a presentation. Spreadsheets, for instance, are frequently organized in a complex way, comprehensible by the “eye”, meaning they are structured to help the viewer understand the data. Archival formats, on the other hand, are optimized for machine-readability.
Here’s an example of human-readable vs. archive-ready data. The dataset on the left contains precipitation data measured at 4 watersheds on every day of the year. The numbers in the table represent precipitation. This dataset is constructed in this way because it is easy to make a graph in Excel showing precipitation in each watershed plotted against the day of the year. This format is nice for humans to read and to make comparisons between watersheds. But it’s not how we would archive the data.
This table, in a “long” format, is appropriate for archiving. Each variable occupies a separate column, and each observation is in a single row. This is a tidy format. It is also the format that a lot of software needs the data to be in so that they can be readily analyzed.
A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event.
What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Data structured like this are impossible for a computer to parse. Data structured like this cannot be imported into a program like R to analyze, either. Beyond that you can see a lot of inconsistencies in these data.
These three mini-tables need to be combined into one data table for analysis and archiving. To do that, all dates will need to be in the same format so they can be easily machine-readable. They are all formatted differently.
Codes are also applied inconsistently in these data. In the phenophase column, the technician is supposed to record the phenological stage of plants encountered in a plot. Plants can be scored as being in one of four conditions: they can have flowers, fruit, both, or just leaves. There should only be four codes used in the Phenophase column, yet in this first mini-table there are six. This raises the question: are the codes Fr+Flwr and FF the same thing? A human can make interpretations, but a computer cannot. Codes should be used consistently. Here, you can also see inconsistencies in codes used between mini-tables. FLWR is in uppercase letters in the second table on the right, while it is a mixture of upper- and lowercase letters in the first table. The computer will not know these are the same thing.
You may receive a data set that contains both data and also some statistics calculated by the data provider. Statistics don’t belong in the table with the data. They are two different things.
Similarly, there should only be one type of data entered into each column. A column should contain only text, numbers, or datetime-formatted data. In the first table, the cover column, which is a percent, should only contain numeric data, yet here it also contains a T. T may refer to trace, but a better practice would be to enter a very small percentage in this column, like 1 or 0.5. Symbols should not be entered into a numeric column either. In the second table on the right, “<5” has been entered in the numeric cover column. Excel won’t know what to do with this text in a numeric column when doing calculations, and neither will other analytical programs. A better choice is to enter a small numeric value.
Here I am starting to format the data to be tidy. I’ve combined the three mini-tables, but I’ve left some open cells because I think it’s understood that dates should fill down. The human understands, but the computer does not. If I were to sort the data on Species, take a look at what happens to observation 22.
So, it is best when using Excel to fill every cell to avoid problems like this.
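The fill-down problem described above can also be repaired programmatically before sorting. A sketch with pandas; the species codes and dates here are hypothetical:

```python
import pandas as pd

# Hypothetical combined table where the date was only entered on the first row
# of each sampling event (the blank cells "fill down" only in a human's eye).
df = pd.DataFrame({
    "date": ["2020-05-28", None, None, "2020-06-15"],
    "species": ["ARTR", "BOGR", "POSE", "ARTR"],
})

# Fill every cell explicitly, so each observation keeps its date after sorting.
df["date"] = df["date"].ffill()
df = df.sort_values("species")
print(df["date"].isna().sum())  # 0
```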
To summarize, here is what the tidy phenology data should look like.
There are other best practices for formatting data that I’ll talk about without reference to Excel.
It is recommended to use descriptive file names to help you and future users of the data quickly ascertain what is in the file. A bad file name ...
Who knows if Excel and Word software will still be around to read their proprietary formats 100 years from now.
Another best practice is to use a standard date format to avoid ambiguity about what date time refers to. For instance, 02-03-04 means …. So it is recommended to use a standard date format such as the ISO 8601 standard. This standard looks like this … YYYY-MM-DD …. This format is used commonly across data environments and data repositories. Data become easier to integrate if all sources are using the same date standard.
Another best practice is that each cell of a dataset should contain only one piece of information. This avoids adding complexity when subsetting the data, analyzing it, or joining it with other data. Let’s consider an example. Suppose you are doing a study on the effects of temperature and precipitation on plant growth in a desert.
One might be tempted to create a complex identifier as shown here for Location_ID
So, suppose that you have blank cells in your spreadsheet. Data are missing. What should you do? Should you fill the cells in with zeros? No. Zero is different than no observation. Zero means something was looked for, and it wasn’t there. We recommend filling empty cells so it is clear that they aren’t a mistake, so a secondary user later on doesn’t wonder why those cells are empty.
If you need to supply additional information about a data point, you can do so using flags, as shown here. This dataset contains Nitrate and ammonium concentrations in stream water.
Once you have wrestled your data into a tidy form, there are other ways to improve the quality of the data through QC. What is the difference between QA and QC? Quality assurance is process-oriented. Quality assurance has been done by the PI who designed the datasheets for ease of data collection, who trained their technicians in species identification, and other process-oriented things that were done to ensure quality data were collected. Quality control refers to tests of the data for quality. We’ve already talked about some of the kinds of tests you can do, such as filtering for inconsistent codes, making sure dates are all in the same format, and removing other inconsistencies. There are lots of other quality control tests you may want to do.
A tree this year has a diameter of 100 cm, but last year it had a diameter of 20 cm. Might be a data entry error or measuring error.
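The tree-diameter sanity check described here is easy to automate. A sketch with pandas; the tree IDs, column names, and the 5 cm growth threshold are hypothetical:

```python
import pandas as pd

# Hypothetical repeat measurements of tree diameter at breast height (cm).
trees = pd.DataFrame({
    "tree_id": [101, 102, 103],
    "dbh_2022": [20.0, 35.5, 12.3],
    "dbh_2023": [100.0, 36.1, 12.9],
})

# Sanity check: a tree should not grow (or shrink) implausibly between surveys.
growth = trees["dbh_2023"] - trees["dbh_2022"]
suspect = trees.loc[growth.abs() > 5, "tree_id"]
print(list(suspect))  # [101] -- likely a data entry or measurement error
```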
As you develop a plan for how you’re going to clean your datasets, you may want to refer to these characteristics.