This is module 5 in the EDI Data Publishing training course. In this module, you will learn how to properly format a data file for publishing in the EDI Repository.
2. Background
Clean data for analysis is often not equivalent to clean data for archiving. An
archive-ready dataset anticipates future use and revision. Furthermore, the
concept of “clean” varies with data type (e.g., table, image, vector, code).
3. Objectives
Discuss best practices for formatting a tabular dataset to make it ready to archive.
Identify activities associated with QA/QC.
4. Tabular data for archiving
The goal is to store the data so that they can be used in automated ways, with
minimal human intervention:
● Create meaningful data structure (tidy data)
○ Easy to maintain, analyze and reuse
○ Each column = a variable; each row = an observation
● Compile error free data
○ QA/QC
○ Check data for consistency of format and accuracy, impossible values, and
sensor drift
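The tidy-data rule above (each column = a variable; each row = an observation) can be sketched with pandas. The column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical tidy table: one observation per row, one variable per column.
tidy = pd.DataFrame({
    "date": ["2020-05-28", "2020-05-28", "2020-05-29"],
    "watershed": ["WS1", "WS2", "WS1"],
    "precip_mm": [4.2, 3.1, 0.0],
})
# Every row answers: what was measured (precip_mm), where (watershed), when (date).
print(tidy.shape)  # 3 observations x 3 variables
```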
5. Tabular data for archiving
… are often in a different form than data for analysis and presentation.
For example, spreadsheets are frequently organized in complex form,
comprehensible by the “eye”, or data are prepared as input to specialized software.
Archival formats require long-term readability by computers (a simple,
consistent format).
6. Precipitation in Four Watersheds by Date
Human-readable vs. archive-ready
[Figure: the same precipitation data shown side by side in “wide” (human-readable) and “long” (archive-ready) formats]
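The wide-to-long reshape illustrated on this slide can be sketched with pandas; the watershed names and precipitation values here are hypothetical:

```python
import pandas as pd

# Hypothetical "wide" precipitation table: one column per watershed,
# easy for a human to scan and graph in Excel.
wide = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02"],
    "WS1": [5.0, 0.0],
    "WS2": [4.1, 0.2],
    "WS3": [3.8, 0.0],
    "WS4": [6.2, 0.1],
})

# Reshape to the archive-ready "long" form: one row per (date, watershed) observation.
long = wide.melt(id_vars="date", var_name="watershed", value_name="precip_mm")
print(long.shape)  # 8 rows: 2 dates x 4 watersheds
```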
10. What’s wrong with this spreadsheet?
Codes are inconsistent! Plants have flowers, fruit, both, or just leaves, yet
more than four codes appear. Does Fr+Flwr mean the same as FF?
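Inconsistent codes like these can be surfaced programmatically. A minimal sketch with pandas, using hypothetical phenophase codes modeled on the slide:

```python
import pandas as pd

# Hypothetical phenophase column with inconsistent codes, as in the slide.
phenophase = pd.Series(["Flwr", "FLWR", "Fr", "Fr+Flwr", "FF", "Lf"])

# Listing distinct codes surfaces variants a human would merge but a computer cannot.
print(sorted(phenophase.unique()))

# Normalizing case catches some inconsistencies automatically...
print(sorted(phenophase.str.upper().unique()))
# ...but semantic duplicates like "FR+FLWR" vs "FF" still need a documented mapping.
```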
11. What’s wrong with this spreadsheet?
Summary information is mixed with raw data!
12. What’s wrong with this spreadsheet?
Text data is mixed
with numeric data
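Mixed text-and-numeric columns like this can be detected by attempting a numeric conversion. A sketch with pandas, using hypothetical cover values modeled on the slide:

```python
import pandas as pd

# Hypothetical cover column mixing numbers with text ("T" for trace, "<5"),
# as in the ugly spreadsheet.
cover = pd.Series(["12", "T", "3", "<5", "0.5"])

# Coercing to numeric turns non-numeric entries into NaN, flagging them for cleanup.
numeric = pd.to_numeric(cover, errors="coerce")
bad = cover[numeric.isna()]
print(list(bad))  # ['T', '<5'] -- the entries that block numeric analysis
```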
14. Tidy Phenology Data
● Each row = an observation
● Each column = a variable
● This file is easy to maintain and use
● Dates are computer-readable
● Structure is easy to describe in metadata
17. Best practices for tabular data
Some best practices for formatting tabular data:
● File names
● Column names
● Date and time formats
● One value per cell
● Missing value codes
● Flag columns
● Quality Assurance/Quality Control (QA/QC)
18. Best practice: File names
Use descriptive file names (what, where, when)
● Bad file name: PlotData.xlsx
● Good file name: FCE_SawgrassNPP_2019.xlsx
Store data in a non-proprietary format:
● Excel -> .csv
● Word -> .pdf
19. Best practice: Column names
● Single header row with column names
● Column names should start with a letter and should not include spaces or
symbols other than the underscore (e.g., soil_temperature)
● +,-,*,&,^ are often treated as operators and so should not be used in column
names
● Don’t include units or definition of the variable
Bad Column Name → Good Column Name
DOC Concentration (mg/ml) → DOC_Concentration
Fruit/Flower → FruitFlower or Fruit_Flower
Fine earth subsample mass, after oven-drying (g) → FineEarthSubMass
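The renaming pattern in this table can be automated. A minimal sketch (the function name and the exact cleanup rules are my own, not from the course):

```python
import re

def clean_column_name(name: str) -> str:
    """Make a column name machine-friendly: start with a letter,
    use underscores, drop parenthesized units and symbols."""
    name = re.sub(r"\(.*?\)", "", name)          # drop units like (mg/ml)
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)   # non-alphanumerics -> underscore
    name = name.strip("_")
    if name and not name[0].isalpha():           # must start with a letter
        name = "col_" + name
    return name

print(clean_column_name("DOC Concentration (mg/ml)"))  # DOC_Concentration
print(clean_column_name("Fruit/Flower"))               # Fruit_Flower
```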
20. Best practice: Date and time formats
● 02-03-04 means February 3, 2004 in the US, but the order of month, day, year
is ambiguous to others.
● 02-03-04 might look like March 4, 2002 in other countries.
ISO 8601 Standard:
● YYYY-MM-DD 2020-05-28
● YYYY-MM-DD hh:mm:ss 2020-05-28 15:52:38
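The ambiguity above disappears once dates are emitted in ISO 8601 form, as this small Python sketch shows (assuming the US month-day-year reading of the input):

```python
from datetime import datetime

# "02-03-04" is ambiguous; here we assume the US order (Feb 3, 2004) when parsing.
d = datetime.strptime("02-03-04", "%m-%d-%y")

# Emitting ISO 8601 removes the ambiguity for every future reader.
print(d.strftime("%Y-%m-%d"))            # 2004-02-03
print(d.strftime("%Y-%m-%d %H:%M:%S"))   # 2004-02-03 00:00:00
```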
21. Best practice: One value per cell
An experiment is replicated at three sites, with six plots per site
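When a cell holds a compound identifier (e.g., a site and a plot packed into one code), it can be split into one-value-per-cell columns. A sketch with pandas; the `Location_ID` values here are hypothetical:

```python
import pandas as pd

# Hypothetical compound identifier combining site and plot in one cell.
df = pd.DataFrame({"Location_ID": ["SiteA_1", "SiteA_2", "SiteB_1"]})

# One value per cell: split the compound code into its own columns,
# so the data can be subset or joined on site and plot independently.
df[["site", "plot"]] = df["Location_ID"].str.split("_", expand=True)
print(df[["site", "plot"]].iloc[0].tolist())  # ['SiteA', '1']
```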
22. Best practice: Missing value codes
● Differentiate between “0” and “no observation” (no empty cells)
● Possible values: -9999, NA, NULL, NaN and others
● Explain the missing value code in metadata
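A documented missing-value code can be declared when reading the file, so it becomes a proper NaN instead of a fake number. A sketch with pandas, using a hypothetical two-column CSV:

```python
import io
import pandas as pd

# Hypothetical CSV using -9999 as the documented missing-value code.
csv = io.StringIO("date,temp_c\n2020-05-28,21.4\n2020-05-29,-9999\n")

# Declaring the code keeps -9999 out of means, minima, and plots.
df = pd.read_csv(csv, na_values=["-9999"])
print(df["temp_c"].isna().sum())  # 1
```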
24. Best practice: QA/QC
Quality assurance: process-oriented
● Well-designed data sheet
● Training field technicians
Quality control: product-oriented (tests of data for quality)
● Consistent codes
● Consistent date formats
● more...
25. Best practice: QA/QC
● Range checks
● Sanity checks
● Duplicate observations
● Sensor drift
● Data spikes
● Comparison with nearby stations
● Graphing
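Two of these checks (range checks and data spikes) can be sketched in a few lines of pandas. The temperature values and thresholds here are hypothetical:

```python
import pandas as pd

# Hypothetical temperature series with a spike and an impossible value.
temp_c = pd.Series([21.1, 21.3, 85.0, 21.2, -200.0, 21.0])

# Range check: flag values outside plausible physical limits.
out_of_range = (temp_c < -60) | (temp_c > 60)

# Spike check: flag jumps larger than a chosen threshold between readings
# (note this also flags the reading right after a spike).
spikes = temp_c.diff().abs() > 20

print(list(temp_c[out_of_range | spikes]))
```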
26. Summary
● One header row with variable names.
● Descriptive and consistent names for variables (start with a letter; no
spaces, symbols, or mathematical operators +,-,*,&,^; use underscores).
● Each column should include values for a single variable.
● Each cell should include one value for one variable.
● Each column should include only a single type of data (character or numeric).
● Rows of data should be complete, without empty cells.
● Add flags or comments to qualify or describe data when needed to give meaning.
27. References
Cook et al. (2001). Best Practices for Preparing Ecological Data Sets to Share
and Archive. Bulletin of the Ecological Society of America, 82(2), 138–141.
Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The
American Statistician, 72(1), 2–10. DOI: 10.1080/00031305.2017.1375989.
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59, 1–23.
Editor's Notes
Colin talked about how to organize data within a data package. Now I will talk about organizing and cleaning data in a dataset for the purpose of archiving.
Our goal in structuring data for archiving is to store the data so that they can be used in automated ways, with minimal human intervention. We do this with attention to two qualities of the data: First, we want to create a meaningful data structure, and second, we want to compile error free data. With respect to the structure of the dataset, we are after what has been referred to in recent years as “tidy data”, a term used by the R community. Tidy data are structured to be easy to maintain and are also amenable to many different kinds of analyses. The definition of tidy data is simple: each column represents a variable, and each row represents an observation.
Beyond tidying the structure of the dataset, it needs to be made as error-free as possible. This is where quality control comes into play. QC involves examining the data to find inconsistencies in format or accuracy, to identify unusual, out-of-range values, and to detect sensor drift that has to be corrected for.
I want to emphasize that the structure of data to be archived may differ from the way that you organize the data to understand it yourself, or for doing an analysis or generating graphs for a presentation. Spreadsheets, for instance, are frequently organized in a complex way, comprehensible by the “eye”, meaning they are structured to help the viewer understand the data. Archival formats, on the other hand, are optimized for machine-readability.
Here’s an example of human-readable vs. archive-ready data. The dataset on the left contains precipitation data measured at 4 watersheds on every day of the year. The numbers in the table represent precipitation. This dataset is constructed in this way because it is easy to make a graph in Excel showing precipitation in each watershed plotted against the day of the year. This format is nice for humans to read and to make comparisons between watersheds. But it’s not how we would archive the data.
This table, in a “long” format, is appropriate for archiving. Each variable occupies a separate column, and each observation is in a single row. This is a tidy format. It is also the format that a lot of software needs the data to be in so that they can be readily analyzed.
A lot of data entry and management happens in Excel files. There is a lot you can do with Excel to control data as it gets entered, so it’s a fine tool if used properly. However, I’ll show you a really ugly spreadsheet in order to highlight the kinds of issues you may run into if you are asked to archive a dataset from Excel, and also to provide examples of practices that should be avoided. So this is my UglyData.xlsx. These are data from a phenology study. Phenology refers to the timing of life cycle events of plants and animals. In this case, they are plant data so life cycle events include when the plant flowers, when it fruits, when it is vegetative or only has leaves, and so on. Each mini-table represents a sampling event.
What’s wrong with this spreadsheet? First of all, there should be one table per spreadsheet, not a bunch of mini-tables like this. Data structured like this are impossible for a computer to parse. Data structured like this cannot be imported into a program like R to analyze, either. Beyond that you can see a lot of inconsistencies in these data.
These three mini-tables need to be combined into one data table for analysis and archiving. To do that, all dates will need to be in the same format so they can be easily machine-readable. They are all formatted differently.
Codes are also applied inconsistently in these data. In the phenophase column, the technician is supposed to record the phenological stage of plants encountered in a plot. Plants can be scored as being in one of four conditions: they can have flowers, fruit, both, or just leaves. There should only be four codes used in the Phenophase column, yet in this first mini-table there are six. This raises the question: are the codes Fr+Flwr and FF the same thing? A human can make interpretations, but a computer cannot. Codes should be used consistently. Here, you can also see inconsistencies in codes used between mini-tables. FLWR is in uppercase letters in the second table on the right, while it is a mixture of upper- and lowercase letters in the first table. The computer will not know these are the same thing.
You may receive a data set that contains both data and also some statistics calculated by the data provider. Statistics don’t belong in the table with the data. They are two different things.
Similarly, there should only be one type of data entered into each column. A column should contain only text, numbers, or datetime-formatted data. In the first table, the cover column, which is a percent, should only contain numeric data, yet here it also contains a T. T may refer to trace, but a better practice would be to enter a very small percentage in this column, like 1 or 0.5. Symbols should not be entered into a numeric column either. In the second table on the right, “<5” has been entered in the numeric cover column. Excel won’t know what to do with this text in a numeric column when doing calculations, and neither will other analytical programs. A better choice is to enter a small numeric value.
Here I am starting to format the data to be tidy. I’ve combined the three mini-tables, but I’ve left some open cells because I think it’s understood that dates should fill down. The human understands, but the computer does not. If I were to sort the data on Species, take a look at what happens to observation 22.
So, it is best when using Excel to fill every cell to avoid problems like this.
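The fill-down problem described above can also be repaired programmatically before sorting. A sketch with pandas; the species codes and dates here are hypothetical:

```python
import pandas as pd

# Hypothetical combined table where the date was only entered on the first row
# of each sampling event (the blank cells "fill down" only in a human's eye).
df = pd.DataFrame({
    "date": ["2020-05-28", None, None, "2020-06-15"],
    "species": ["ARTR", "BOGR", "POSE", "ARTR"],
})

# Fill every cell explicitly, so each observation keeps its date after sorting.
df["date"] = df["date"].ffill()
df = df.sort_values("species")
print(df["date"].isna().sum())  # 0
```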
To summarize, here is what the tidy phenology data should look like.
There are other best practices for formatting data that I’ll talk about without reference to Excel.
It is recommended to use descriptive file names to help you and future users of the data quickly ascertain what is in the file. A bad file name ...
Who knows if Excel and Word software will still be around to read their proprietary formats 100 years from now.
Another best practice is to use a standard date format to avoid ambiguity about what date time refers to. For instance, 02-03-04 means …. So it is recommended to use a standard date format such as the ISO 8601 standard. This standard looks like this … YYYY-MM-DD …. This format is used commonly across data environments and data repositories. Data become easier to integrate if all sources are using the same date standard.
Another best practice is that each cell of a dataset should contain only one piece of information. This avoids adding complexity when subsetting the data, analyzing it, or joining it with other data. Let’s consider an example. Suppose you are doing a study on the effects of temperature and precipitation on plant growth in a desert.
One might be tempted to create a complex identifier as shown here for Location_ID
So, suppose that you have blank cells in your spreadsheet. Data are missing. What should you do? Should you fill the cells in with zeros? No. Zero is different than no observation. Zero means something was looked for, and it wasn’t there. We recommend filling empty cells so it is clear that they aren’t a mistake, so a secondary user later on doesn’t wonder why those cells are empty.
If you need to supply additional information about a data point, you can do so using flags, as shown here. This dataset contains Nitrate and ammonium concentrations in stream water.
Once you have wrestled your data into a tidy form, there are other ways to improve the quality of the data through QC. What is the difference between QA and QC? Quality assurance is process-oriented. Quality assurance has been done by the PI who designed the datasheets for ease of data collection, who trained their technicians in species identification, and other process-oriented things that were done to ensure quality data were collected. Quality control refers to tests of the data for quality. We’ve already talked about some of the kinds of tests you can do, such as filtering for inconsistent codes, making sure dates are all in the same format, and removing other inconsistencies. There are lots of other quality control tests you may want to do.
A tree this year has a diameter of 100 cm, but last year it had a diameter of 20 cm. Might be a data entry error or measuring error.
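The tree-diameter sanity check described here is easy to automate. A sketch with pandas; the tree IDs, column names, and the 5 cm growth threshold are hypothetical:

```python
import pandas as pd

# Hypothetical repeat measurements of tree diameter at breast height (cm).
trees = pd.DataFrame({
    "tree_id": [101, 102, 103],
    "dbh_2022": [20.0, 35.5, 12.3],
    "dbh_2023": [100.0, 36.1, 12.9],
})

# Sanity check: a tree should not grow (or shrink) implausibly between surveys.
growth = trees["dbh_2023"] - trees["dbh_2022"]
suspect = trees.loc[growth.abs() > 5, "tree_id"]
print(list(suspect))  # [101] -- likely a data entry or measurement error
```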
As you develop a plan for how you’re going to clean your datasets, you may want to refer to these characteristics.