This is module 4 in the EDI Data Publishing training course. In this module, you will learn how to group your data files and other information products into a publishable unit.
2. Background
Well-organized data publications optimize understanding and reuse. Beyond an
organization scheme that meets your immediate needs, you’ll generally want to
publish data as interoperable “data packages” that can be combined in unforeseen
ways to answer future scientific questions. This optimization can be challenging and
requires your expertise and discretion.
3. Objectives
Become familiar with factors underlying data package organization.
Be confident in resolving competing factors when necessary.
Create a plan to organize your data for publication.
4. What is a data package?
Data Package (noun): an assemblage of science metadata and one or more science
data objects; data packages include a quality report object and are described by
package metadata called a “resource map” (i.e. manifest)
[Diagram: 1. Science Metadata + 2. Science Data + 3. Quality Report + Resource Map = Data Package. “YOU are responsible for this.”]
5. What is a data package?
Data packages are:
● Immutable - so data and metadata are trustworthy (e.g. to repeat an analysis)
● Versionable - so data can be updated and previous versions still available
● Citable - assigned a Digital Object Identifier (DOI) for each new package or
revision
6. Creating data packages
Generally, you want to publish the minimally processed data first then build derived
products from it. This creates a basis for building derived products in an
interoperable manner through “provenance” links.
[Figure: network of interoperable data packages linked by provenance, building over time.]
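As an illustration, a derived package’s metadata can point back to its source package through a provenance link. A minimal sketch in Python; the dictionary keys and identifiers are invented for illustration, not the repository’s actual metadata schema:

```python
# Hypothetical provenance record for a derived data package.
# Keys and identifiers are illustrative, not a real metadata schema.
source_package = {"id": "edi.source.1", "title": "Minimally processed sensor data"}

derived_package = {
    "id": "edi.derived.1",
    "title": "Daily means derived from sensor data",
    "provenance": {
        "derived_from": source_package["id"],  # the provenance link
        "process": "aggregation of minute observations to daily means",
    },
}
```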
7. Using data packages
[Figure: network of interoperable data packages linked by provenance, building over time.]
11. Exercise!
Imagine a massive dataset under collection for the next several years where
physical, chemical, and biological parameters are sampled from a hundred paired
land-lake ecosystems at temporal frequencies ranging from minute to month. These
data are to be published quarterly for use in unknown scientific inquiries.
What factors would you apply to organize these data into discrete interoperable
units to support the wide breadth of science?
12. Factors to consider when organizing data
● Duplication - The data are an exact copy or subset of an existing dataset
● Value - The current value and estimated future value of the data
● Relation - The degree to which one data object is related to another
● Theme - The type of observation collected
● Methodology - How the data were created
● Location - Where the data were collected
● Volume - The data size
● Processing level - Degree of modification from the original data
● Collection status - Expected sampling and publication frequency
● Temporal frequency - The rate of sampling (or should it be rate of publication?)
● Structure - Data values and the relationships among them
● Scientific domain - A combination of scientific inquiry and measurement type
● File format - The data encoding
13. Duplication
Have these exact data been published elsewhere? If so, don’t republish.
Duplicates create maintenance issues and confusion for users.
Exception:
● Mutable data licensed under the public domain (example). Consult with data providers before publishing.
Note: Derived data are not exact copies and can be archived (example).
14. Value
What is the current value and estimated future value of the data? How likely are
they to be reused? Believe it or not, data are not inherently valuable.
Some attributes contributing to value:
● Long-term observation
● Large spatial coverage
● Observation of rare events
● High quality measurements
● Integratable with similar data
● Unanswered questions remain
● New data for old hypothesis
When in doubt … archive it!
15. Theme
What thematic category do the data belong to (e.g. biological, chemical, physical)?
Themes are often nested within each other and can overlap. Distinctly different
themes should be published separately (example-1, example-2, example-3).
Exception:
● Group themes when it improves discovery, understanding, and use (example).
16. Methodology
What methods were used to create the data? Data collected with identical methods are
published together (example), and data from different methods are published separately
(example-1, example-2). A change in methods can alter the accuracy, precision, etc. of a
time series and how the data relate, but splitting into separate packages makes the data
harder to find.
Exceptions:
● Grouping identical methods creates data that are too large for upload/download.
● Metadata and structure clearly communicate differences in methods.
17. Relation
What other data (or objects) are closely related, or required for understanding?
Adding these facilitates discovery and increases the possibility of reuse. When
related data can’t be included, use unique keywords to form a “collection” or links in
the metadata.
Examples:
● Site information
● Environmental characteristics
● Experimental layout
● Software, programs, scripts
● Instrument calibration reports
● Sampling apparatus schematic
18. Volume
Is the data volume large? Big data is slow to upload for publishing and slow to
download for use. Consider breaking the data up into smaller, more manageable
units (example-1, example-2, example-3).
19. Location
Were the data collected in the same physical location? If so, they may belong
together. Grouping by location can improve discovery of related data.
Exceptions:
● Many different methods at the same location “clutter” organization and understanding
● Size and structure may be good reasons to separate (example-1, example-2)
20. Processing level
What level of processing has been applied to the data? Different levels should be
separated.
Exception:
● Data structure allows multiple levels (example)
Note: Always archive the “minimally processed” data so future users can apply methods
appropriate to their research. Preserve the original data and append flag columns to
communicate known and potential issues (example).
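The flag-column idea can be sketched in a few lines of Python. The column name, threshold values, and flag labels below are invented for illustration:

```python
def flag_out_of_range(rows, column, lo, hi):
    """Append a quality-flag column instead of altering original values,
    so future users can apply QA/QC appropriate to their research."""
    flagged = []
    for row in rows:
        flag = "OK" if lo <= row[column] <= hi else "SUSPECT"
        flagged.append({**row, f"{column}_flag": flag})  # original value preserved
    return flagged

obs = [{"temp_c": 4.2}, {"temp_c": -40.0}]
flagged = flag_out_of_range(obs, "temp_c", -5.0, 40.0)
```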
21. Collection status
Is data collection complete or ongoing? If ongoing, append the new data to the time
series (example). Also consider data structures whose attributes won’t change when
new observations are added, to simplify versioning (example).
Exception:
● When size prohibits upload/download, consider grouping the time series by a temporal unit (e.g. year; example).
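One way to keep attributes stable while appending new observations is a “long” layout with a fixed set of columns. A sketch using only the Python standard library; the column names are invented:

```python
import csv
import io

FIELDS = ["site", "timestamp", "value"]  # fixed schema: columns never change

def append_observations(existing_csv, new_rows):
    """Append new observations to an ongoing time series without changing
    its attributes (columns), which simplifies versioned updates."""
    rows = list(csv.DictReader(io.StringIO(existing_csv)))
    rows += [{k: str(r[k]) for k in FIELDS} for r in new_rows]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```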
22. Temporal frequency
What are the sampling and publication frequencies? Different sampling frequencies
should be separated for understanding. Different publication frequencies should be
organized into separate packages to effectively communicate updates and to reduce
repository storage costs.
Exception:
● Providing downsampled versions of common frequencies can simplify use (example).
Note: Publication should not be more frequent than quarterly. Please consult EDI if you’re
an exception.
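Producing a downsampled version can be as simple as grouping observations by a coarser time unit. A stdlib-only sketch, assuming ISO 8601 timestamps already sorted in time order (the data are invented):

```python
from itertools import groupby
from statistics import mean

def downsample_hourly(samples):
    """Downsample sorted (timestamp, value) minute data to hourly means.
    Timestamps are ISO 8601 strings; the hour is the first 13 characters
    ('YYYY-MM-DDTHH')."""
    return [
        (hour, mean(v for _, v in group))
        for hour, group in groupby(samples, key=lambda s: s[0][:13])
    ]

minute_data = [
    ("2024-01-01T00:00", 1.0),
    ("2024-01-01T00:30", 3.0),
    ("2024-01-01T01:00", 5.0),
]
hourly = downsample_hourly(minute_data)  # [('2024-01-01T00', 2.0), ('2024-01-01T01', 5.0)]
```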
23. Structure
How are the data structured? If they can’t be easily combined, publish them as
separate data objects or packages. “Wide” tables (example) allow for the most
metadata detail but lack the flexibility of “long” tables (example).
Note: Databases are not very useful “as is”. It’s better to organize the contents into views,
export as .csv tables, and publish as separate (but related) packages.
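The wide-versus-long trade-off can be illustrated by melting a wide table into long form. A stdlib-only sketch; the site, date, and measurement column names are invented:

```python
def wide_to_long(rows, id_cols, value_cols):
    """Melt a 'wide' table (one column per measurement) into a 'long'
    table (one measurement per row), which is easier to extend."""
    long_rows = []
    for row in rows:
        for var in value_cols:
            record = {c: row[c] for c in id_cols}
            record["variable"] = var
            record["value"] = row[var]
            long_rows.append(record)
    return long_rows

wide = [
    {"site": "L01", "date": "2024-01-01", "temp_c": 4.2, "do_mgl": 9.1},
    {"site": "L02", "date": "2024-01-01", "temp_c": 3.8, "do_mgl": 8.7},
]
long_form = wide_to_long(wide, ["site", "date"], ["temp_c", "do_mgl"])  # 4 rows
```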
24. Scientific domain
Is there a preferred data format within your scientific domain? If so, consider using it.
Domain formats simplify integration with similar data and often have support
infrastructure built around them.
Some domain formats:
● Community survey data
● Meteorology
Best Practices for data packages, selected scientific domains
25. File format
What file format are the data encoded in? Similar formats should be grouped
together. Multiple formats are welcome when serving different use cases.
Proprietary and unusual formats should be converted to open and common formats
to promote access.
26. Decision trees
We’ve summarized these “factors to consider when organizing data” into three
decision trees focused on answering the questions:
● Should these data be archived?
● Do these data belong in the same data package?
● Do these data belong in the same data object?
While an oversimplification, the decision trees are a helpful starting place.
27. Should these data be archived?
[Decision tree: Duplicate? Yes → Don’t archive. No → Valuable? Yes → Archive. No → Don’t archive.]
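The first decision tree reduces to a short function; a sketch of the slide’s logic (the parameter names are invented):

```python
def should_archive(is_duplicate, is_valuable):
    """Decision tree 1: archive only data that are not duplicates of an
    existing dataset and that have current or estimated future value.
    (When in doubt about value, the module advises archiving.)"""
    if is_duplicate:
        return False
    return is_valuable
```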
28. Do these data belong in the same package?
[Decision tree: a chain of yes/no questions (same theme? same location? same or similar methods? same processing level? same collection status? related? big data?) whose answers lead to “Same package” or “Different package”.]
29. Do these data belong in the same object?
[Decision tree: Same temporal frequency (rate of sampling)? No → Different object. Yes → Same structure? Yes → Same object. No → Different object.]
30. Summary
Always publish the minimally processed data so future users can apply the processing
methods required by their research. Beyond this, consider publishing your data in a
way that optimizes reuse.
There are many factors to consider when organizing data into publishable units,
including: duplication, value, theme, methodology, location, relation, collection
status, temporal frequency, file format, structure, processing level, and scientific
domain.
Competing factors can be resolved by applying your scientific expertise and your
perspective as a data user; do what works best for you and your community.
Editor's Notes
Image: Network of interoperable units.
Image: Interoperability
The first couple inform whether the data are worth publishing at all.
The remainder inform whether the data belong in the same package and same object
Warning: Errors introduced when archiving an existing data source will propagate error through downstream use.
Determines whether the data belong in the same or different data objects
Determines whether the data belong in the same or different data objects
Image: Pay to access information. Inaccessible information.