Organizing Data into Publishable Units
(Phase 1)
Background
Well-organized data publications optimize understanding and reuse. Beyond an
organization scheme that meets your immediate needs, you’ll generally want to
publish data as interoperable “data packages” that can be combined in unforeseen
ways to answer future scientific questions. This optimization can be challenging and
requires your expertise and discretion.
Objectives
Become familiar with factors underlying data package organization.
Be confident in resolving competing factors when necessary.
Create a plan to organize your data for publication.
What is a data package?
Data Package (noun): an assemblage of science metadata and one or more science
data objects; data packages include a quality report object and are described by
package metadata called a “resource map” (i.e. manifest)
[Figure: Science Metadata + Science Data + Quality Report + Resource Map = Data Package. Callout: “YOU are responsible for this.”]
What is a data package?
Data packages are:
● Immutable - so data and metadata are trustworthy (e.g. to repeat an analysis)
● Versionable - so data can be updated while previous versions remain available
● Citable - each new package or revision is assigned a Digital Object Identifier (DOI)
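As a rough illustration of "immutable" and "versionable", here is a minimal Python sketch. The class name, field names, and identifiers are hypothetical and are not part of any repository's API; the point is only that a revision creates a new package and leaves the earlier one untouched.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DataPackage:
    identifier: str                # hypothetical package ID, not a real DOI
    revision: int
    science_metadata: str          # e.g. file name of the metadata document
    quality_report: str            # e.g. file name of the quality report
    data_objects: Tuple[str, ...]  # a tuple, so the object list cannot be mutated

    def resource_map(self) -> Tuple[str, ...]:
        """Manifest: every object that belongs to this package revision."""
        return (self.science_metadata, self.quality_report, *self.data_objects)

    def revise(self, *new_objects: str) -> "DataPackage":
        """Return a NEW revision; the existing revision stays as it was."""
        return replace(
            self,
            revision=self.revision + 1,
            data_objects=self.data_objects + new_objects,
        )


v1 = DataPackage("example.1", 1, "metadata.xml", "report.xml", ("lake_temp.csv",))
v2 = v1.revise("lake_chem.csv")  # v1 is still available and unchanged
```

In this sketch, trying to assign to a field of `v1` raises an error, which is the behavior that lets a user trust that a cited revision has not changed.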
Creating data packages
Generally, you want to publish the minimally processed data first, then build derived
products from it. This creates a basis for building derived products in an
interoperable manner through “provenance” links.
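A provenance link can be pictured as a pointer from a derived package back to the package it was built from. The sketch below is an assumption for illustration only: the package identifiers are made up, and a real repository would record these links in the package metadata rather than in a Python dictionary.

```python
# Hypothetical package identifiers. Each key maps to the package it was
# derived from; None marks a minimally processed ("source") package.
derived_from = {
    "raw_sensor.1": None,
    "hourly_means.1": "raw_sensor.1",
    "annual_summary.1": "hourly_means.1",
}


def lineage(package_id, links):
    """Follow provenance links back to the minimally processed package."""
    chain = [package_id]
    while links[chain[-1]] is not None:
        chain.append(links[chain[-1]])
    return chain


print(lineage("annual_summary.1", derived_from))
# ['annual_summary.1', 'hourly_means.1', 'raw_sensor.1']
```

Publishing the minimally processed package first means every derived product has something to point back to.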
Exercise!
Imagine a massive dataset under collection for the next several years where
physical, chemical, and biological parameters are sampled from a hundred paired
land-lake ecosystems at temporal frequencies ranging from minute to month. These
data are to be published quarterly for use in unknown scientific inquiries.
What factors would you apply to organize these data into discrete interoperable
units to support the wide breadth of science?
Factors to consider when organizing data
Duplication - The data are an exact copy or subset of an existing dataset
Value - The current value and estimated future value of the data
Relation - The degree to which one data object is related to another
Theme - The type of observation collected
Methodology - How the data were created
Location - Where the data were collected
Volume - The data size
Processing level - Degree of modification from the original data
Collection status - Whether data collection is complete or ongoing
Temporal frequency - The rates of sampling and publication
Structure - Data values and the relationships among them
Scientific domain - A combination of scientific inquiry and measurement type
File format - The data encoding
Duplication
Have these exact data been published elsewhere? If so then don’t republish.
Duplicates create maintenance issues and confusion for users.
Exception:
● Mutable data released into the public domain (example). Consult with data providers before publishing.
Note: Derived data are not exact copies and can be archived (example).
Value
What is the current value and estimated future value of the data? How likely are
they to be reused? Believe it or not, data are not inherently valuable.
Some attributes contributing to value:
● Long-term observation
● Large spatial coverage
● Observation of rare events
● High quality measurements
● Can be integrated with similar data
● Unanswered questions remain
● New data for an old hypothesis
When in doubt … archive it!
Theme
What thematic category do the data belong to (e.g. biological, chemical, physical)?
Themes are often nested within each other and can overlap. Distinctly different
themes should be published separately (example-1, example-2, example-3).
Exception:
● Group themes when it improves discovery, understanding, and use (example).
Methodology
What methods were used to create the data? Data created with identical methods are
published together (example) and data created with different methods are published
separately (example-1, example-2). A change in methods can alter the accuracy,
precision, etc. of a time series and how the data relate to one another, but splitting
the data into separate packages makes them harder to find.
Exceptions:
● Grouping data with identical methods creates packages that are too large for upload/download.
● Metadata and structure clearly communicate differences in methods.
Relation
What other data (or objects) are closely related, or required for understanding?
Adding these facilitates discovery and increases the possibility of reuse. When
related data can’t be included, use unique keywords to form a “collection”, or add
links in the metadata.
Examples:
● Site information
● Environmental characteristics
● Experimental layout
● Software, programs, scripts
● Instrument calibration reports
● Sampling apparatus schematic
Volume
Is the data volume large? Big data are slow to upload for publishing and slow to
download for use. Consider breaking the data up into smaller, more manageable
units (example-1, example-2, example-3).
Location
Were the data collected in the same physical location? If so they may belong
together. Grouping by location can improve discovery of related data.
Exceptions:
● Many different methods at the same location “clutter” organization and understanding.
● Size and structure may be good reasons to separate (example-1, example-2).
Processing level
What level of processing has been applied to the data? Different levels should be
separated.
Exception:
● Data structure allows multiple levels (example)
Note: Always archive the “minimally processed” data so future users can apply methods
appropriate to their research. Preserve the original data and append flag columns to
communicate known and potential issues (example).
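To make the flag-column idea concrete, here is a small Python sketch. The column names, the `-999` missing-value code, and the flag code `M` are assumptions chosen for illustration; the original values are left exactly as recorded and the flag is appended alongside them.

```python
import csv
import io

# Hypothetical minimally processed data; -999 is an assumed missing-value code.
raw = (
    "timestamp,temp_c\n"
    "2020-01-01T00:00,4.1\n"
    "2020-01-01T00:01,-999\n"
    "2020-01-01T00:02,4.3\n"
)

MISSING_CODE = "-999"


def add_flags(csv_text):
    """Append a flag column without altering the original values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # "M" (assumed code) marks a known issue; empty means no known issue.
        row["temp_c_flag"] = "M" if row["temp_c"] == MISSING_CODE else ""
    return rows


flagged = add_flags(raw)
```

A future user can then decide how to treat flagged rows instead of inheriting a decision that was made for them.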
Collection status
Is data collection complete or ongoing? If ongoing, append the new data to the time
series (example). Also consider data structures whose attributes won’t change when
new observations are added to simplify revisioning (example).
Exception:
● When size prohibits upload/download, consider grouping the time series by a
temporal unit (e.g. year; example).
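Grouping a time series by a temporal unit could look like the sketch below. It assumes ISO 8601 timestamps, so the first four characters are the year; the sample observations are made up.

```python
from collections import defaultdict

# Hypothetical observations: (ISO 8601 timestamp, value).
observations = [
    ("2019-12-31T23:59", 3.9),
    ("2020-01-01T00:00", 4.1),
    ("2020-06-15T12:00", 18.2),
]


def split_by_year(rows):
    """Group rows by calendar year so each year can become its own unit."""
    by_year = defaultdict(list)
    for timestamp, value in rows:
        by_year[timestamp[:4]].append((timestamp, value))
    return dict(by_year)


parts = split_by_year(observations)
```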
Temporal frequency
What are the sampling and publication frequencies? Different sampling frequencies
should be separated for understanding. Different publication frequencies should be
organized into separate packages to effectively communicate updates and to reduce
repository storage costs.
Exception:
● Providing downsampled versions of common frequencies can simplify use (example).
Note: Publication should not be more frequent than quarterly. Please consult EDI if you’re
an exception.
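A downsampled version might be produced along the lines of this sketch, which averages sub-hourly values into hourly means. The sample data and the choice of the mean as the aggregate are assumptions; the appropriate aggregate depends on the measurement.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sub-hourly observations: (ISO 8601 timestamp, value).
minute_data = [
    ("2020-01-01T00:00", 4.0),
    ("2020-01-01T00:30", 6.0),
    ("2020-01-01T01:15", 5.0),
]


def hourly_means(rows):
    """Average all values that fall within the same clock hour."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        buckets[timestamp[:13]].append(value)  # key is "YYYY-MM-DDTHH"
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


hourly = hourly_means(minute_data)
```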
Structure
How are the data structured? If they can’t be easily combined then publish as
separate data objects or packages. “Wide” tables (example) allow for the most
metadata detail but lack the flexibility of “long” tables (long).
Note: Databases are not very useful “as is”. It’s better to organize the contents into views,
export as .csv tables, and publish as separate (but related) packages.
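The difference between “wide” and “long” tables can be seen in a short sketch. The column names and values below are invented for illustration: in the wide form each variable has its own column, and in the long form each row holds one variable-value pair.

```python
# Hypothetical "wide" table: one column per measured variable.
wide = [
    {"date": "2020-01-01", "site": "A", "temp_c": 4.1, "ph": 7.2},
    {"date": "2020-01-02", "site": "A", "temp_c": 4.3, "ph": 7.1},
]


def to_long(rows, id_columns):
    """Reshape wide rows into one row per (identifier columns, variable, value)."""
    long_rows = []
    for row in rows:
        ids = {name: row[name] for name in id_columns}
        for variable, value in row.items():
            if variable not in id_columns:
                long_rows.append({**ids, "variable": variable, "value": value})
    return long_rows


long_table = to_long(wide, ("date", "site"))
```

Adding a new variable to the long form adds rows rather than a column, so the table's attributes stay the same across revisions.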
Scientific domain
Is there a preferred data format within your scientific domain? If so, consider using it.
Domain formats simplify integration with similar data and often have support
infrastructure built around them.
Some domain formats:
● Community survey data
● Meteorology
Best Practices for data packages, selected scientific domains
File format
What file format are the data encoded in? Similar formats should be grouped
together. Multiple formats are welcome when serving different use cases.
Proprietary and unusual formats should be converted to open and common formats
to promote access.
Decision trees
We’ve summarized these “factors to consider when organizing data” into three
decision trees focused on answering the questions:
● Should these data be archived?
● Do these data belong in the same data package?
● Do these data belong in the same data object?
While an oversimplification, the decision trees are a helpful starting place.
Should these data be archived?
Duplicate?
  Yes: Don’t archive
  No: Valuable?
    Yes: Archive
    No: Don’t archive
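One reading of this tree, written as a Python function. This is a sketch of the simplified tree only, under the assumption that a duplicate is never archived and a non-duplicate is archived when it is judged valuable; the exceptions discussed under Duplication and Value are not modeled.

```python
def should_archive(is_duplicate: bool, is_valuable: bool) -> bool:
    """Simplified archive decision: not a duplicate, and valuable."""
    if is_duplicate:
        return False      # exact copies are not republished
    return is_valuable    # otherwise archive only if the data have value


# Example: new, valuable data are archived; an exact copy is not.
print(should_archive(is_duplicate=False, is_valuable=True))   # True
print(should_archive(is_duplicate=True, is_valuable=True))    # False
```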
Do these data belong in the same package?
[Decision tree: yes/no questions (Same collection status? Same theme? Same location? Same methods? Same processing level? Big data? Related? Similar methods?) leading to “Same package” or “Different package”.]
Do these data belong in the same object?
Same temporal frequency (rate of sampling)?
  No: Different object
  Yes: Same structure?
    Yes: Same object
    No: Different object
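The two questions in this tree (same sampling frequency, same structure) can be sketched as a single check. Representing "structure" as an ordered list of column names and "frequency" as a label such as "hourly" are assumptions made for this example.

```python
def same_object(freq_a, freq_b, columns_a, columns_b) -> bool:
    """Two datasets share an object only if both frequency and structure match."""
    return freq_a == freq_b and tuple(columns_a) == tuple(columns_b)


# Hypothetical datasets described by sampling frequency and column names.
print(same_object("hourly", "hourly", ["time", "temp_c"], ["time", "temp_c"]))  # True
print(same_object("hourly", "daily", ["time", "temp_c"], ["time", "temp_c"]))   # False
```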
Summary
Always publish the minimally processed data so future users can apply processing
methods required by their research. Beyond this, consider publishing your data in a
way that optimizes reuse.
There are many factors to consider when organizing data into publishable units,
including: duplication, value, theme, methodology, location, relation, volume, collection
status, temporal frequency, file format, structure, processing level, and scientific
domain.
Competing factors can be resolved by applying your scientific expertise and
perspective as a data user, and by doing what works best for you and your community.

EDI Training Module 4: Organizing Data Into Publishable Units

  • 1. Organizing Data into Publishable Units (Phase 1)
  • 2. Background: Well-organized data publications optimize understanding and reuse. Beyond an organization scheme that meets your immediate needs, you’ll generally want to publish data as interoperable “data packages” that can be combined in unforeseen ways to answer future scientific questions. This optimization can be challenging and requires your expertise and discretion.
  • 3. Objectives: Become familiar with factors underlying data package organization. Be confident in resolving competing factors when necessary. Create a plan to organize your data for publication.
  • 4. What is a data package? Data Package (noun): an assemblage of science metadata and one or more science data objects; data packages include a quality report object and are described by package metadata called a “resource map” (i.e. manifest). Figure: 1. Science Metadata + 2. Science Data + 3. Quality Report + Resource Map = Data Package, annotated “YOU are responsible for this”.
  • 5. What is a data package? Data packages are: ● Immutable - so data and metadata are trustworthy (e.g. to repeat an analysis) ● Versionable - so data can be updated while previous versions remain available ● Citable - assigned a Digital Object Identifier (DOI) for each new package or revision. Figure: same data package diagram as on slide 4.
  • 6. Creating data packages: Generally, you want to publish the minimally processed data first, then build derived products from it. This creates a basis for building derived products in an interoperable manner through “provenance” links. Figure: timeline (axis label “Time”).
  • 7. Using data packages: Generally, you want to publish the minimally processed data first, then build derived products from it. This creates a basis for building derived products in an interoperable manner through “provenance” links. Figure: timeline (axis label “Time”).
  • 11. Exercise! Imagine a massive dataset that will be collected over the next several years, in which physical, chemical, and biological parameters are sampled from a hundred paired land-lake ecosystems at temporal frequencies ranging from minutes to months. These data are to be published quarterly for use in unknown scientific inquiries. What factors would you apply to organize these data into discrete interoperable units to support the wide breadth of science?
  • 12. Factors to consider when organizing data:
      ● Duplication - The data are an exact copy or subset of an existing dataset
      ● Value - The current value and estimated future value of the data
      ● Relation - The degree to which one data object is related to another
      ● Theme - The type of observation collected
      ● Methodology - How the data were created
      ● Location - Where the data were collected
      ● Volume - The data size
      ● Processing level - Degree of modification from the original data
      ● Collection status - Expected sampling and publication frequency
      ● Temporal frequency - The rate of sampling (or should it be rate of publication?)
      ● Structure - Data values and the relationships among them
      ● Scientific domain - A combination of scientific inquiry and measurement type
      ● File format - The data encoding
  • 13. Duplication: Have these exact data been published elsewhere? If so, don’t republish them. Duplicates create maintenance issues and confusion for users. Exception: ● Mutable data in the public domain (example). Consult with data providers before publishing. Note: Derived data are not exact copies and can be archived (example).
  • 14. Value: What is the current value and estimated future value of the data? How likely are they to be reused? Believe it or not, data are not inherently valuable. Some attributes contributing to value: ● Long-term observation ● Large spatial coverage ● Observation of rare events ● High quality measurements ● Can be integrated with similar data ● Unanswered questions remain ● New data for an old hypothesis. When in doubt … archive it!
  • 15. Theme: What thematic category do the data belong to (e.g. biological, chemical, physical)? Themes are often nested within each other and can overlap. Distinctly different themes should be published separately (example-1, example-2, example-3). Exception: ● Group themes when it improves discovery, understanding, and use (example).
  • 16. Methodology: What methods were used to create the data? Data created with identical methods are published together (example) and data created with different methods are published separately (example-1, example-2). A change in methods can alter the accuracy, precision, etc. of a time series and the relation among the data, but splitting the data into separate packages makes them harder to find. Exceptions: ● Grouping identical methods creates data that are too large for upload/download. ● Metadata and structure clearly communicate differences in methods.
  • 17. Relation: What other data (or objects) are closely related, or required for understanding? Adding these facilitates discovery and increases the possibility of reuse. When related data can’t be included, use unique keywords to form a “collection” or links in the metadata. Examples: ● Site information ● Environmental characteristics ● Experimental layout ● Software, programs, scripts ● Instrument calibration reports ● Sampling apparatus schematic
  • 18. Volume: Is the data volume large? Big data are slow to upload for publishing and slow to download for use. Consider breaking the data up into smaller and more manageable units (example-1, example-2, example-3).
  • 19. Location: Were the data collected in the same physical location? If so, they may belong together. Grouping by location can improve discovery of related data. Exceptions: ● Many different methods at the same location “clutter” organization and understanding ● Size and structure may be good reasons to separate (example-1, example-2)
  • 20. Processing level: What level of processing has been applied to the data? Different levels should be separated. Exception: ● Data structure allows multiple levels (example). Note: Always archive the “minimally processed” data so future users can apply methods appropriate to their research. Preserve the original data and append flag columns to communicate known and potential issues (example).
  • 21. Collection status: Is data collection complete or ongoing? If ongoing, append the new data to the time series (example). To simplify revisioning, also consider data structures whose attributes won’t change when new observations are added (example). Exception: ● When size prohibits upload/download, consider grouping the time series by a temporal unit (e.g. year; example).
  • 22. Temporal frequency: What are the sampling and publication frequencies? Different sampling frequencies should be separated for understanding. Different publication frequencies should be organized into separate packages to effectively communicate updates and to reduce repository storage costs. Exception: ● Providing downsampled versions of common frequencies can simplify use (example). Note: Publication should not be more frequent than quarterly. Please consult EDI if you’re an exception.
  • 23. Structure: How are the data structured? If they can’t be easily combined, then publish them as separate data objects or packages. “Wide” tables (example) allow for the most metadata detail but lack the flexibility of “long” tables (long). Note: Databases are not very useful “as is”. It’s better to organize the contents into views, export them as .csv tables, and publish them as separate (but related) packages.
  • 24. Scientific domain: Is there a preferred data format within your scientific domain? If so, consider using it. Domain formats simplify integration with similar data and often have support infrastructure built around them. Some domain formats: ● Community survey data ● Meteorology. See: Best Practices for data packages, selected scientific domains.
  • 25. File format: What file format are the data encoded in? Similar formats should be grouped together. Multiple formats are welcome when serving different use cases. Proprietary and unusual formats should be converted to open and common formats to promote access.
  • 26. Decision trees: We’ve summarized these “factors to consider when organizing data” into three decision trees focused on answering the questions: ● Should these data be archived? ● Do these data belong in the same data package? ● Do these data belong in the same data object? While an oversimplification, the decision trees are a helpful starting place.
  • 27. Should these data be archived? Figure: decision tree asking “Duplicate?” and “Valuable?”, with yes/no branches leading to “Archive” or “Don’t archive”.
  • 28. Do these data belong in the same package? Figure: decision tree asking “Same collection status?”, “Same theme?”, “Same location?”, “Same methods?”, “Same processing level?”, “Big data?”, “Related?”, and “Similar methods?”, with yes/no branches leading to “Same package” or “Different package”.
  • 29. Do these data belong in the same object? Figure: decision tree asking “Same temporal frequency (rate of sampling)?” and “Same structure?”, with yes/no branches leading to “Same object” or “Different object”.
  • 30. Summary: Always publish the minimally processed data so future users can apply processing methods required by their research. Beyond this, consider publishing your data in a way that optimizes reuse. There are many factors to consider when organizing data into publishable units, including: duplication, value, theme, methodology, location, relation, collection status, temporal frequency, file format, structure, processing level, and scientific domain. Competing factors can be resolved by applying your scientific expertise and perspective as a data user, and by doing what works best for you and your community.
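
The archiving and data-object decision trees (slides 27 and 29) can be read as simple yes/no checks. The Python sketch below is one possible reading of those diagrams, assuming that duplicates are never archived, that non-duplicates are archived when valuable, and that data share an object only when both sampling frequency and structure match. The function and parameter names are hypothetical, and the exceptions noted on the individual factor slides are not modeled.

```python
def should_archive(is_duplicate: bool, is_valuable: bool) -> bool:
    """Assumed reading of slide 27: archive only non-duplicate, valuable data."""
    if is_duplicate:
        return False  # duplicates create maintenance issues and confuse users
    return is_valuable


def same_object(same_temporal_frequency: bool, same_structure: bool) -> bool:
    """Assumed reading of slide 29: one data object needs matching frequency and structure."""
    return same_temporal_frequency and same_structure


if __name__ == "__main__":
    # A new, valuable dataset sampled hourly, compared with a monthly table.
    print(should_archive(is_duplicate=False, is_valuable=True))
    print(same_object(same_temporal_frequency=False, same_structure=True))
```

As the slides say, the trees are an oversimplification; a sketch like this is a starting point for discussion rather than a rule to apply mechanically.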

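The processing-level advice (slide 20) to preserve the original data and append flag columns can be sketched as follows. This is a minimal illustration using only the Python standard library; the column names (`temp_c`, `temp_c_flag`), the flag codes (`Q`, `M`), and the plausible range are all hypothetical and would normally be defined in your own metadata.

```python
import csv
import io

# Hypothetical plausible range for the example variable (an assumption, not from the slides).
VALID_RANGE = (-5.0, 40.0)


def add_flag_column(rows, value_col="temp_c", flag_col="temp_c_flag"):
    """Return copies of the rows with a flag column appended; original values stay untouched."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        try:
            value = float(row[value_col])
        except ValueError:
            new_row[flag_col] = "M"  # hypothetical code: missing or non-numeric value
        else:
            low, high = VALID_RANGE
            # hypothetical code: "Q" marks a questionable (out-of-range) value
            new_row[flag_col] = "" if low <= value <= high else "Q"
        flagged.append(new_row)
    return flagged


# Made-up sample data for illustration only.
raw_csv = "date,temp_c\n2020-01-01,4.2\n2020-01-02,99.9\n2020-01-03,NA\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
flagged_rows = add_flag_column(rows)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "temp_c", "temp_c_flag"], lineterminator="\n")
writer.writeheader()
writer.writerows(flagged_rows)
print(out.getvalue())
```

The measured values are written back exactly as they were read, so future users can still apply their own quality-control methods; only the extra column carries the judgment.
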
Editor's Notes

  1. Image: Network of interoperable units.
  2. Image: Interoperability.
  3. The first couple of factors inform whether the data are worth publishing at all. The remainder inform whether the data belong in the same package and the same object.
  4. Warning: Errors introduced when archiving an existing data source will propagate through downstream use.
  5. Determines whether the data belong in the same or different data objects.
  6. Image: Pay to access information. Inaccessible information.
  7. Assumes you have a corpus of data to be organized.