Introduction to the EDI Data Repository
(Phase 4)
Background
Objectives
What is the EDI Data Repository?
● An Internet-accessible, open-access data repository
● Uses the PASTA+ data repository software stack
● Metadata-driven publication workflow
● Generates Digital Object Identifiers (DOIs) for all public data packages
● Supports two DataONE member nodes
● Contains about 44,000 unique data packages
● Stores about 11 TB of data
● Uses Amazon Web Services (AWS) Glacier for offline, off-site storage
History
[Figure: timeline (not to scale) with marked years 2007, 2009, 2010, 2013, 2016, and today, spanning periods labeled LTER Network and EDI. Milestones shown: Early NIS discussions; PASTA development begins; NIS/PASTA user testing and evaluation; LTER NIS Production release; DOIs minted; 1st LTER MN; 2nd LTER MN; DataCite Membership; Transitions to EDI Data Repository; EDI MN; 44,000 Data Packages.]
Architecture
[Figure: PASTA+ SOA components: Gatekeeper, Data Package Manager, Audit Manager, Apache Solr, and data store]
PAS·TA /ˈpästə/ (noun): loose acronym for the Provenance Aware Synthesis Tracking
Architecture; a metadata-driven data repository software stack written in Java; uses a
Service Oriented Architecture (SOA) design pattern with a public Application
Programming Interface (API)
Data Portal
https://portal.edirepository.org/nis
Data package
Data Package (noun): an assemblage of science metadata and one or more science
data objects; data packages include a quality report object and are described by
package metadata called a “resource map” (i.e., a manifest)
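The definition above can be pictured as a small data structure. The following is a minimal sketch only: the class name, field names, and file names are hypothetical, and the real resource map is repository-generated package metadata, not a Python list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataPackage:
    """Illustrative model of a data package (names are assumptions)."""

    science_metadata: str      # e.g. the name of the EML document
    science_data: List[str]    # one or more science data objects
    quality_report: str        # the quality report object

    def __post_init__(self) -> None:
        # The definition requires at least one science data object.
        if not self.science_data:
            raise ValueError("a data package needs one or more science data objects")

    def resource_map(self) -> List[str]:
        """Return a manifest-like listing of everything in the package."""
        return [self.science_metadata, *self.science_data, self.quality_report]


# Hypothetical file names, for illustration only.
package = DataPackage("metadata.xml", ["table.csv"], "report.xml")
manifest = package.resource_map()
```

Here `manifest` lists the metadata first, then each data object, then the quality report, in the same order as the definition.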
[Figure: Science Metadata + Science Data + Quality Report + Resource Map (listing 1. Science Metadata, 2. Science Data, 3. Quality Report) = Data Package. Callout: “YOU are responsible for this”.]
Data package identifiers
Package Identifier (noun): a string value that uniquely identifies the data package
within the EDI Data Repository.
edi.10.1
scope.identifier.revision
● scope: string value that identifies the organization, project, or theme of the data package
● identifier: integer value that uniquely identifies the data package within the namespace of the scope
● revision: integer value, increasing with each new version, that identifies the version of the data package
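The three parts of a package identifier can be pulled apart with a few lines of code. This is a sketch, not part of PASTA+: the function and type names are hypothetical, and it assumes only that the identifier and revision are non-negative integers written after the last two dots.

```python
from typing import NamedTuple


class PackageId(NamedTuple):
    scope: str
    identifier: int
    revision: int


def parse_package_id(text: str) -> PackageId:
    """Split a package identifier such as 'edi.10.1' into its three parts."""
    # Split from the right so the last two fields are identifier and revision.
    parts = text.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected scope.identifier.revision, got {text!r}")
    scope, identifier, revision = parts
    if not scope:
        raise ValueError("scope must not be empty")
    for name, value in (("identifier", identifier), ("revision", revision)):
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return PackageId(scope, int(identifier), int(revision))


parsed = parse_package_id("edi.10.1")
```

For the example on the slide, `parsed.scope` is `"edi"`, `parsed.identifier` is `10`, and `parsed.revision` is `1`.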
Data package versioning
● PASTA+ enforces strong versioning: published data are immutable
● To add or modify metadata or data in a data package, you must upload a new
revision of your EML metadata
● Within the new EML metadata, you must increment the “revision” value of the
package identifier
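As a small illustration of the revision rule, the sketch below computes the package identifier for the next revision. The helper name is hypothetical, and it only produces the new identifier string; placing that value in your EML metadata is still up to you.

```python
def next_revision(package_id: str) -> str:
    """Return the package identifier with its revision increased by one."""
    # The revision is the integer after the last dot.
    prefix, dot, revision = package_id.rpartition(".")
    if not dot or not prefix or not (revision.isascii() and revision.isdigit()):
        raise ValueError(f"cannot find an integer revision in {package_id!r}")
    return f"{prefix}.{int(revision) + 1}"


updated = next_revision("edi.10.1")
```

With the slide's example, `updated` is `"edi.10.2"`; the scope and identifier are unchanged because only the revision marks a new version.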
Data package quality evaluation
A series of quality checks for…
Metadata validation
● Well formed and schema valid
● Content validation (does content match best practices?)
Data validation
● Accessible (can data be downloaded?)
Congruence validation
● Metadata description of data matches physical structure of data (e.g., correct
number of columns and rows, data types, delimiters)
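To make the idea of congruence concrete, here is a rough sketch of one such check: comparing the column and row counts declared in metadata against a delimited text table. It is not the repository's actual implementation; the function name, parameters, and sample data are all hypothetical.

```python
import csv
import io
from typing import List


def congruence_problems(data_text: str, expected_columns: int, expected_rows: int,
                        delimiter: str = ",", header_lines: int = 1) -> List[str]:
    """Compare a delimited table against the structure its metadata declares."""
    problems = []
    rows = list(csv.reader(io.StringIO(data_text), delimiter=delimiter))
    records = rows[header_lines:]  # skip the declared header lines
    if len(records) != expected_rows:
        problems.append(f"expected {expected_rows} records, found {len(records)}")
    # Line numbers assume no field contains an embedded newline.
    for number, record in enumerate(records, start=header_lines + 1):
        if len(record) != expected_columns:
            problems.append(
                f"line {number}: expected {expected_columns} columns, found {len(record)}")
    return problems


sample = "site,temp\nA,1.5\nB,2.0\n"
matching = congruence_problems(sample, expected_columns=2, expected_rows=2)
mismatched = congruence_problems(sample, expected_columns=3, expected_rows=2)
```

An empty list means the table matches what the metadata declared; each entry in a non-empty list names one mismatch.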
The quality evaluation life-cycle
[Figure: evaluation life-cycle. Elements shown: EML; numbered steps 1. to 4. (EML upload, EML validation, Data validation, Congruence validation); a decision point (?); Publish.]
Quality evaluation report
● Valid - quality check meets criteria
● Warn - quality check does not meet criteria, but does not fail upload
● Error - quality check does not meet criteria, results in failed upload
● Info - quality check only provides information
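The four statuses above determine whether an upload goes through. The sketch below tallies the statuses from a report and applies the rule that any Error fails the upload; the function name and the plain-string representation of statuses are assumptions, not the actual report format.

```python
from collections import Counter
from typing import Dict, Iterable

STATUSES = ("valid", "warn", "error", "info")


def summarize_report(check_statuses: Iterable[str]) -> Dict[str, object]:
    """Count each status and decide whether the upload would fail."""
    counts = Counter(status.lower() for status in check_statuses)
    unknown = sorted(set(counts) - set(STATUSES))
    if unknown:
        raise ValueError(f"unknown status values: {unknown}")
    return {
        "counts": {status: counts.get(status, 0) for status in STATUSES},
        # Only "error" fails an upload; "warn" and "info" do not.
        "upload_fails": counts.get("error", 0) > 0,
    }


passing = summarize_report(["Valid", "Warn", "Info", "Valid"])
failing = summarize_report(["Valid", "Error"])
```

A report containing warnings still passes; a single Error is enough to fail.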
Repository “environments”
If I were a Worldly God...
● Development Earth
● Staging Earth
● Production Earth
Summary
Summary of objectives

EDI Training Module 10: EDI Data Repository Overview