Enabling Precise Identification and
Citability of Dynamic Data:
Recommendations of the
RDA Working Group on Data Citation
Andreas Rauber
Vienna University of Technology
Favoritenstr. 9-11/188
1040 Vienna, Austria
rauber@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at/~andi
Motivation
 Research data is fundamental for science
- Results are based on data
- Data serves as input for workflows and experiments
- Data is the source for graphs and visualisations in publications
 Data is needed for Reproducibility
- Repeat experiments
- Verify / compare results
 Need to provide specific data set
- Service for data repositories
 Put data in data repository,
Assign PID (DOI, Ark, URI, …)
Make it accessible
 done?
Source: open.wien.gv.at
https://commons.wikimedia.org/w/index.php?curid=30978545
Identification of Dynamic Data
 Usually, datasets have to be static
- Fixed set of data, no changes:
no corrections to errors, no new data being added
 But: (research) data is dynamic
- Adding new data, correcting errors, enhancing data quality, …
- Changes sometimes highly dynamic, at irregular intervals
 Current approaches
- Identifying entire data stream, without any versioning
- Using “accessed at” date
- “Artificial” versioning by identifying batches of data (e.g.
annual), aggregating changes into releases (time-delayed!)
 Would like to identify precisely the data
as it existed at a specific point in time
Granularity of Subsets
 What about the granularity of data to be identified?
- Enormous amounts of CSV data
- Researchers use specific subsets of data
- Need to identify precisely the subset used
 Current approaches
- Storing a copy of the subset as used in the study -> scalability
- Citing entire dataset, providing textual description of subset
-> imprecise (ambiguity)
- Storing list of record identifiers in subset -> scalability,
not for arbitrary subsets (e.g. when not entire record selected)
 Would like to be able to identify precisely the
subset of (dynamic) data used in a process
What we do NOT want…
 Common approaches to data management…
(from PhD Comics: A Story Told in File Names, 28.5.2010)
Source: http://www.phdcomics.com/comics.php?f=1323
 Research Data Alliance
 WG on Data Citation:
Making Dynamic Data Citeable
 March 2014 – September 2015
- Concentrating on the problems of
large, dynamic (changing) datasets
 Final version presented Sep 2015
at P7 in Paris, France
 Endorsed September 2016
at P8 in Denver, CO
https://www.rd-alliance.org/groups/data-citation-wg.html
RDA WG Data Citation
RDA WGDC - Solution
 We have
‒ Data & some means of access („query“)
 Make
‒ Data: time-stamped and versioned
‒ Query: timestamped
 Data Citation:
‒ Store query
‒ Assign the PID to the timestamped query
(which, dynamically, leads to the data)
 Access:
Re-execute query on versioned data according to timestamp
 Dynamic Data Citation:
Dynamic data & dynamic citation of data
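A minimal sketch of the idea in SQL (table and column names are illustrative, not part of the recommendations): data changes are only ever recorded with timestamps, and a stored, timestamped query is re-executed by adding a time predicate.

```sql
-- Hypothetical versioned table: every row carries its validity interval (R1, R2).
CREATE TABLE measurements (
  id         INTEGER,
  value      NUMERIC,
  valid_from TIMESTAMPTZ NOT NULL DEFAULT now(),  -- set when the row is added
  valid_to   TIMESTAMPTZ                          -- NULL while current; set on delete/update
);

-- A deletion is recorded, not executed physically:
UPDATE measurements SET valid_to = now() WHERE id = 42 AND valid_to IS NULL;

-- Re-executing a query stored with timestamp t means adding a time predicate:
SELECT id, value
FROM   measurements
WHERE  valid_from <= TIMESTAMPTZ '2016-09-15 12:00:00+00'
  AND (valid_to IS NULL OR valid_to > TIMESTAMPTZ '2016-09-15 12:00:00+00')
ORDER BY id;
```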
Data Citation – Deployment
 Researcher uses workbench to identify subset of data
 Upon executing selection („download“) user gets
− Data (package, access API, …)
− PID (e.g. DOI) (Query is time-stamped and stored)
− Hash value computed over the data for local storage
− Recommended citation text (e.g. BibTeX)
 PID resolves to landing page
− Provides detailed metadata, link to parent data set, subset,…
− Option to retrieve original data OR current version OR changes
 Upon activating PID associated with a data citation
− Query is re-executed against time-stamped and versioned DB
− Results as above are returned
 Query store aggregates data usage
Note: query string provides excellent provenance information on the data set!
This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers / a DB dump!
Identify which parts of the data are used. If data changes, identify which queries (studies) are affected.
Data Citation – Output
 14 Recommendations
grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data
infrastructure
 2-page flyer
https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.htm
 More detailed report: IEEE TCDL 2016
http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf
Data Citation – Recommendations
Preparing Data & Query Store
- R1 – Data Versioning
- R2 – Timestamping
- R3 – Query Store
When Data should be persisted
- R4 – Query Uniqueness
- R5 – Stable Sorting
- R6 – Result Set Verification
- R7 – Query Timestamping
- R8 – Query PID
- R9 – Store Query
- R10 – Citation Text
When Resolving a PID
- R11 – Landing Page
- R12 – Machine Actionability
Upon Modifications to the
Data Infrastructure
- R13 – Technology Migration
- R14 – Migration Verification
Data Citation – Recommendations
A) Preparing the Data and the Query Store
R1 – Data Versioning: Apply versioning to ensure that earlier
states of data sets can be retrieved
R2 – Timestamping: Ensure that operations on data are
timestamped, i.e. any additions, deletions are marked with a
timestamp
R3 – Query Store: Provide means to store the queries and
metadata to re-execute them in the future
Note:
• R1 & R2 are already pretty much standard in many (RDBMS-) research databases
• Different ways to implement
• A bit more challenging for some data types (XML, LOD, …)
Note:
• R3: query store usually pretty small, even for extremely high query volumes
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (1/2)
When a data set should be persisted:
R4 – Query Uniqueness: Re-write the query to a normalized form so
that identical queries can be detected. Compute a checksum of the
normalized query to efficiently detect identical queries
R5 – Stable Sorting: Ensure an unambiguous sorting of the records
in the data set
R6 – Result Set Verification: Compute fixity information/checksum
of the query result set to enable verification of the correctness of a
result upon re-execution
R7 – Query Timestamping: Assign a timestamp to the query based
on the last update to the entire database (or the last update to the
selection of data affected by the query or the query execution time).
This allows retrieving the data as it existed at query time
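A sketch of how R4–R6 can be realized with plain SQL checksums (the normalization step itself is system-specific; the query text and sample rows below are illustrative):

```sql
-- R4: checksum of the normalized query text, to detect identical queries.
SELECT md5('SELECT id, value FROM measurements WHERE value > 10 ORDER BY id')
       AS query_checksum;

-- R5 + R6: checksum over the result set, computed on unambiguously sorted
-- rows, so a re-execution can be verified (inline sample rows stand in for
-- a real result set).
SELECT md5(string_agg(id || '|' || value, ';' ORDER BY id)) AS result_checksum
FROM (VALUES (1, 12.5), (2, 47.0)) AS t(id, value);
```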
Data Citation – Recommendations
B) Persistently Identify Specific Data sets (2/2)
When a data set should be persisted:
R8 – Query PID: Assign a new PID to the query if either the query
is new or if the result set returned from an earlier identical query is
different due to changes in the data. Otherwise, return the existing
PID
R9 – Store Query: Store query and metadata (e.g. PID, original
and normalized query, query & result set checksum, timestamp,
superset PID, data set description and other) in the query store
R10 – Citation Text: Provide citation text including the PID in the
format prevalent in the designated community to lower barrier for
citing data.
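A minimal query store sketch covering R8–R10 (all names and values are illustrative, not a prescribed schema):

```sql
-- Minimal query store (R9): one row per citable query.
CREATE TABLE query_store (
  pid              TEXT PRIMARY KEY,      -- R8: PID assigned to the query
  original_query   TEXT NOT NULL,
  normalized_query TEXT NOT NULL,
  query_checksum   TEXT NOT NULL,         -- R4
  result_checksum  TEXT NOT NULL,         -- R6
  query_timestamp  TIMESTAMPTZ NOT NULL,  -- R7
  superset_pid     TEXT,                  -- PID of the parent data set
  description      TEXT,
  citation_text    TEXT                   -- R10
);

-- R8: reuse an existing PID only if both the normalized query and the result
-- checksum match an earlier citation; otherwise mint a new PID.
SELECT pid
FROM   query_store
WHERE  query_checksum  = md5('SELECT id FROM m ORDER BY id')     -- illustrative
  AND  result_checksum = 'replace-with-computed-result-checksum'; -- illustrative
```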
Data Citation – Recommendations
C) Resolving PIDs and Retrieving Data
R11 – Landing Page: Make the PIDs resolve to a human readable
landing page that provides the data (via query re-execution) and
metadata, including a link to the superset
(PID of the data source) and citation text snippet
R12 – Machine Actionability: Provide an API / machine actionable
landing page to access metadata and data via query re-execution
Data Citation – Recommendations
D) Upon Modifications to the Data Infrastructure
R13 – Technology Migration: When data is migrated to a new
representation (e.g. new database system, a new schema or a
completely different technology), migrate also the queries and
associated checksums
R14 – Migration Verification: Verify successful data and query
migration, ensuring that queries can be re-executed correctly
Pilot Adopters
 Several adopters
- Different types of data, different settings, …
- CSV & SQL reference implementation (SBA/TUW)
 Pilots:
- Biomedical BigData Sharing, Electronic Health Records
(Center for Biomedical Informatics, Washington Univ. in St. Louis)
- Marine Research Data
Biological & Chemical Oceanography Data Management Office
(BCO-DMO)
- Vermont Monitoring Cooperative: Forest Ecosystem Monitoring
- Argo Buoy Network, British Oceanographic Data Centre (BODC)
- Virtual Atomic and Molecular Data Centre (VAMDC)
- UK Riverflow Archive, Centre for Ecology and Hydrology
WG Data Citation Pilot
CBMI @ WUSTL
Cynthia Hudson Vitale, Leslie McIntosh,
Snehil Gupta
Washington University in St. Louis
Biomedical Adoption Project Goals
RDA-MacArthur Grant Focus
[Diagram: researchers access data through a data broker]
R1 and R2 Implementation
[Diagram: the PostgreSQL extension "temporal_tables" keeps RDC.table (columns c1, c2, c3 plus a sys_period column) current, while triggers move superseded rows into RDC.hist_table, which stores the history of data changes]
Return on Investment (ROI) -
Estimated
 20 hours to complete 1 study
 $150/hr (unsubsidized)
 $3000 per study
 115 research studies per year
 14 replication studies
Adoption of Data Citation Outcomes
by BCO-DMO
Cynthia Chandler, Adam Shepherd
 An existing repository (http://bco-dmo.org/)
 Marine research data curation since 2006
 Faced with new challenges, but no new funding
 e.g. data publication practices to support citation
 Used the outcomes from the RDA Data Citation
Working Group to improve data publication and citation
services
A story of success enabled by RDA
https://www.rd-alliance.org/groups/data-citation-wg.html
Adoption of Data Citation Outputs
 Evaluation
- Evaluate recommendations (done December 2015)
- Try implementation in existing BCO-DMO architecture
(work began 4 April 2016)
 Trial
- BCO-DMO: R1-11 fit well with current architecture; R12 doable;
test as part of DataONE node membership; R13-14 are
consistent with Linked Data approach to data publication and
sharing
 Timeline:
- Redesign/prototype completed by 1 June 2016
- New citation recommendation by 1 Sep 2016
- Report out at RDA P8 (Denver, CO) September 2016
- Final report by 1 December 2016
Vermont Monitoring Cooperative
James Duncan, Jennifer Pontius
VMC
Ecosystem Monitoring
Collaborator Network
Data Archive, Access and
Integration
User Workflow to Date
 Modify a dataset
 changes tracked
 original data table unchanged
 Commit to version, assign name
 computes result hash (table pkid, col
names, first col data) and query hash
 updates data table to new state
 formalizes version
 Restore previous version
 creates new version table from current
data table state,
 walks it back using stored SQL.
 Garbage collected after a period of time
UK Riverflow Archive
Matthew Fry
mfry@ceh.ac.uk
Global Water Information Interest Group meeting
RDA 8th Plenary, 15th September 2016, Denver
UK National River Flow Archive
• ORACLE Relational database
• Time series and metadata tables
• ~20M daily flow records, + monthly / daily catchment rainfall series
• Metadata (station history, owners, catchment soils / geology, etc.)
• Total size of ~5GB
• Time series tables automatically audited,
• But reconstruction is complex
• Users generally download simple files
• But public API is in development / R-NRFA package is out there
• Fortunately all access is via single codeset
Global Water Information Interest Group meeting
RDA 8th Plenary, 15th September 2016, Denver
Versioning / citation solution
• Automated archiving of entire database – version controlled scripts
defining tables, creating / populating archived tables (largely complete)
• Fits in with data workflow – public / dev versions – this only works because
we have irregular / occasional updates
• Simplification of the data model (complete)
• API development (being undertaken independently of dynamic citation
requirements):
• allows subsetting of dataset in a number of ways – initially simply
• need to implement versioning (started) to ensure will cope with changes to data
structures
• Fit to dynamic data citation recommendations?
• Largely
• Need to address mechanism for users to request / create citable version of a query
• Resource required: estimated ~2 person months
Reference Implementation for
CSV Data (and SQL)
Stefan Pröll, SBA
CSV/SQL Reference Implementation 1
 Reference Implementation available on Github
https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
 Upload interface --> upload CSV files
 Migrate CSV file into RDBMS
−Generate table structure, identify primary key
−Add metadata columns for versioning (transparent)
−Add indices (transparent)
 Dynamic data: upload new version of file
−Versioned update / delete existing records in RDBMS
 Access interface
−Track subset creation
−Store queries -> PID + Landing Page
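A sketch of what the migration step could look like (PostgreSQL syntax for illustration; the prototype itself targets MySQL, and all names are illustrative):

```sql
-- Table generated from the uploaded CSV (one TEXT column per CSV column),
-- with transparent versioning metadata and a generated primary key.
CREATE TABLE csv_data (
  row_id      SERIAL PRIMARY KEY,
  col_a       TEXT,
  col_b       TEXT,
  inserted_at TIMESTAMPTZ NOT NULL DEFAULT now(),  -- versioning metadata (transparent)
  deleted_at  TIMESTAMPTZ                          -- NULL while the row is current
);
CREATE INDEX ON csv_data (inserted_at, deleted_at); -- transparent index

-- Bulk-load the uploaded file (MySQL's LOAD DATA INFILE is the equivalent):
COPY csv_data (col_a, col_b) FROM '/tmp/upload.csv' WITH (FORMAT csv, HEADER true);
```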
CSV/Git Reference Implementation 2
 Based on Git only (no SQL database)
 Upload CSV files to Git repository
 SQL-style queries operate on CSV file via API
 Data versioning with Git
 Store scripts versioned as well
 Make subset creation reproducible
CSV/Git Reference Implementation 2
 Stefan Pröll, Christoph Meixner, Andreas Rauber
Precise Data Identification Services for Long Tail Research Data.
Proceedings of the International Conference on Preservation of Digital Objects
(iPRES 2016), Oct. 3-6, 2016, Bern, Switzerland.
 Source at Github:
https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
 Videos:
 Login: https://youtu.be/EnraIwbQfM0
 Upload: https://youtu.be/xJruifX9E2U
 Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ
 Resolver: https://youtu.be/FHsvjsUMiiY
 Update: https://youtu.be/cMZ0xoZHUyI
Benefits
 Retrieval of precise subset with low storage overhead
 Subset as cited or as it is now (including e.g. corrections)
 Query provides provenance information
 Query store supports analysis of data usage
 Checksums support verification
 Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
 Would work also for more sophisticated/general
transformations on data beyond select/project
Thanks!
https://rd-alliance.org/working-groups/data-citation-wg.html
Thank you!
Reference Implementation for
CSV Data (and SQL)
Stefan Pröll, SBA
 RDA recommendations implemented in data infrastructures
 Required adaptations
- Introduce versioning, if not already in place
- Capture sub-setting process (queries)
- Implement dedicated query store to
store queries
- A bit of additional functionality
(query re-writing, hash functions, …)
 Done! ?
- “Big data”, database driven
- Well-defined interfaces
- Trained experts available
- “Complex, only for professional research infrastructures” ?
Large Scale Research Settings
Long Tail Research Data
[Chart: amount of data sets vs. data set size: a few large, well organized, often used and cited data sets; many smaller, less well organized, non-standardised data sets with no dedicated infrastructure ("dark data")]
[1] Heidorn, P. Bryan. "Shedding light on the dark data in the long tail of science." Library Trends 57.2 (2008): 280-299.
Prototype Implementations
 Solution for small-scale data
- CSV files, no “expensive” infrastructure, low overhead
Two reference implementations:
 Git based Prototypes: widely used versioning system
- A) Using separate folders
- B) Using branches
 MySQL based Prototype:
- C) Migrates CSV data into relational database
 Data backend responsible for versioning data sets
 Subsets are created with scripts or queries
via API or Web Interface
 Transparent to user: always CSV
CSV Reference Implementation 2
Git Implementation 1
Upload CSV files to Git repository (versioning)
Subsets created via scripting language (e.g. R)
 Select rows/columns, sort, returns CSV + metadata file
 Metadata file with script parameters stored in Git
 (Scripts stored in Git as well)
PID assigned to metadata file
 Use Git to retrieve proper data set version
and re-execute script on retrieved file
Git-based Prototype
Git Implementation 2
Addresses issues
- common commit history, branching data
Using Git branching model:
Orphaned branches for queries and data
- Keeps commit history clean
- Allows merging of data files
Web interface for queries (CSV2JDBC)
Use commit hash for identification
- Assigned PID hashed with SHA1
- Use hash of PID as filename (ensure permissible characters)
Git-based Prototype
Step 1: Select a CSV
file in the repository
Step 2: Create a
subset with a SQL
query (on CSV data)
Step 3: Store the query
script and metadata
Step 4: Re-Execute!
 Prototype: https://github.com/Mercynary/recitable
MySQL-based Prototype
MySQL Prototype
Data upload
- User uploads a CSV file into the system
Data migration from CSV file into RDBMS
- Generate table structure
- Add metadata columns (versioning)
- Add indices (performance)
Dynamic data
- Insert, update and delete records
- Events are recorded with a timestamp
Subset creation
- User selects columns, filters and sorts records in web interface
- System traces the selection process
- Exports CSV
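Conceptually, the traced selection is replayed as a timestamped query over the versioned table; a sketch (names follow the illustrative csv_data sketch earlier; the prototype's actual schema may differ):

```sql
-- The interface records the selected columns, filters and sort order; the
-- traced subset is equivalent to a timestamped query on the versioned table:
SELECT col_a, col_b
FROM   csv_data
WHERE  col_b = 'station-7'                            -- user's filter (illustrative)
  AND  inserted_at <= TIMESTAMPTZ '2016-09-15 12:00:00+00'
  AND (deleted_at IS NULL OR deleted_at > TIMESTAMPTZ '2016-09-15 12:00:00+00')
ORDER BY row_id;                                      -- stable sort before CSV export
```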
MySQL-based Prototype
 Source at Github:
‒ https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
 Videos:
‒ Login: https://youtu.be/EnraIwbQfM0
‒ Upload: https://youtu.be/xJruifX9E2U
‒ Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ
‒ Resolver: https://youtu.be/FHsvjsUMiiY
‒ Update: https://youtu.be/cMZ0xoZHUyI
CSV Reference Implementation 2
 Stefan Pröll, Christoph Meixner, Andreas Rauber
Precise Data Identification Services for Long Tail Research Data.
Proceedings of the International Conference on Preservation of Digital Objects
(iPRES 2016), Oct. 3-6, 2016, Bern, Switzerland.
 Source at Github:
https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
 Videos:
 Login: https://youtu.be/EnraIwbQfM0
 Upload: https://youtu.be/xJruifX9E2U
 Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ
 Resolver: https://youtu.be/FHsvjsUMiiY
 Update: https://youtu.be/cMZ0xoZHUyI
WG Data Citation Pilot
CBMI @ WUSTL
Cynthia Hudson Vitale, Leslie McIntosh,
Snehil Gupta
Washington University in St. Louis
Moving Biomedical Big Data
Sharing Forward
An adoption of the RDA Data Citation of Evolving Data
Recommendation to Electronic Health Records
Leslie McIntosh, PhD, MPH and Cynthia Hudson Vitale, MA
RDA P8, Denver, USA, September 2016
@mcintold
@cynhudson
Background
 Leslie McIntosh, PhD, MPH (Director, Center for Biomedical Informatics)
 Cynthia Hudson Vitale, MA (Data Services Coordinator)
BDaaS: Biomedical Data as a Service
[Diagram: researchers query the biomedical data repository through a data broker and the i2b2 application, moving some of the responsibility for reproducibility from the biomedical researcher to the biomedical pipeline]
RDA/MacArthur Grant
Biomedical Adoption Project Goals
Implement RDA Data Citation WG recommendation to
local Washington U i2b2
Engage other i2b2 community adoptees
Contribute source code back to i2b2 community
RDA Data Citation WG
Recommendations
R1: Data Versioning
R2: Data Timestamping
R3, R9: Query Store
R7: Query Timestamping
R8: Query PID
R10: Query Citation
Internal Implementation
Requirements
 Scalable
 Available for PostgreSQL
 Actively supported
 Easy to maintain
 Easy for data brokers to use
RDA-MacArthur Grant Focus
[Diagram: researchers access data through a data broker]
R1 and R2 Implementation
[Diagram: the PostgreSQL extension "temporal_tables" keeps RDC.table (columns c1, c2, c3 plus a sys_period column) current, while triggers move superseded rows into RDC.hist_table, which stores the history of data changes]
ETL Incrementals
[Diagram: when source data updates an existing row, the current row in RDC.table gets sys_period (2016-9-9 00:00, NULL) and the superseded row, with sys_period (2016-9-8 00:00, 2016-9-9 00:00), moves to RDC.hist_table; when source data inserts a new row, it is simply added to RDC.table with sys_period (2016-9-9 00:00, NULL)]
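A sketch of this setup following the temporal_tables extension's documented usage (schema and table names here are illustrative, not the pilot's actual DDL):

```sql
-- Enable the extension and create a versioned table with its history table.
CREATE EXTENSION IF NOT EXISTS temporal_tables;

CREATE SCHEMA IF NOT EXISTS rdc;
CREATE TABLE rdc.t (
  c1 INTEGER,
  c2 TEXT,
  c3 TEXT,
  sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, NULL)
);
CREATE TABLE rdc.t_history (LIKE rdc.t);   -- stores the history of data changes

-- The extension's versioning() trigger closes the old row's sys_period and
-- copies it into the history table on every UPDATE/DELETE.
CREATE TRIGGER versioning_trigger
BEFORE INSERT OR UPDATE OR DELETE ON rdc.t
FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'rdc.t_history', true);

-- As in the diagram: the superseded row moves to rdc.t_history, the current
-- row keeps a sys_period that is open-ended (upper bound NULL).
INSERT INTO rdc.t (c1, c2, c3) VALUES (1, 'a', 'b');
UPDATE rdc.t SET c2 = 'corrected' WHERE c1 = 1;
```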
R3, R7, R8, R9, and R10 Implementation
[Diagram: on top of the PostgreSQL extension "temporal_tables", RDC.table and RDC.hist_table are combined in a view RDC.table_with_history; functions, triggers and query audit tables implement the query store]
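Continuing the previous sketch, the combined view and a minimal query audit table could look as follows (again illustrative, not the pilot's actual code):

```sql
-- One view over current and historical rows (R1/R2 made queryable).
CREATE VIEW rdc.t_with_history AS
  SELECT * FROM rdc.t
  UNION ALL
  SELECT * FROM rdc.t_history;

-- Minimal query audit table (R3, R7, R8, R9).
CREATE TABLE rdc.query_audit (
  query_pid   TEXT PRIMARY KEY,                   -- R8
  query_text  TEXT NOT NULL,                      -- R9
  executed_at TIMESTAMPTZ NOT NULL DEFAULT now()  -- R7
);

-- Re-execution: restrict the view to rows whose validity period contains
-- the stored query timestamp.
SELECT c1, c2, c3
FROM   rdc.t_with_history
WHERE  sys_period @> TIMESTAMPTZ '2016-09-09 00:00:00+00';
```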
Data Reproducibility Workflow
Bonus Feature: Determine if Change
Occurred
Future Developments
 Develop a process for sharing Query PID with
researchers in an automated way
 Resolve Query PIDs to a landing page with
Query metadata
 Implement research reproducibility requirements
in other systems as possible
Outcomes and Support
Obtained Outcomes
 Implemented WG recommendations
 Engaged with other i2b2 adoptees
(Harvard, Nationwide Children’s Hospital)
Dissemination
 Poster presentation (Harvard U, July 2016)
 Scientific manuscript based on our proof of concept submitted to the
AMIA TBI/CRI 2017 conference
 Sharing the code with the community
Return on Investment (ROI) -
Estimated
 20 hours to complete 1 study
 $150/hr (unsubsidized)
 $3000 per study
 115 research studies per year
 14 replication studies
Funding Support
 MacArthur Foundation 2016 Adoption Seeds program,
through a sub-contract with the Research Data Alliance
 Washington University Institute of Clinical and Translational Sciences
 NIH CTSA Grant Number UL1TR000448 and UL1TR000448-09S1
 Siteman Cancer Center at Washington University
 NIH/NCI Grant P30 CA091842-14
 Center for Biomedical Informatics @WUSTL
Teams for Reproducible Research
WashU CBMI Research Reproducibility
Resources
Adoption of Data Citation Outcomes
by BCO-DMO
Cynthia Chandler, Adam Shepherd
 An existing repository (http://bco-dmo.org/)
 Marine research data curation since 2006
 Faced with new challenges, but no new funding
 e.g. data publication practices to support citation
 Used the outcomes from the RDA Data Citation
Working Group to improve data publication and citation
services
A story of success enabled by RDA
https://www.rd-alliance.org/groups/data-citation-wg.html
 BCO-DMO is a thematic, domain-specific repository
funded by NSF Ocean Sciences and Polar Programs
 BCO-DMO curated data are
- Served: http://bco-dmo.org (URLs, URIs)
- Published: at an Institutional Repository (CrossRef DOI)
http://dx.doi.org/10.1575/1912/4847
- Archived: at NCEI, a US National Data Center
http://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0078575
BCO-DMO Curated Data
for Linked Data URI: http://lod.bco-dmo.org/id/dataset/3046
BCO-DMO Dataset Landing Page (Mar ‘16)
Initial Architecture Design Considerations (Jan 2016)
Modified Architecture (March 2016)
BCO-DMO Data Publication System
Components
BCO-DMO publishes data to WHOAS and a DOI is assigned.
The BCO-DMO architecture now supports data versioning.
BCO-DMO Data Citation System Components
BCO-DMO Data Set Landing Page
BCO-DMO Data Set Landing Page
Linked to Publication via DOI
New Capabilities … BCO-DMO becoming a DataONE Member Node
https://search.dataone.org/
New Capabilities … BCO-DMO Data Set Citation
 To the Data Citation Working Group for their efforts
https://www.rd-alliance.org/groups/data-citation-wg.html
 RDA US for funding this adoption project
 TIMELINE:
- Redesign/prototype completed by 1 June 2016
- New citation recommendation by 1 Sep 2016
- Report out at RDA P8 (Denver, CO) September 2016
- Final report by 1 December 2016
Cyndy Chandler @cynDC42 @bcodmo
ORCID: 0000-0003-2129-1647 cchandler@whoi.edu
Thank you …
 Removed these to reduce talk to 10-15 minutes
EXTRA SLIDES
Adoption of Data Citation Outputs
 Evaluation
- Evaluate recommendations (done December 2015)
- Try implementation in existing BCO-DMO architecture
(work began 4 April 2016)
 Trial
- BCO-DMO: R1-11 fit well with current architecture; R12 doable;
test as part of DataONE node membership; R13-14 are consistent
with Linked Data approach to data publication and sharing
NOTE: adoption grant received from RDA US (April 2016)
RDA Data Citation (DC) of evolving
data
 DC goals: to create identification mechanisms that:
- allow us to identify and cite arbitrary views of data, from a single
record to an entire data set in a precise, machine-actionable
manner
- allow us to cite and retrieve that data as it existed at a certain
point in time, whether the database is static or highly dynamic
 DC outcomes: 14 recommendations and associated
documentation
- ensuring that data are stored in a versioned and timestamped
manner
- identifying data sets by storing and assigning persistent
identifiers (PIDs) to timestamped queries that can be re-executed
against the timestamped data store
RDA Data Citation WG
Recommendations
»» Data Versioning: For retrieving earlier
states of datasets the data need to be
versioned. Markers shall indicate inserts,
updates and deletes of data in the database.
»» Data Timestamping: Ensure that operations
on data are timestamped, i.e. any additions,
deletions are marked with a
timestamp.
»» Data Identification: The data used shall be
identified via a PID pointing to a time-stamped
query, resolving to a landing page.
Oct 2015 version w/ 14 recommendations
DC WG chairs: Andreas Rauber, Ari Asmi,
Dieter van Uytvanck
procedure: when a BCO-DMO data set is updated …
A copy of the previous version is preserved
Request a DOI for the new version of data
Publish data, and create new landing page for new
version of data, with new DOI assigned
BCO-DMO database has links to all versions of the data
(archived and published)
Both archive and published dataset landing pages have
links back to best version of full dataset at BCO-DMO
BCO-DMO data set landing page displays links to all
archived and published versions
New capability
(implemented)
 Extended description of recommendations
 Altman and Crosas. 2013. “Evolution of Data Citation …”
 CODATA-ICSTI 2013. “Out of cite, out of mind”
 FORCE11 https://www.force11.org/about/mission-and-guiding-principles
 R. E. Duerr, et al. “On the utility of identification schemes for digital
earth science data”, ESI, 2011.
REFERENCES
Vermont Monitoring Cooperative
James Duncan, Jennifer Pontius
VMC
Implementation of Dynamic Data Citation
James Duncan and Jennifer Pontius
9/16/2016
james.duncan@uvm.edu, www.uvm.edu/vmc
Ecosystem Monitoring
Collaborator Network
Data Archive, Access and
Integration
Many Disciplines, Many
Contributors
VMC houses any data related to forest
ecosystem condition, regardless of affiliation or
discipline
Why We Need It
 Continually evolving
datasets
 Some errors not caught
till next field season
 Frequent reporting and
publishing
Dynamic Data Citation –
Features Needed
 Light footprint on database resources
 Works on top of existing catalog and metadata
 Works in an institutionally managed PHP/MySQL
environment
 User-driven control of what quantity of change
constitutes a version
 Integration with management portal
 Track granular changes in data
User Workflow to Date
 Modify a dataset
- changes tracked
- original data table unchanged
 Commit to version, assign name
- computes result hash (table pkid, col
names, first col data) and query hash
- updates data table to new state
- formalizes version
 Restore previous version
- creates new version table from current
data table state,
- walks it back using stored SQL.
- Garbage collected after a period of time
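A sketch of the described hashing, shown in PostgreSQL syntax for illustration (the VMC system runs on PHP/MySQL; table and column names are hypothetical):

```sql
-- Result hash over the table's primary-key ids, its column names and the
-- first data column (inline sample rows stand in for a real VMC table):
WITH plot_data(pkid, site, value) AS (
  VALUES (1, 'Lye Brook', 3.2), (2, 'Mount Mansfield', 4.7)
)
SELECT md5(
         (SELECT string_agg(pkid::text, ',' ORDER BY pkid) FROM plot_data)
      || '|pkid,site,value'
      || (SELECT string_agg(site, ',' ORDER BY pkid) FROM plot_data)
       ) AS result_hash;

-- Query hash: checksum of the stored query text.
SELECT md5('SELECT pkid, site, value FROM plot_data ORDER BY pkid') AS query_hash;
```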
User Workflow, still to come
 Web-based data editing
validation
 DOI minting integration
 Public display
 Subsetting workflow
 Other methods of data
modification?
 Upgrade to rest of
system
[Diagram: dataset versions V1.1, V1.2, V1.3, each with a DOI; an unsaved working state and a stored query (Query 1) raise the question of whether they get DOIs too]
Technical Details
Version Info Table
| Version PID | Dataset ID | Version Name | Version ID | Person ID | Query Hash | Result Hash | Timestamp |
| 23456 | 3525 | Version 1.5 | 3 | …. | …. | …. | |
| 23574 | 3525 | Unsaved | -1 | …. | NULL | …. | |
Step Tracking Table (Child of Version Info)
| Step ID | Version PID | Forward | Backward | Order |
| 983245 | 23574 | DELETE FROM… | INSERT INTO… | 1 |
| 983245 | 23574 | UPDATE SET site=“Winhall”… | UPDATE SET site=“Lye Brook”… | 2 |
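Given the step table above, restoring a version is a replay of the stored backward statements; a sketch with illustrative names and made-up values (the real system runs on MySQL):

```sql
-- Illustrative current table:
CREATE TABLE plot_data (pkid INT PRIMARY KEY, site TEXT, value NUMERIC);

-- Restore: copy the current state, then walk it back by applying the stored
-- backward statements in reverse order:
CREATE TABLE plot_data_v1_5 AS SELECT * FROM plot_data;
UPDATE plot_data_v1_5 SET site = 'Lye Brook' WHERE pkid = 17;  -- undoes step 2
INSERT INTO plot_data_v1_5 (pkid, site, value)
VALUES (18, 'Somerset', 3.2);                                  -- undoes step 1 (a DELETE)

-- The restored version table is garbage collected after a period of time:
DROP TABLE plot_data_v1_5;
```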
Implementation Challenges and
Questions
Challenges
Large updates
Re-creation of past versions, in terms of garbage
collection and storage
Questions
Query uniqueness checking and query normalization
Efficient but effective results hashing strategies
Linear progression of data, versus branching network
Acknowledgments
 Adoption seed funding - MacArthur Foundation and the
Research Data Alliance
 The US Forest Service State and Private Forestry
program for core operational funding of the VMC
 Fran Berman, Yolanda Meleco and the other adopters
who have been sharing their experiences.
 All the VMC cooperators that contribute
Thank you!
ARGO
Justin Buck, Helen Glaves
BODC
WG Data Citation: Adoption
meeting
Argo DOI pilot
Justin Buck, National Oceanography Centre (UK), jbuck@bodc.ac.uk
Thierry Carval, Ifremer (France), thierry.carval@ifremer.fr
Thomas Loubrieu, Ifremer (France), thomas.loubrieu@ifremer.fr
Frederic Merceur, Ifremer (France), frederic.merceur@ifremer.fr
RDA P8 2016, Denver, 16th September 2016
300+ citations per
year
How to cite Argo data at a given point in time?
Possible with a single DOI?
Argo data system (simplified)
Raw
data
Data Assembly Centres (DAC)
(national level, 10 of, RT processing and QC) GTS
(+GTSPP
archive)
Global data assembly Centres (GDAC)
(2 of, mirrored, ftp access,
repository of NetCDF files, no
version control, latest version served)
Data users
Delayed mode
quality control
Key associated milestones
 2014 – Introduction of dynamic data into the DataCite
metadata schema
- https://schema.datacite.org/
- Snapshots were used as an interim solution for Argo
 2015 – RDA recommendations on evolving data:
- https://rd-alliance.org/system/files/RDA-DC-Recommendations_151020.pdf
- Legacy GDAC architecture does not permit full implementation
How to apply a DOI for Argo
[Diagram: three scatter plots of observation time vs. state time, showing the archive as a collection of snapshots/granules: each snapshot is a time slice capturing the state of the growing data set at a given point in time]
Archive (collection of snapshots/granules)
To cite a particular snapshot one can potentially cite a time slice of an archive i.e. the
snapshot at a given point in time.
New single DOI
Argo (2000). Argo float data and metadata from Global Data Assembly
Centre (Argo GDAC). SEANOE. http://doi.org/10.17882/42182
# key used to identify snapshot
http://www.seanoe.org/data/00311/42182/#45420
Step towards RDA recommendation
Archiving snapshots enables R1 and R2 at monthly granularity
R1 – Data Versioning R2 – Timestamping
Argo Pilot effectively uses predetermined referencing of snapshots removing
the need for requirements R3 to R7. # keys are PIDs for the snapshots and
have associated citation texts.
R3 – Query Store Facilities R4 – Query Uniqueness
R5 – Stable Sorting R6 – Result Set Verification
R7 – Query Timestamping R8 – Query PID
R9 – Store Query R10 – Automated Citation Texts
SEANOE landing page architecture means R11 and R12 effectively met
R11 – Landing Page R12 – Machine Actionability
Final two requirements untested at this stage
R13 – Technology Migration R14 – Migration Verification
Summary
 There is now a single DOI for Argo
- Takes account of legacy GDAC architecture
- Monthly temporal granularity
- Enables both reproducible research and simplifies the tracking
of citations
- ‘#’ rather than ‘?’ in identifier takes account of current DOI
resolution architecture
 Extensible to other observing systems such as
OceanSITES and EGO
 The concept allows for different subsets of Argo data
e.g. ocean basins, Bio-Argo data
Progress on Data Citation within
VAMDC
C.M. Zwölf and VAMDC Consortium
carlo-maria.zwolf@obspm.fr
From RDA Data Citation
Recommendations to new paradigms
for citing data from VAMDC
C.M. Zwölf and VAMDC consortium
RDA 8th Plenary - Denver
VAMDC: single and unique access to heterogeneous A+M databases
Federates 29 heterogeneous databases
http://portal.vamdc.org/
The “V” of VAMDC stands for Virtual in the
sense that the e-infrastructure does not
contain data. The infrastructure is a
wrapping for exposing in a unified way a set
of heterogeneous databases.
The consortium is politically organized
around a Memorandum of understanding
(15 international members have signed the
MoU, 1 November 2014)
High quality scientific data come from
different Physical/Chemical Communities
Provides data producers with a large
dissemination platform
Removes the bottleneck between data producers and a wide body of users
The Virtual Atomic and Molecular Data Centre
[Diagram: the VAMDC infrastructure technical architecture: each existing independent A+M database is wrapped by a VAMDC layer to become a VAMDC node; nodes register their resources in a registry layer; a standard layer accepts queries submitted in a standard grammar and formats results into standard XML files (XSAMS); VAMDC clients (Portal, SpecView, SpectCol) ask the registry for available resources, dispatch a unique A+M query to all registered nodes, and receive a set of standard files]
• VAMDC is agnostic about the local data storage strategy on each node.
• Each node implements the access/query/result protocols.
• There is no central management system.
• Decisions about technical evolutions are made by consensus in Consortium.
 It is both technically and politically challenging to implement the WG recommendations.
Let us implement the recommendation!
The problem is more anthropological than technical…
Tagging and versioning data: we see technically how to do that.
- Ok, but what is the data granularity for tagging?
- Naturally it is the dataset (A+M data have no meaning outside this given context)
- But each data provider defines differently what a dataset is
What does it really mean to cite data? Everyone knows what it is! Yes, but everyone has their own definition:
- RDA cites database records or output files (an extracted data file may have an H-factor)
- VAMDC cites all the papers used for compiling the content of a given output file
Let us implement the recommendation!
Tagging versions of data: the implementation will be an overlay to the standard / output layer, thus independent from any specific data node.
Two-layer mechanism:
1  Fine-grained granularity: evolution of the XSAMS output standard for tracking data modifications
2  Coarse-grained granularity: at each data modification to a given data node, the version of the data node changes
With the second mechanism we know that something changed: the result of an identical query may differ from one version to the other. The detail of which data changed is accessible using the first mechanism.
Query store: built over the versioning of data and plugged over the existing VAMDC data-extraction mechanisms. Due to the distributed VAMDC architecture, the query store architecture is similar to a log service.
Data versioning: overview of the fine-grained mechanism
We adopted a change of paradigm (weak structuration):
[Diagram: an output XSAMS file contains radiative processes 1…n, collisional processes 1…m, energy states 1…p and elements 1…q; version tags (Version 1, Version 2, …) are attached to these blocks according to the infrastructure state and updates]
This approach has several advantages:
•It solves the data tagging granularity problem
•It is independent from what is considered a dataset
•The new files are compliant with old libraries & processing programs
• We add a new feature, an overlay to the existing structure
• We induce a structuration without changing the structure (weak structuration)
Technical details described in:
New model for datasets citation and extraction reproducibility in VAMDC,
C.M. Zwölf, N. Moreau, M.-L. Dubernet,
in press, J. Mol. Spectrosc. (2016),
http://dx.doi.org/10.1016/j.jms.2016.04.009
(arXiv version: https://arxiv.org/abs/1606.00405)
Let us focus on the query store. The difficulties we have to cope with:
•Handle a query store in a distributed environment (RDA did not design it for these configurations)
•Integrate the query store with the existing VAMDC infrastructure
The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe:
•Development started during spring 2016
•Final product to be released during 2017
Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers; designing technical solutions for:
•Paper / data linking at paper submission (for authors)
•Paper / data linking at paper display (for readers)
Data extraction procedure
Let us focus on the query store. Sketching the functioning, from the final-user point of view:
• The user submits a query through the VAMDC portal (query interface); the VAMDC infrastructure computes the response
• The portal (result part) gives access to the output data file and to a Digital Unique Identifier (DUI) associated with the current extraction
• The DUI resolves to a landing page backed by the query store, exposing the query metadata:
- The original query
- Date & time when the query was processed
- Version of the infrastructure when the query was processed
- List of publications needed for answering the query
- When supported (by the VAMDC federated DB): retrieval of the output data file as it was computed (query re-execution)
• Users can manage queries (with authorisation/authentication), group arbitrary sets of queries (with related DUIs), assign them a DOI and use that DOI in publications
• Editors may follow the citation pipeline: credit delegation applies
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
VAMDC
Node
VAMDC
Node
Query Store Listener
Service
Query Store Listener
Service
Notification
1  When a node receives a user query, it
notifies to the Listener Service the following
information:
•The identity of the user (optional)
•The used client software
•The identifier of the node receiving the query
•The version (with related timestamp) of the node receiving
the query
•The version of the output standard used by the node for
replying the results
•The query submitted by the user
•The link to the result data.
1  When a node receives a user query, it
notifies to the Listener Service the following
information:
•The identity of the user (optional)
•The used client software
•The identifier of the node receiving the query
•The version (with related timestamp) of the node receiving
the query
•The version of the output standard used by the node for
replying the results
•The query submitted by the user
•The link to the result data.
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
VAMDC
Node
VAMDC
Node
Query Store Listener
Service
Query Store Listener
Service
2  For each received notification,
the listener check if it exists an already
existing query:
•Having the same Query
•Having the same node version
•Submitted to the same node and having the
same node version
2  For each received notification,
the listener check if it exists an already
existing query:
•Having the same Query
•Having the same node version
•Submitted to the same node and having the
same node version
Notification
If no matching query exists, the listener:
• provides the query with a unique time-stamped identifier
• following the link, gets the data and processes them to extract relevant metadata (e.g. the BibTeX of the references used for compiling the output file)
• stores these relevant metadata
• if the identity of the user is provided, associates the user with the query ID
If the query already exists, the listener:
• gets the already existing unique identifier and (incrementally) associates the new query timestamp (and, if provided, the identifier of the user) with the query identifier (both branches are sketched below)
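A minimal sketch of this existing/not-existing branching, assuming an in-memory store and invented identifier formats (the real service persists to a database and also extracts metadata from the result file):

import hashlib
from datetime import datetime, timezone

# Minimal sketch: an in-memory dict stands in for the real query-store
# database; identifier formats and field names are invented.
query_store = {}

def handle_notification(note):
    # A notification matches an existing entry if the query text, the node
    # and the node version are all the same.
    key = hashlib.sha256(
        "|".join((note["query"], note["node_id"], note["node_version"])).encode("utf-8")
    ).hexdigest()
    now = datetime.now(timezone.utc).isoformat()

    record = query_store.get(key)
    if record is None:
        # Not existing: assign a unique time-stamped identifier; the real
        # service would also fetch the result file and extract metadata here.
        record = {"query_id": "q-" + now + "-" + key[:8],
                  "timestamps": [now], "users": []}
        query_store[key] = record
    else:
        # Existing: reuse the identifier and just record the new timestamp.
        record["timestamps"].append(now)

    if note.get("user_id"):
        record["users"].append(note["user_id"])
    return record["query_id"]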
Remark on query uniqueness:
• The query language supported by the VAMDC infrastructure is VSS2 (VAMDC SQL Subset 2, http://vamdc.eu/documents/standards/queryLanguage/vss2.html).
• We are working on a specific VSS2 parser (based on ANTLR) which should identify, among queries expressed in different ways, the ones that are semantically identical (a crude stand-in is sketched below).
• We are designing this analyzer as an independent module, hoping to extend it to all of SQL.
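Pending that parser, even a naive normalization catches trivially re-worded queries. The sketch below is an assumption-laden stand-in, not the ANTLR-based analyzer: it collapses whitespace and letter case and sorts top-level AND-ed conditions before hashing.

import hashlib
import re

def normalize_vss2(query):
    """Crude normalization: collapse whitespace, upper-case, and sort
    top-level AND-ed conditions so trivially reordered queries match.
    A real VSS2/SQL parser would work on the parse tree instead."""
    q = re.sub(r"\s+", " ", query.strip()).upper()
    head, sep, where = q.partition(" WHERE ")
    if sep:
        conds = sorted(c.strip() for c in where.split(" AND "))
        q = head + " WHERE " + " AND ".join(conds)
    return q

def query_checksum(query):
    return hashlib.sha256(normalize_vss2(query).encode("utf-8")).hexdigest()

# These two formulations hash to the same checksum:
assert query_checksum("select * where A=1 and B=2") == \
       query_checksum("SELECT *  WHERE B=2 AND A=1")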
Final remarks:
• Our aims:
• Provide the VAMDC infrastructure with an operational query store
• Share our experience with other data-providers
• Provide data-providers with a set of libraries/tools/methods for an easy
implementation of a query store.
• We will try to build a generic query store (i.e. using generic software
blocks)
UK Riverflow Archive
Matthew Fry
mfry@ceh.ac.uk
UK National River Flow Archive
 Curation and dissemination of regulatory
river flow data for research and other
access
 Data used for significant research
outputs, and a large number of citations
annually
 Updated annually, but the entire flow series is also regularly revised over time (e.g. stations resurveyed)
 Internal auditing, but history is not
exposed to users
UK National River Flow Archive
 ORACLE Relational database
 Time series and metadata tables
 ~20M daily flow records, + monthly / daily catchment
rainfall series
 Metadata (station history, owners, catchment soils /
geology, etc.)
 Total size of ~5GB
 Time series tables automatically audited,
- but reconstruction is complex (a sketch of time-stamped retrieval follows below)
 Users generally download simple files
 But public API is in development / R-NRFA package is out
there
 Fortunately all access is via single codeset
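Given audit tables of that kind, retrieving a series as it stood at a cited timestamp could look roughly like the sketch below. The table and column names (daily_flow_history, valid_from, valid_to) are invented for illustration and sqlite3 stands in for the Oracle database; the actual NRFA schema differs.

import sqlite3  # stand-in for the Oracle database used by the NRFA

def flows_as_of(conn, station_id, as_of):
    """Return one station's daily flow series as it stood at `as_of`,
    assuming each row version carries valid_from/valid_to audit timestamps
    (hypothetical schema, for illustration only)."""
    return conn.execute(
        """
        SELECT day, flow
        FROM daily_flow_history
        WHERE station = ?
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        ORDER BY day  -- stable sorting, cf. R5
        """,
        (station_id, as_of, as_of),
    ).fetchall()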
Our data citation requirements
 Cannot currently cite whole dataset
 Allow citation of a subset of the data, as it was at the time
 Fit with current workflow / update schedule, and
requirements for reproducibility
 Fit with current (file download) and future (API) user
practices
 Resilient to gradual or fundamental changes in
technologies used
 Allow tracking of citations in publications
Options for versioning / citation
 “Regulate” queries:
- limitations on service provided
 Enable any query to be timestamped / cited /
reproducible:
- does not readily allow verification (e.g. checksum) of queries
(R7), or storage of queries (R9)
 Manage which queries can be citable:
- limitation on publishing workflow?
Versioning / citation solution
 Automated archiving of entire database – version controlled scripts
defining tables, creating / populating archived tables (largely
complete)
 Fits in with data workflow – public / dev versions – this only works
because we have irregular / occasional updates
 Simplification of the data model (complete)
 API development (being undertaken independently of dynamic
citation requirements):
- allows subsetting of the dataset in a number of ways – initially simple ones
- need to implement versioning (started) to ensure it will cope with changes to data structures
 Fit to dynamic data citation recommendations?
- Largely
- Need to address a mechanism for users to request / create a citable version of a query (one possible shape is sketched below)
 Resource required: estimated ~2 person months
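One possible shape for that mechanism, loosely following the WG recommendations, is sketched below. All names are illustrative assumptions, not the NRFA implementation; a production version would also return the existing PID when an identical earlier query is found (R8).

import hashlib
from datetime import datetime, timezone

def make_citable(conn, query, params):
    """Hypothetical 'cite this subset' handler following R4-R10 in spirit:
    timestamp the query, checksum the (stably sorted) result set, and
    mint a placeholder PID; persisting to a query store is stubbed out."""
    timestamp = datetime.now(timezone.utc).isoformat()             # R7
    normalized = " ".join(query.split()).upper()                   # R4 (crude)
    rows = conn.execute(query, params).fetchall()  # query must ORDER BY (R5)
    result_checksum = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()  # R6
    pid = "10.xxxx/" + hashlib.sha256(
        (normalized + timestamp).encode("utf-8")).hexdigest()[:12]  # R8, placeholder prefix
    # R9: store query, normalized form, timestamp and checksums here.
    return {"pid": pid, "timestamp": timestamp, "checksum": result_checksum,
            "citation": "UK NRFA subset %s, retrieved %s" % (pid, timestamp)}  # R10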
Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation

  • 1. Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation Andreas Rauber Vienna University of Technology Favoritenstr. 9-11/188 1040 Vienna, Austria rauber@ifs.tuwien.ac.at http://ww.ifs.tuwien.ac.at/~andi
  • 2. Motivation  Research data is fundamental for science - Results are based on data - Data serves as input for workflows and experiments - Data is the source for graphs and visualisations in publications  Data is needed for Reproducibility - Repeat experiments - Verify / compare results  Need to provide specific data set - Service for data repositories  Put data in data repository, Assign PID (DOI, Ark, URI, …) Make is accessible  done? Source: open.wien.gv.at https://commons.wikimedia.org/w/index.php?curid=30978545
  • 3. Identification of Dynamic Data  Usually, datasets have to be static - Fixed set of data, no changes: no corrections to errors, no new data being added  But: (research) data is dynamic - Adding new data, correcting errors, enhancing data quality, … - Changes sometimes highly dynamic, at irregular intervals  Current approaches - Identifying entire data stream, without any versioning - Using “accessed at” date - “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)  Would like to identify precisely the data as it existed at a specific point in time
  • 4. Granularity of Subsets  What about the granularity of data to be identified? - Enormous amounts of CSV data - Researchers use specific subsets of data - Need to identify precisely the subset used  Current approaches - Storing a copy of subset as used in study -> scalability - Citing entire dataset, providing textual description of subset -> imprecise (ambiguity) - Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected)  Would like to be able to identify precisely the subset of (dynamic) data used in a process
  • 5. What we do NOT want…  Common approaches to data management… (from PhD Comics: A Story Told in File Names, 28.5.2010) Source: http://www.phdcomics.com/comics.php?f=1323
  • 6.  Research Data Alliance  WG on Data Citation: Making Dynamic Data Citeable  March 2014 – September 2015 - Concentrating on the problems of large, dynamic (changing) datasets  Final version presented Sep 2015 at P7 in Paris, France  Endorsed September 2016 at P8 in Denver, CO https://www.rd-alliance.org/groups/data-citation-wg.html RDA WG Data Citation
  • 7. RDA WGDC - Solution  We have ‒ Data & some means of access („query“)
  • 8. RDA WGDC - Solution  We have ‒ Data & some means of access („query“)  Make ‒ Data: time-stamped and versioned ‒ Query: timestamped
  • 9. RDA WGDC - Solution  We have ‒ Data & some means of access („query“)  Make ‒ Data: time-stamped and versioned ‒ Query: timestamped  Data Citation: ‒ Store query ‒ Assign the PID to the timestamped query (which, dynamically, leads to the data)  Access: Re-execute query on versioned data according to timestamp  Dynamic Data Citation: Dynamic data & dynamic citation of data
  • 10. Data Citation – Deployment  Researcher uses workbench to identify subset of data  Upon executing selection („download“) user gets − Data (package, access API, …) − PID (e.g. DOI) (Query is time-stamped and stored) − Hash value computed over the data for local storage − Recommended citation text (e.g. BibTeX)  PID resolves to landing page − Provides detailed metadata, link to parent data set, subset,… − Option to retrieve original data OR current version OR changes  Upon activating PID associated with a data citation − Query is re-executed against time-stamped and versioned DB − Results as above are returned  Query store aggregates data usage
  • 11. Data Citation – Deployment  Researcher uses workbench to identify subset of data  Upon executing selection („download“) user gets − Data (package, access API, …) − PID (e.g. DOI) (Query is time-stamped and stored) − Hash value computed over the data for local storage − Recommended citation text (e.g. BibTeX)  PID resolves to landing page − Provides detailed metadata, link to parent data set, subset,… − Option to retrieve original data OR current version OR changes  Upon activating PID associated with a data citation − Query is re-executed against time-stamped and versioned DB − Results as above are returned  Query store aggregates data usage Note: query string provides excellentNote: query string provides excellent provenance information on the data set!provenance information on the data set!
  • 12. Data Citation – Deployment  Researcher uses workbench to identify subset of data  Upon executing selection („download“) user gets − Data (package, access API, …) − PID (e.g. DOI) (Query is time-stamped and stored) − Hash value computed over the data for local storage − Recommended citation text (e.g. BibTeX)  PID resolves to landing page − Provides detailed metadata, link to parent data set, subset,… − Option to retrieve original data OR current version OR changes  Upon activating PID associated with a data citation − Query is re-executed against time-stamped and versioned DB − Results as above are returned  Query store aggregates data usage Note: query string provides excellentNote: query string provides excellent provenance information on the data set!provenance information on the data set! This is an important advantage overThis is an important advantage over traditional approaches relying on, e.g.traditional approaches relying on, e.g. storing a list of identifiers/DB dump!!!storing a list of identifiers/DB dump!!!
  • 13. Data Citation – Deployment  Researcher uses workbench to identify subset of data  Upon executing selection („download“) user gets − Data (package, access API, …) − PID (e.g. DOI) (Query is time-stamped and stored) − Hash value computed over the data for local storage − Recommended citation text (e.g. BibTeX)  PID resolves to landing page − Provides detailed metadata, link to parent data set, subset,… − Option to retrieve original data OR current version OR changes  Upon activating PID associated with a data citation − Query is re-executed against time-stamped and versioned DB − Results as above are returned  Query store aggregates data usage Note: query string provides excellentNote: query string provides excellent provenance information on the data set!provenance information on the data set! This is an important advantage overThis is an important advantage over traditional approaches relying on, e.g.traditional approaches relying on, e.g. storing a list of identifiers/DB dump!!!storing a list of identifiers/DB dump!!! Identify which parts of the data are used.Identify which parts of the data are used. If data changes, identify which queriesIf data changes, identify which queries (studies) are affected(studies) are affected
  • 14. Data Citation – Output  14 Recommendations grouped into 4 phases: - Preparing data and query store - Persistently identifying specific data sets - Resolving PIDs - Upon modifications to the data infrastructure  2-page flyer https:// rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.htm  More detailed report: IEEE TCDL 2016 http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE- TCDL-DC-2016_paper_1.pdf
  • 15. Data Citation – Recommendations Preparing Data & Query Store - R1 – Data Versioning - R2 – Timestamping - R3 – Query Store When Data should be persisted - R4 – Query Uniqueness - R5 – Stable Sorting - R6 – Result Set Verification - R7 – Query Timestamping - R8 – Query PID - R9 – Store Query - R10 – Citation Text When Resolving a PID - R11 – Landing Page - R12 – Machine Actionability Upon Modifications to the Data Infrastructure - R13 – Technology Migration - R14 – Migration Verification
  • 16. Data Citation – Recommendations A) Preparing the Data and the Query Store R1 – Data Versioning: Apply versioning to ensure earlier states of data sets the data can be retrieved R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions, deletions are marked with a timestamp R3 – Query Store: Provide means to store the queries and metadata to re-execute them in the future
  • 17. Data Citation – Recommendations A) Preparing the Data and the Query Store R1 – Data Versioning: Apply versioning to ensure earlier states of data sets the data can be retrieved R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions, deletions are marked with a timestamp R3 – Query Store: Provide means to store the queries and metadata to re-execute them in the future NoteNote:: •R1 & R2R1 & R2 are already pretty much standard inare already pretty much standard in many (RDBMS-) research databasesmany (RDBMS-) research databases •Different ways to implementDifferent ways to implement •A bit more challenging for some data typesA bit more challenging for some data types (XML, LOD, …)(XML, LOD, …) NoteNote:: •R3:R3: query store usually pretty small, even forquery store usually pretty small, even for extremely high query volumesextremely high query volumes
  • 18. Data Citation – Recommendations B) Persistently Identify Specific Data sets (1/2) When a data set should be persisted: R4 – Query Uniqueness: Re-write the query to a normalized form so that identical queries can be detected. Compute a checksum of the normalized query to efficiently detect identical queries R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set R6 – Result Set Verification: Compute fixity information/checksum of the query result set to enable verification of the correctness of a result upon re-execution R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query or the query execution time). This allows retrieving the data as it existed at query time
  • 19. Data Citation – Recommendations B) Persistently Identify Specific Data sets (2/2) When a data set should be persisted: R8 – Query PID: Assign a new PID to the query if either the query is new or if the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID R9 – Store Query: Store query and metadata (e.g. PID, original and normalized query, query & result set checksum, timestamp, superset PID, data set description and other) in the query store R10 – Citation Text: Provide citation text including the PID in the format prevalent in the designated community to lower barrier for citing data.
  • 20. Data Citation – Recommendations C) Resolving PIDs and Retrieving Data R11 – Landing Page: Make the PIDs resolve to a human readable landing page that provides the data (via query re-execution) and metadata, including a link to the superset (PID of the data source) and citation text snippet R12 – Machine Actionability: Provide an API / machine actionable landing page to access metadata and data via query re-execution
  • 21. Data Citation – Recommendations D) Upon Modifications to the Data Infrastructure R13 – Technology Migration: When data is migrated to a new representation (e.g. new database system, a new schema or a completely different technology), migrate also the queries and associated checksums R14 – Migration Verification: Verify successful data and query migration, ensuring that queries can be re-executed correctly
  • 22. Pilot Adopters  Several adopters - Different types of data, different settings, … - CSV & SQL reference implementation (SBA/TUW)  Pilots: - Biomedical BigData Sharing, Electronic Health Records (Center for Biomedical Informatics, Washington Univ. in St. Louis) - Marine Research Data Biological & Chemical Oceanography Data Management Office (BCO-DMO) - Vermont Monitoring Cooperative: Forest Ecosystem Monitoring - ARGO Boy Network, British Oceanographic Data Centre (BODC) - Virtual Atomic and Molecular Data Centre (VAMDC) - UK Riverflow Riverflow Archive, Centre for Ecology and Hydrology
  • 23. WG Data Citation Pilot CBMI @ WUSTL Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta Washington University in St.Luis
  • 26. R1 and R2 Implementation PostgreSQL Extension “temporal_tables” c1 c2 c3 RDC.table sys_period c1 c2 c3 sys_period RDC.hist_table* *stores history of data changes 1 2 3 triggers
  • 27. Return on Investment (ROI) - Estimated  20 hours to complete 1 study  $150/hr (unsubsidized)  $3000 per study  115 research studies per year  14 replication studies
  • 28. Adoption of Data Citation Outcomes by BCO-DMO Cynthia Chandler, Adam Shepherd
  • 29.  An existing repository (http://bco-dmo.org/)  Marine research data curation since 2006  Faced with new challenges, but no new funding  e.g. data publication practices to support citation  Used the outcomes from the RDA Data Citation Working Group to improve data publication and citation services A story of success enabled by RDA https://www.rd-alliance.org/groups/data-citation-wg.html
  • 30. Adoption of Data Citation Outputs  Evaluation - Evaluate recommendations (done December 2015) - Try implementation in existing BCO-DMO architecture (work began 4 April 2016)  Trial - BCO-DMO: R1-11 fit well with current architecture; R12 doable; test as part of DataONE node membership; R13-14 are consistent with Linked Data approach to data publication and sharing  Timeline: - Redesign/protoype completed by 1 June 2016 - New citation recommendation by 1 Sep 2016 - Report out at RDA P8 (Denver, CO) September 2016 - Final report by 1 December 2016
  • 31. Vermont Monitoring Cooperative James Duncan, Jennifer Pontius VMC
  • 32. Ecosystem Monitoring Collaborator Network Data Archive, Access and Integration
  • 33. USERWORKFLOWTO DATE  Modify a dataset  changes tracked  original data table unchanged  Commit to version, assign name  computes result hash (table pkid, col names, first col data) and query hash  updates data table to new state  formalizes version  Restore previous version  creates new version table from current data table state,  walks it back using stored SQL.  Garbage collected after a period of time
  • 34. UK Riverflow Archive Matthew Fry mfry@ceh.ac.uk
  • 35. Global Water Information Interest Group meeting RDA 8th Plenary, 15th September 2016, Denver UK National River Flow Archive • ORACLE Relational database • Time series and metadata tables • ~20M daily flow records, + monthly / daily catchment rainfall series • Metadata (station history, owners, catchment soils / geology, etc.) • Total size of ~5GB • Time series tables automatically audited, • But reconstruction is complex • Users generally download simple files • But public API is in development / R-NRFA package is out there • Fortunately all access is via single codeset
  • 36. Global Water Information Interest Group meeting RDA 8th Plenary, 15th September 2016, Denver Versioning / citation solution • Automated archiving of entire database – version controlled scripts defining tables, creating / populating archived tables (largely complete) • Fits in with data workflow – public / dev versions – this only works because we have irregular / occasional updates • Simplification of the data model (complete) • API development (being undertaken independently of dynamic citation requirements): • allows subsetting of dataset in a number of ways – initially simply • need to implement versioning (started) to ensure will cope with changes to data structures • Fit to dynamic data citation recommendations? • Largely • Need to address mechanism for users to request / create citable version of a query • Resource required: estimated ~2 person months
  • 37. Reference Implementation for CSV Data (and SQL) Stefan Pröll, SBA
  • 38. 38 CSV/SQL Reference Implementation 1  Reference Implementation available on Github https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype  Upload interface -->Upload CSV files  Migrate CSV file into RDBMS −Generate table structure, identify primary key −Add metadata columns for versioning (transparent) −Add indices (transparent)  Dynamic data: upload new version of file −Versioned update / delete existing records in RDBMS  Access interface −Track subset creation −Store queries -> PID + Landing Page Barrymieny
  • 39. 39 CSV/Git Reference Implementation 2  Based on Git only (no SQL database)  Upload CSV files to Git repository  SQL-style queries operate on CSV file via API  Data versioning with Git  Store scripts versioned as well  Make subset creation reproducible
  • 40. 40 CSV/Git Reference Implementation 2  Stefan Pröll, Christoph Meixner, Andreas Rauber Precise Data Identification Services for Long Tail Research Data. Proceedings of the intl. Conference on Preservation of Digital Objects (iPRES2016), Oct. 3-6 2016, Bern, Switzerland.  Source at Github: https://github.com/datascience/ RDA-WGDC-CSV-Data-Citation-Prototype  Videos:  Login: https://youtu.be/EnraIwbQfM0  Upload: https://youtu.be/xJruifX9E2U  Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ  Resolver: https://youtu.be/FHsvjsUMiiY  Update: https://youtu.be/cMZ0xoZHUyI
  • 41. Benefits  Retrieval of precise subset with low storage overhead  Subset as cited or as it is now (including e.g. corrections)  Query provides provenance information  Query store supports analysis of data usage  Checksums support verification  Same principles applicable across all settings - Small and large data - Static and dynamic data - Different data representations (RDBMS, CSV, XML, LOD, …)  Would work also for more sophisticated/general transformations on data beyond select/project
  • 43. Reference Implementation for CSV Data (and SQL) Stefan Pröll, SBA
  • 44.  RDA recommendations implemented in data infrastructures  Required adaptions - Introduce versioning, if not already in place - Capture sub-setting process (queries) - Implement dedicated query store to store queries - A bit of additional functionality (query re-writing, hash functions, …)  Done! ? - “Big data”, database driven - Well-defined interfaces - Trained experts available - “Complex, only for professional research infrastructures” ? Large Scale Research Settings
  • 45. Long Tail Research Data Big data, well organized, often used and cited Less well organized, non-standardised no dedicated infrastructure “Dark data” Amount of data sets Data set size [1] Heidorn, P. Bryan. "Shedding light on the dark data in the long tail of science." Library Trends 57.2 (2008): 280-299.
  • 46. Prototype Implementations  Solution for small-scale data - CSV files, no “expensive” infrastructure, low overhead 2 Reference implementations :  Git based Prototypes: widely used versioning system - A) Using separate folders - B) Using branches  MySQL based Prototype: - C) Migrates CSV data into relational database  Data backend responsible for versioning data sets  Subsets are created with scripts or queries via API or Web Interface  Transparent to user: always CSV
• 47. CSV Reference Implementation 2 Git Implementation 1 Upload CSV files to Git repository (versioning) Subsets created via scripting language (e.g. R)  Select rows/columns, sort, returns CSV + metadata file  Metadata file with script parameters stored in Git  (Scripts stored in Git as well) PID assigned to metadata file  Use Git to retrieve proper data set version and re-execute script on retrieved file
  • 48. Git-based Prototype Git Implementation 2 Addresses issues - common commit history, branching data Using Git branching model: Orphaned branches for queries and data - Keeps commit history clean - Allows merging of data files Web interface for queries (CSV2JDBC) Use commit hash for identification - Assigned PID hashed with SHA1 - Use hash of PID as filename (ensure permissible characters)
  • 49. Git-based Prototype Step 1: Select a CSV file in the repository Step 2: Create a subset with a SQL query (on CSV data) Step 3: Store the query script and metadata Step 4: Re-Execute!  Prototype: https://github.com/Mercynary/recitable
• 50. MySQL-based Prototype MySQL Prototype Data upload - User uploads a CSV file into the system Data migration from CSV file into RDBMS - Generate table structure - Add metadata columns (versioning) - Add indices (performance) Dynamic data - Insert, update and delete records - Events are recorded with a timestamp Subset creation - User selects columns, filters and sorts records in web interface - System traces the selection process - Exports CSV
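The “metadata columns (versioning)” on this slide can be pictured as validity-interval columns added to each migrated CSV table. A sketch of the idea in MySQL-flavoured SQL; the column names and statements are assumptions for illustration, not the prototype's actual schema:

```sql
-- Hypothetical versioning columns on a table migrated from a CSV file.
ALTER TABLE csv_data
    ADD COLUMN valid_from TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    ADD COLUMN valid_to   TIMESTAMP NULL;   -- NULL while the record is current

-- A "delete" only closes the validity interval instead of removing the row.
UPDATE csv_data
SET valid_to = NOW()
WHERE record_id = 42 AND valid_to IS NULL;

-- Re-create a subset as it existed at a stored query timestamp @ts.
SET @ts = '2016-09-16 00:00:00';
SELECT *
FROM csv_data
WHERE valid_from <= @ts
  AND (valid_to IS NULL OR valid_to > @ts);
```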
  • 51. MySQL-based Prototype  Source at Github: ‒ https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype  Videos: ‒ Login: https://youtu.be/EnraIwbQfM0 ‒ Upload: https://youtu.be/xJruifX9E2U ‒ Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ ‒ Resolver: https://youtu.be/FHsvjsUMiiY ‒ Update: https://youtu.be/cMZ0xoZHUyI
  • 52. CSV Reference Implementation 2  Stefan Pröll, Christoph Meixner, Andreas Rauber Precise Data Identification Services for Long Tail Research Data. Proceedings of the intl. Conference on Preservation of Digital Objects (iPRES2016), Oct. 3-6 2016, Bern, Switzerland.  Source at Github: https://github.com/datascience/ RDA-WGDC-CSV-Data-Citation-Prototype  Videos:  Login: https://youtu.be/EnraIwbQfM0  Upload: https://youtu.be/xJruifX9E2U  Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ  Resolver: https://youtu.be/FHsvjsUMiiY  Update: https://youtu.be/cMZ0xoZHUyI
• 53. WG Data Citation Pilot CBMI @ WUSTL Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta Washington University in St. Louis
  • 54. Moving Biomedical Big Data Sharing Forward An adoption of the RDA Data Citation of Evolving Data Recommendation to Electronic Health Records Leslie McIntosh, PHD, MPH RDA P8 Cynthia Hudson Vitale, MA Denver, USA September 2016 @mcintold @cynhudson
  • 56.  Director, Center for Biomedical Informatics  Data Services Coordinator  Leslie McIntosh, PHD, MPH   Cynthia Hudson Vitale, MA 
• 57. BDaaS – Biomedical Data as a Service (diagram: Researchers, Data Broker, i2b2 Application, Biomedical Data Repository)
• 58. Move some of the responsibility for reproducibility from the biomedical researcher into the biomedical pipeline
  • 60. Biomedical Adoption Project Goals Implement RDA Data Citation WG recommendation to local Washington U i2b2 Engage other i2b2 community adoptees Contribute source code back to i2b2 community
  • 61. RDA Data Citation WG Recommendations R1: Data Versioning R2: Data Timestamping R3, R9: Query Store R7: Query Timestamping R8: Query PID R10: Query Citation
  • 62. Internal Implementation Requirements  Scalable  Available for PostgreSQL  Actively supported  Easy to maintain  Easy for data brokers to use
• 64. R1 and R2 Implementation PostgreSQL Extension “temporal_tables” (diagram: the live table RDC.table (columns c1, c2, c3) gains a sys_period column; triggers copy changed rows into RDC.hist_table, which has the same columns plus sys_period and stores the history of data changes)
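Based on this slide and the editor's note further below, a minimal PostgreSQL sketch of the temporal_tables setup; the schema and table names are assumptions:

```sql
-- Sketch of R1/R2 with the temporal_tables extension (names assumed).
CREATE EXTENSION IF NOT EXISTS temporal_tables;
CREATE SCHEMA IF NOT EXISTS rdc;

CREATE TABLE rdc.observation (
    id         integer PRIMARY KEY,
    c1         text,
    c2         text,
    c3         text,
    sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, NULL)
);

-- History table: same structure as the live table, no primary key.
CREATE TABLE rdc.hist_observation (LIKE rdc.observation);

-- The extension's trigger copies the outgoing row version into the
-- history table, closing its validity range, on every change.
CREATE TRIGGER versioning_trigger
BEFORE INSERT OR UPDATE OR DELETE ON rdc.observation
FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'rdc.hist_observation', true);
```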
• 65. ETL Incrementals (diagram: for an update from the source data, the new values land in RDC.table with sys_period [2016-9-9 00:00, NULL) while the old row moves to RDC.hist_table with its range closed as [2016-9-8 00:00, 2016-9-9 00:00); for an insert, the row simply enters RDC.table with sys_period [2016-9-9 00:00, NULL))
• 66. R3, R7, R8, R9, and R10 Implementation PostgreSQL Extension “temporal_tables” (diagram: RDC.table and RDC.hist_table are combined in the view RDC.table_with_history; functions, triggers and query audit tables record each query, its metadata and its PID)
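Continuing the assumed names from the sketch above, the “_with_history” view and an as-of re-execution might look like this:

```sql
-- Union of live and historical rows; every row carries its validity range.
CREATE VIEW rdc.observation_with_history AS
    SELECT * FROM rdc.observation
    UNION ALL
    SELECT * FROM rdc.hist_observation;

-- Re-execute a stored query as the data existed at its stored timestamp (R7):
SELECT id, c1, c2, c3
FROM rdc.observation_with_history
WHERE sys_period @> TIMESTAMPTZ '2016-09-09 00:00:00';
```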
  • 68. Bonus Feature: Determine if Change Occurred
  • 69. Future Developments  Develop a process for sharing Query PID with researchers in an automated way  Resolve Query PIDs to a landing page with Query metadata  Implement research reproducibility requirements in other systems as possible
  • 71. Obtained Outcomes  Implemented WG recommendations  Engaged with other i2b2 adoptees (Harvard, Nationwide Children’s Hospital)
• 72. Dissemination  Poster presentation (Harvard U, July 2016)  Scientific manuscript based on our proof of concept submitted to the AMIA TBI/CRI 2017 conference  Sharing the code with the community
• 73. Return on Investment (ROI) - Estimated  20 hours to complete 1 study  $150/hr (unsubsidized)  $3,000 per study  115 research studies per year  ~14 replication studies to recoup the investment
• 74. Funding Support  MacArthur Foundation 2016 Adoption Seeds program, through a sub-contract with the Research Data Alliance  Washington University Institute of Clinical and Translational Sciences  NIH CTSA Grant Number UL1TR000448 and UL1TR000448-09S1  Siteman Cancer Center at Washington University  NIH/NCI Grant P30 CA091842-14
  • 75. Center for Biomedical Informatics @WUSTL Teams for Reproducible Research
  • 76. WashU CBMI Research Reproducibility Resources
  • 77. Adoption of Data Citation Outcomes by BCO-DMO Cynthia Chandler, Adam Shepherd
  • 78.  An existing repository (http://bco-dmo.org/)  Marine research data curation since 2006  Faced with new challenges, but no new funding  e.g. data publication practices to support citation  Used the outcomes from the RDA Data Citation Working Group to improve data publication and citation services A story of success enabled by RDA https://www.rd-alliance.org/groups/data-citation-wg.html
• 79.  BCO-DMO is a thematic, domain-specific repository funded by NSF Ocean Sciences and Polar Programs  BCO-DMO curated data are - Served: http://bco-dmo.org (URLs, URIs) - Published: at an Institutional Repository (CrossRef DOI) http://dx.doi.org/10.1575/1912/4847 - Archived: at NCEI, a US National Data Center http://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0078575 BCO-DMO Curated Data for Linked Data URI: http://lod.bco-dmo.org/id/dataset/3046
  • 80. BCO-DMO Dataset Landing Page (Mar ‘16)
  • 81. Initial Architecture Design Considerations (Jan 2016)
  • 83. BCO-DMO Data Publication System Components BCO-DMO publishes data to WHOAS and a DOI is assigned. The BCO-DMO architecture now supports data versioning.
  • 84. BCO-DMO Data Citation System Components
  • 85. BCO-DMO Data Set Landing Page
  • 87. BCO-DMO Data Set Landing Page
  • 89. New Capabilities … BCO-DMO becoming a DataONE Member Node https://search.dataone.org/
  • 90. New Capabilities … BCO-DMO Data Set Citation
• 91.  To the Data Citation Working Group for their efforts https://www.rd-alliance.org/groups/data-citation-wg.html  RDA US for funding this adoption project  TIMELINE: - Redesign/prototype completed by 1 June 2016 - New citation recommendation by 1 Sep 2016 - Report out at RDA P8 (Denver, CO) September 2016 - Final report by 1 December 2016 Cyndy Chandler @cynDC42 @bcodmo ORCID: 0000-0003-2129-1647 cchandler@whoi.edu Thank you …
• 92. EXTRA SLIDES  Removed these to reduce talk to 10-15 minutes
• 93. Adoption of Data Citation Outputs  Evaluation - Evaluate recommendations (done December 2015) - Try implementation in existing BCO-DMO architecture (work began 4 April 2016)  Trial - BCO-DMO: R1-11 fit well with current architecture; R12 doable; test as part of DataONE node membership; R13-14 are consistent with Linked Data approach to data publication and sharing NOTE: adoption grant received from RDA US (April 2016)
• 94. RDA Data Citation (DC) of evolving data  DC goals: to create identification mechanisms that: - allow us to identify and cite arbitrary views of data, from a single record to an entire data set, in a precise, machine-actionable manner - allow us to cite and retrieve that data as it existed at a certain point in time, whether the database is static or highly dynamic  DC outcomes: 14 recommendations and associated documentation - ensuring that data are stored in a versioned and timestamped manner - identifying data sets by storing and assigning persistent identifiers (PIDs) to timestamped queries that can be re-executed against the timestamped data store
  • 95. RDA Data Citation WG Recommendations »» Data Versioning: For retrieving earlier states of datasets the data need to be versioned. Markers shall indicate inserts, updates and deletes of data in the database. »» Data Timestamping: Ensure that operations on data are timestamped, i.e. any additions, deletions are marked with a timestamp. »» Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page. Oct 2015 version w/ 14 recommendations DC WG chairs: Andreas Rauber, Ari Asmi, Dieter van Uytvanck
  • 96. procedure: when a BCO-DMO data set is updated … A copy of the previous version is preserved Request a DOI for the new version of data Publish data, and create new landing page for new version of data, with new DOI assigned BCO-DMO database has links to all versions of the data (archived and published) Both archive and published dataset landing pages have links back to best version of full dataset at BCO-DMO BCO-DMO data set landing page displays links to all archived and published versions New capability (implemented)
  • 97.  Extended description of recommendations  Altman and Crosas. 2013. “Evolution of Data Citation …”  CODATA-ICSTI 2013. “Out of cite, out of mind”  FORCE11 https://www.force11.org/about/mission-and-guiding-principles  R. E. Duerr, et al. “On the utility of identification schemes for digital earth science data”, ESI, 2011. REFERENCES
  • 98. Vermont Monitoring Cooperative James Duncan, Jennifer Pontius VMC
  • 99. Implementation of Dynamic Data Citation James Duncan and Jennifer Pontius 9/16/2016 james.duncan@uvm.edu, www.uvm.edu/vmc
  • 100. Ecosystem Monitoring Collaborator Network Data Archive, Access and Integration
  • 101. Many Disciplines, Many Contributors VMC houses any data related to forest ecosystem condition, regardless of affiliation or discipline
  • 102. Why We Need It  Continually evolving datasets  Some errors not caught till next field season  Frequent reporting and publishing
  • 103. Dynamic Data Citation – Features Needed  Light footprint on database resources  Works on top of existing catalog and metadata  Works in an institutionally managed PHP/MySQL environment  User-driven control of what quantity of change constitutes a version  Integration with management portal  Track granular changes in data
  • 104. User Workflow to Date  Modify a dataset - changes tracked - original data table unchanged  Commit to version, assign name - computes result hash (table pkid, col names, first col data) and query hash - updates data table to new state - formalizes version  Restore previous version - creates new version table from current data table state, - walks it back using stored SQL. - Garbage collected after a period of time
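A rough MySQL illustration of the result-hash step just described (the slide says the hash covers the table pkid, column names, and first-column data; the exact recipe is not given, so everything below, including the table and column names, is assumed):

```sql
-- Hypothetical result hash: an order-stable digest over primary keys,
-- the hashed column names, and the first data column.
SELECT MD5(CONCAT(
           'pkid,site',                                  -- column names in the hash
           GROUP_CONCAT(CONCAT(pkid, ':', site)
                        ORDER BY pkid SEPARATOR '|')     -- stable row ordering
       )) AS result_hash
FROM monitoring_data;
-- Note: group_concat_max_len must be raised for large tables.
```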
• 105. User Workflow, still to come  Web-based data editing validation  DOI minting integration  Public display  Subsetting workflow  Other methods of data modification?  Upgrade to rest of system (diagram: versions V1.1, V1.2, V1.3 each with a DOI, an unsaved working state, and a query whose DOI is pending)
• 106. Technical Details Version Info Table – columns: Version PID, Dataset ID, Version Name, Version ID, Person ID, Query Hash, Result Hash, Timestamp. Sample rows: (23456, 3525, “Version 1.5”, 3, …) and (23574, 3525, “Unsaved”, -1, …, NULL, …). Step Tracking Table (child of Version Info) – columns: Step ID, Version PID, Forward, Backward, Order. Sample rows: (983245, 23574, “DELETE FROM…”, “INSERT INTO…”, 1) and (983245, 23574, “UPDATE SET site=“Winhall”…”, “UPDATE SET site=“Lye Brook”…”, 2).
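Read as DDL, the two tables above might be declared roughly as follows (types are inferred from the sample rows, so treat them as assumptions):

```sql
CREATE TABLE version_info (
    version_pid  BIGINT PRIMARY KEY,   -- e.g. 23456
    dataset_id   BIGINT NOT NULL,      -- e.g. 3525
    version_name VARCHAR(255),         -- e.g. 'Version 1.5', 'Unsaved'
    version_id   INT,                  -- -1 marks the unsaved working version
    person_id    BIGINT,
    query_hash   CHAR(32),
    result_hash  CHAR(32),             -- NULL until the version is committed
    committed_at TIMESTAMP NULL
);

CREATE TABLE step_tracking (
    step_id      BIGINT,
    version_pid  BIGINT,
    forward_sql  TEXT,                 -- e.g. DELETE FROM … / UPDATE … SET site='Winhall'
    backward_sql TEXT,                 -- e.g. INSERT INTO … / UPDATE … SET site='Lye Brook'
    step_order   INT,
    FOREIGN KEY (version_pid) REFERENCES version_info (version_pid)
);
```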
  • 107. Implementation Challenges and Questions Challenges Large updates Re-creation of past versions, in terms of garbage collection and storage Questions Query uniqueness checking and query normalization Efficient but effective results hashing strategies Linear progression of data, versus branching network
  • 108. Acknowledgments  Adoption seed funding - MacArthur Foundation and the Research Data Alliance  The US Forest Service State and Private Forestry program for core operational funding of the VMC  Fran Berman, Yolanda Meleco and the other adopters who have been sharing their experiences.  All the VMC cooperators that contribute
  • 110. ARGO Justin Buck, Helen Glaves BODC
• 111. WG Data Citation: Adoption meeting Argo DOI pilot Justin Buck, National Oceanography Centre (UK), juck@bodc.ac.uk Thierry Carval, Ifremer (France), thierry.carval@ifremer.fr Thomas Loubrieu, Ifremer (France), thomas.loubrieu@ifremer.fr Frederic Merceur, Ifremer (France), frederic.merceur@ifremer.fr RDA P8 2016, Denver, 16th September 2016
  • 112. 300+ citations per year How to cite Argo data at a given point in time? Possible with a single DOI?
• 113. Argo data system (simplified): raw data flows to the national Data Assembly Centres (DACs; 10 of them, real-time processing and QC), which feed both the GTS (+GTSPP archive) and the two mirrored Global Data Assembly Centres (GDACs; ftp access, repository of NetCDF files, no version control, latest version served), from which data users retrieve data; delayed-mode quality control feeds corrections back in.
• 114. Key associated milestones  2014 – Introduction of dynamic data into the DataCite metadata schema - https://schema.datacite.org/ - Snapshots were used as an interim solution for Argo  2015 – RDA recommendations on evolving data: - https://rd-alliance.org/system/files/RDA-DC-Recommendations_151020.pdf - Legacy GDAC architecture does not permit full implementation
• 115. How to apply a DOI for Argo (figure: three snapshots of observations plotted as observation time vs. state time) Archive (collection of snapshots/granules). To cite a particular snapshot one can potentially cite a time slice of an archive, i.e. the snapshot at a given point in time.
  • 116. New single DOI Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. http://doi.org/10.17882/42182
  • 117. # key used to identify snapshot
  • 118. # key used to identify snapshot http://www.seanoe.org/data/00311/42182/#45420
  • 119. Step towards RDA recommendation Archive snapshots enables R1 and R2 at monthly granularity R1 – Data Versioning R2 – Timestamping Argo Pilot effectively uses predetermined referencing of snapshots removing the need for requirements R3 to R7. # keys are PIDs for the snapshots and have associated citation texts. R3 – Query Store Facilities R4 – Query Uniqueness R5 – Stable Sorting R6 – Result Set Verification R7 – Query Timestamping R8 – Query PID R9 – Store Query R10 – Automated Citation Texts SEANOE landing page architecture means R11 and R12 effectively met R11 – Landing Page R12 – Machine Actionability Final two requirements untested at this stage R13 – Technology Migration R14 – Migration Verification
  • 120. Summary  There is now a single DOI for Argo - Takes account of legacy GDAC architecture - Monthly temporal granularity - Enables both reproducible research and simplifies the tracking of citations - ‘#’ rather than ‘?’ in identifier takes account of current DOI resolution architecture  Extensible to other observing systems such as OceanSITES and EGO  The concept allows for different subsets of Argo data e.g. ocean basins, Bio-Argo data
  • 121. Progress on Data Citation within VAMDC C.M. Zwölf and VAMDC Consortium carlo-maria.zwolf@obspm.fr
  • 122. From RDA Data Citation Recommendations to new paradigms for citing data from VAMDC C.M. Zwölf and VAMDC consortium RDA 8th Plenary - Denver
• 123. VAMDC: single and unique access to heterogeneous A+M databases. Federates 29 heterogeneous databases http://portal.vamdc.org/ The “V” of VAMDC stands for Virtual in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapping for exposing in a unified way a set of heterogeneous databases. The consortium is politically organized around a Memorandum of Understanding (15 international members have signed the MoU, 1 November 2014) High-quality scientific data come from different physical/chemical communities Provides data producers with a large dissemination platform Removes the bottleneck between data producers and a wide body of users The Virtual Atomic and Molecular Data Centre
• 125. The VAMDC infrastructure technical architecture: an existing independent A+M database is wrapped by the VAMDC wrapping layer to become a VAMDC Node, registered as a resource in the Registry Layer. The Standard Layer lets each node accept queries submitted in the standard grammar and return results formatted as a standard XML file (XSAMS).
• 126. The VAMDC infrastructure technical architecture (continued): VAMDC clients (Portal, SpecView, SpectCol) ask the Registry Layer for available resources and dispatch a unique A+M query to all registered nodes; each node accepts the query in the standard grammar and replies with results formatted as standard XSAMS files, so the user receives a set of standard files.
• 127. The VAMDC infrastructure technical architecture (remarks): • VAMDC is agnostic about the local data storage strategy on each node. • Each node implements the access/query/result protocols. • There is no central management system. • Decisions about technical evolutions are made by consensus in the Consortium.  It is both technically and politically challenging to implement the WG recommendations.
• 128. Let us implement the recommendation!! Tagging and versioning data The problem is more anthropological than technical… What does “data citation” really mean?
• 130. Let us implement the recommendation!! Tagging and versioning data: we see technically how to do that. Ok, but what is the data granularity for tagging? Naturally it is the dataset (A+M data have no meaning outside this given context) – but each data provider defines differently what a dataset is. The problem is more anthropological than technical… What does “data citation” really mean?
• 131. Let us implement the recommendation!! Tagging and versioning data: we see technically how to do that. Ok, but what is the data granularity for tagging? Naturally it is the dataset (A+M data have no meaning outside this given context) – but each data provider defines differently what a dataset is. What does “data citation” really mean? Everyone knows what it is! Yes, but everyone has their own definition: RDA cites database records or output files (an extracted data file may have an H-factor); VAMDC cites all the papers used for compiling the content of a given output file.
• 132. Let us implement the recommendation!! Tagging versions of data: the implementation will be an overlay to the standard / output layer, thus independent from any specific data node. Two-layer mechanism: 1  Fine-grained granularity: evolution of the XSAMS output standard for tracking data modifications. 2  Coarse-grained granularity: at each data modification to a given data node, the version of the data node changes. With the second mechanism we know that something changed: in other words, we know that the result of an identical query may be different from one version to the other. The detail of which data changed is accessible using the first mechanism.
• 133. Let us implement the recommendation!! Tagging versions of data: as above, an overlay to the standard / output layer, independent from any specific data node, using the same two-layer mechanism. Query Store: is built over the versioning of data; is plugged over the existing VAMDC data-extraction mechanisms; due to the distributed VAMDC architecture, the Query Store architecture is similar to a log service.
• 134. Data-Versioning: overview of the fine-grained mechanisms. We adopted a change of paradigm (weak structuration). (figure: an output XSAMS file composed of blocks – radiative processes 1…n, collisional processes 1…m, energy states 1…p, elements 1…q)
• 135. Data-Versioning: overview of the fine-grained mechanisms. We adopted a change of paradigm (weak structuration). (figure: the same XSAMS blocks, now tagged with Version 1 and Version 2 markers according to infrastructure state & updates)
• 137. Data-Versioning: overview of the fine-grained mechanisms. We adopted a change of paradigm (weak structuration). This approach has several advantages: • It solves the data tagging granularity problem • It is independent from what is considered a dataset • The new files are compliant with old libraries & processing programs • We add a new feature, an overlay to the existing structure • We induce a structuration, without changing the structure (weak structuration)
• 138. Data-Versioning: overview of the fine-grained mechanisms (continued). Technical details are described in: New model for datasets citation and extraction reproducibility in VAMDC, C.M. Zwölf, N. Moreau, M.-L. Dubernet, in press, J. Mol. Spectrosc. (2016), http://dx.doi.org/10.1016/j.jms.2016.04.009 Arxiv version: https://arxiv.org/abs/1606.00405
• 139. Let us focus on the query store: The difficulty we have to cope with: • Handle a query store in a distributed environment (RDA did not design it for these configurations). • Integrate the query store with the existing VAMDC infrastructure.
• 140. Let us focus on the query store (continued): The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe. • Development started during spring 2016. • Final product released during 2017.
• 141. Let us focus on the query store (continued): Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers. Designing technical solutions for: • Paper / data linking at paper submission (for authors) • Paper / data linking at paper display (for readers)
• 142. Let us focus on the query store – sketching the functioning, from the final-user point of view: the user submits a query through the VAMDC portal (query interface) to the VAMDC infrastructure; the portal (result part) returns the computed response, giving access to the output data file together with a Digital Unique Identifier (DUI) associated with the current extraction.
• 143. Let us focus on the query store – from the final-user point of view (continued): the Digital Unique Identifier resolves to a landing page backed by the Query Store, which exposes the query metadata: the original query; the date & time the query was processed; the version of the infrastructure when the query was processed; the list of publications needed for answering the query; and, when supported by the VAMDC federated DB, retrieval of the output data file as it was computed (query re-execution).
• 144. Let us focus on the query store – from the final-user point of view (continued): users can manage their queries (with authorisation/authentication), group an arbitrary set of queries (with the related DUIs), and assign the group a DOI to use in publications.
• 145. Let us focus on the query store – from the final-user point of view (continued): editors may follow the citation pipeline, so credit delegation applies.
• 146. Let us focus on the query store – sketching the functioning, technical internal point of view: 1  When a VAMDC node receives a user query, it sends a notification to the Query Store Listener Service with the following information: • the identity of the user (optional) • the client software used • the identifier of the node receiving the query • the version (with related timestamp) of the node receiving the query • the version of the output standard used by the node for returning the results • the query submitted by the user • the link to the result data.
• 147. Let us focus on the query store – technical internal point of view (continued): 2  For each received notification, the Listener Service checks whether a matching query already exists: • having the same query text • submitted to the same node • with the same node version.
• 148. Let us focus on the query store – technical internal point of view (continued): If no matching query exists, the Listener Service • provides the query with a unique time-stamped identifier • follows the link, gets the data and processes them to extract relevant metadata (e.g. the BibTeX of the references used for compiling the output file) • stores these relevant metadata • and, if the identity of the user is provided, associates the user with the query ID. If a matching query already exists, it retrieves the existing unique identifier and (incrementally) associates the new query timestamp (and, if provided, the identifier of the user) with that query identifier.
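The duplicate check in step 2 amounts to a simple lookup; a sketch assuming a hypothetical stored_queries table on the listener side:

```sql
-- Hypothetical duplicate detection for an incoming notification.
SELECT query_id
FROM stored_queries
WHERE query_text   = :query_text
  AND node_id      = :node_id
  AND node_version = :node_version;

-- If a row is returned, only a new (timestamp, user) entry is appended for
-- that query_id; otherwise a new row with a fresh time-stamped identifier
-- is inserted.
```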
• 149. Let us focus on the query store – remark on query uniqueness: • The query language supported by the VAMDC infrastructure is VSS2 (VAMDC SQL Subset 2, http://vamdc.eu/documents/standards/queryLanguage/vss2.html). • We are working on a specific VSS2 parser (based on Antlr) which should identify, among queries expressed in different ways, the ones that are semantically identical. • We are designing this analyzer as an independent module, hoping to extend it to all of SQL.
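For illustration, two textually different but semantically identical queries of the kind such a parser must unify (plain SQL here, not actual VSS2 syntax; the table and column names are invented):

```sql
-- These two queries should map to the same canonical form:
SELECT * FROM lines WHERE atom_symbol = 'Fe' AND lower_energy >= 1.0;
SELECT * FROM lines WHERE lower_energy >= 1.0 AND atom_symbol = 'Fe';
-- A normalizer might fold case, strip redundant whitespace, and sort
-- commutative predicates before hashing the canonical text.
```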
  • 150. Final remarks: • Our aims: • Provide the VAMDC infrastructure with an operational query store • Share our experience with other data-providers • Provide data-providers with a set of libraries/tools/methods for an easy implementation of a query store. • We will try to build a generic query store (i.e. using generic software blocks)
  • 151. UK Riverflow Archive Matthew Fry mfry@ceh.ac.uk
  • 152. UK National River Flow Archive  Curation and dissemination of regulatory river flow data for research and other access  Data used for significant research outputs, and a large number of citations annually  Updated annually but also regular revision of entire flow series through time (e.g. stations resurveyed)  Internal auditing, but history is not exposed to users
• 153. UK National River Flow Archive  Oracle relational database  Time series and metadata tables  ~20M daily flow records, + monthly / daily catchment rainfall series  Metadata (station history, owners, catchment soils / geology, etc.)  Total size of ~5GB  Time series tables automatically audited - But reconstruction is complex  Users generally download simple files  But a public API is in development / the R-NRFA package is out there  Fortunately all access is via a single codeset
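For comparison only: since the archive runs on Oracle, one generic way to read a table as of a past instant is Oracle's Flashback Query. This is not the NRFA's actual mechanism (the slides describe custom audited tables and archiving scripts); the table, column names, and station number below are illustrative assumptions:

```sql
-- Oracle Flashback Query: read a table as it stood at a given instant.
SELECT station_id, flow_date, daily_flow
FROM daily_flows
     AS OF TIMESTAMP TO_TIMESTAMP('2016-09-16 00:00:00', 'YYYY-MM-DD HH24:MI:SS')
WHERE station_id = 39001;
```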
• 154. Our data citation requirements  Cannot currently cite the whole dataset  Allow citation of a subset of the data, as it was at the time  Fit with current workflow / update schedule, and requirements for reproducibility  Fit with current (file download) and future (API) user practices  Resilient to gradual or fundamental changes in technologies used  Allow tracking of citations in publications
  • 155. Options for versioning / citation  “Regulate” queries: - limitations on service provided  Enable any query to be timestamped / cited / reproducible: - does not readily allow verification (e.g. checksum) of queries (R7), or storage of queries (R9)  Manage which queries can be citable: - limitation on publishing workflow?
• 156. Versioning / citation solution  Automated archiving of entire database – version-controlled scripts defining tables, creating / populating archived tables (largely complete)  Fits in with data workflow – public / dev versions – this only works because we have irregular / occasional updates  Simplification of the data model (complete)  API development (being undertaken independently of dynamic citation requirements): - allows subsetting of the dataset in a number of ways – initially simple - need to implement versioning (started) to ensure it will cope with changes to data structures  Fit to dynamic data citation recommendations? - Largely - Need to address a mechanism for users to request / create a citable version of a query  Resource required: estimated ~2 person-months

Editor's Notes

  1. Specifically, through this project, we set out to: Implement the RDA Data Citation WG Rec Engage with other i2b2 adoptees in this conversation on reproducibility Contribute any code back to the i2b2 community
  2. In our proof of concept for RDC for timestamping and versioning we used temporal_tables extension for Postgres http://pgxn.org/dist/temporal_tables/ Picture copyright http://pgxn.org Each live data table that needs logging was given an additional column named “sys_period” which holds a timestamp range value identifying the start timestamp, end timestamp, and boundary condition of when the record is or was valid.  The name “sys_period” was chosen to be consistent with the temporal_tables documentation. A second copy of each live data table was created to hold historical data.  These history tables have the same structure as the live data tables, but without the same primary keys and are used for logging each data change. For consistency all history tables were given the same name as the table being logged, but prefaced with “hist_”.   The trigger provided by the temporal_tables extension was added to each live data table to be called before insert, update, or delete transactions.  The trigger inserts a copy of the existing row from the live data table into the historical table with the timestamp range modified to show the current date as the upper bound of the valid range.
3. There are different ways to calculate the financial ROI. For one, it is to determine the hours of time we potentially save (or do not waste) trying to replicate a study. Given our calculations, it will take 14 replication studies to recoup the investment.
4. Millions of dollars are being put into data sharing and brokerage using clinical data. Clinical decisions are being made based upon these (and other) data. Hence, it is important to make this process rigorous. But why do I care?
  5. Center for Biomedical Informatics @WUSTL
6. How do we facilitate moving some of the responsibility for reproducibility from the biomedical researcher into the biomedical pipeline?
  7. We saw the RDA/MacArthur grant as an opportunity to move forward with our vision.
  8. Specifically, through this project, we set out to: Implement the RDA Data Citation WG Rec Engage with other i2b2 adoptees in this conversation on reproducibility Contribute any code back to the i2b2 community
9. Data Versioning: For retrieving earlier states of datasets, the data needs to be versioned. Markers shall indicate inserts, updates, and deletes of data in the database. Data Timestamping: Ensure that operations are timestamped, i.e. any additions, deletions are marked with a timestamp. Query Store: Provide means for storing queries and the associated metadata in order to re-execute them in the future. Query Timestamping: Assign a timestamp to the query based on the last update to the entire database. Query PID: The data shall be identified via a persistent identifier (PID) pointing to a timestamped query, resolving to a landing page. Query Citation: Provide a recommended citation text and the PID to the user.
  10. Must be able to handle large volume of data efficiently and relatively fast, because versioning requires storing history of changes to data, as well as scanning and updating large amounts of data Must be implementable for PostgreSQL, because this is the database system at CBMI for EHR Software tools used must have regular updates and maintenance support from community or reliable commercial source. Otherwise, the software tools used should be understood by in-house developers, so that support and updates could be provided by CBMI team, if no other source is available. Should be relatively easy to implement to allow for easy maintenance and upkeep. Should be relatively easy to use for data brokers, as they would need to run queries and access query audit.
  11. In our proof of concept for RDC for timestamping and versioning we used temporal_tables extension for Postgres http://pgxn.org/dist/temporal_tables/ Picture copyright http://pgxn.org Each live data table that needs logging was given an additional column named “sys_period” which holds a timestamp range value identifying the start timestamp, end timestamp, and boundary condition of when the record is or was valid.  The name “sys_period” was chosen to be consistent with the temporal_tables documentation. A second copy of each live data table was created to hold historical data.  These history tables have the same structure as the live data tables, but without the same primary keys and are used for logging each data change. For consistency all history tables were given the same name as the table being logged, but prefaced with “hist_”.   The trigger provided by the temporal_tables extension was added to each live data table to be called before insert, update, or delete transactions.  The trigger inserts a copy of the existing row from the live data table into the historical table with the timestamp range modified to show the current date as the upper bound of the valid range.
  12. During the Incremental ETL from Clinical Data Source, our ETL identifies whether a row of data is an update or insert to existing RDC data and if it’s an update, the data in RDC.table gets updated to the new values. However, the triggers we set up before each update or delete (if any) will first copy the original field values to the RDC.hist_table, and only then allow the update of data in RDC.table. Sys_period for live data tables will have the start of date range during which the record is valid and NULL as the end_date Inside hist_table the sys_period will have both start and end date of record validity.
13. In order to implement the RDA recommendations in RDC for: Storing query and query metadata, Query Timestamping, storing Query PID (persistent identifier) and suggested Query Citation, we did the following: In addition to using the temporal_tables extension we also created a view as a union between each live data table and its corresponding history table so that data could be queried as it was at any given time.  These views were named with a name matching the live table with the addition of the suffix “_with_history”. Basically the view would have the data currently valid and all the historical versions of the same data along with the sys_period ranges of when the versions were valid. Several functions and triggers were set up to allow for storing user data and query data into query audit tables, at the time when data brokers would be querying the views.
  14. Today The Data Broker runs a query on the special View (a view over union of live data and hist_table). The query run is audited automatically, storing the query, its metadata, the date of execution, query PID and suggested Citation. The Data Broker gets back the data, and suggested citation, containing Query Date and Query PID. 2) Some Time Later The Data Broker can reproduce the results received for the Query PID, by providing that PID to a special function to re-run the Query. The function gets the Query Date range using the provided PID, and then operates over the union of live data and all the historical versions of that data, picking only the versions of data valid at the time when the Query with the provided PID was executed. The Data Broker gets the same dataset back as they did before. Tardis picture credit: http://pc012.deviantart.com/art/TARDIS-Simple-Vector-481264558
  15. The nice feature we added is a way for the Data Broker to find out whether the version at the time of Query with a certain PID has actually changed over time, by comparing it to the current data version. If the data haven’t changed, the result of the function would be false. If the data have changed, the comparison will be provided to the Data Broker.
  16. Success of this project will be based on a number of factors including: 1. The integration of WGDC recommendations into the WU CBMI instance of i2b2 2. Feedback from existing i2b2 installations on recommended changes, RDA WGDC compliant code
  17. Success of this project will be based on a number of factors including: 1. The integration of WGDC recommendations into the WU CBMI instance of i2b2 2. Feedback from existing i2b2 installations on recommended changes, RDA WGDC compliant code
18. There are different ways to calculate the financial ROI. For one, it is to determine the hours of time we potentially save (or do not waste) trying to replicate a study. Given our calculations, it will take 14 replication studies to recoup the investment.
  19. https://github.com/cbmiwu
  20. WHOAS: Woods Hole Open Access Server Example: http://www.bco-dmo.org/dataset/3046 Linked Data URI: http://lod.bco-dmo.org/id/dataset/3046 NCEI: http://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0078575 http://data.nodc.noaa.gov/geoportal/catalog/search/search.page DOI: http://dx.doi.org/10.1575/1912/4847 http://dx.doi.org/10.1575/bco-dmo.4847 Also harvested by and discoverable from DataONE
  21. The larval krill dataset with measurements from 2001 and 2002. The full dataset has been archived at NCEI and assigned an accession number. Data used for a publication may only include a portion of the data. Ideally the appropriate subset of data would be published out and assigned a DOI, enabling subsequent retrieval of the exact data used for a publication.
  22. Opportunity to update our information model
  23. BCO-DMO publishes data to WHOAS and a DOI is assigned. As of June 2016, the BCO-DMO architecture supports data versioning. In the current BCO-DMO data management system architecture data are served via the BCO-DMO website, with parallel options for publishing data at the WHOAS Institutional Repository and archiving data at the appropriate NCEI national data center. In the current BCO-DMO system architecture a Drupal content management system provides access to the metadata catalog and direct access to data via a URL. The Drupal MySQL content is published out in a variety of forms one of which is Linked Data. When the data are declared ‘Final, no updates expected’, a package of metadata and data can be exported from the BCO-DMO system and submitted to the Woods Hole Open Access System (WHOAS). The WHOAS is an Institutional Repository (IR) hosted by the WHOI Data Library and Archives (DLA) which is part of the larger Marine Biological Laboratory and Woods Hole Oceanographic Institution (MBLWHOI) Library system located in Woods Hole, MA. A Digital Object Identifier (DOI) is assigned when the package is deposited in WHOAS, with reciprocal links entered at BCO-DMO and WHOAS to connect the package at WHOAS with the record in the BCO-DMO catalog. The DOI resolves to the WHOAS dataset landing page, and a Linked Data URI resolves to the BCO-DMO landing page for the same dataset. However, the current system only supports one version of a dataset. NCEI is the appropriate National Data Center in the US for ocean research data WHOAS is the local OAIS-compliant institutional repository
  24. Data managed by BCO-DMO are published at WHOAS (the Institutional Repository (IR) curated by the MBLWHOI Data Library and Archives (DLA)), archived at NCEI and harvested by DataONE (an NSF funded harvesting system for environmental data). New data version assigned a new DOI (handle is versioned if only metadata changes) New capability (implemented): procedure: when a BCO-DMO data set is updated … A copy of the previous version is preserved Request a DOI for the new version of data Publish data, and create new landing page for new version of data, with new DOI assigned BCO-DMO database has links to all versions of the data (archived and published) Both archive and published dataset landing pages have links back to best version of full dataset at BCO-DMO BCO-DMO data set landing page displays links to all archived and published versions
  25. Versioned dataset example: http://darchive.mblwhoilibrary.org/handle/1912/6421 original http://darchive.mblwhoilibrary.org/handle/1912/6421.1 indicates the metadata only changed DOI: 10.1575/1912/6421 Dataset URL: http://www.bco-dmo.org/dataset/3766 Maas - Pteropod respiration rates http://www.bco-dmo.org/dataset/644840 LOD URI: http://lod.bco-dmo.org/id/dataset/644840 Data published at WHOAS: doi: 10.1575/1912/bco-dmo.651474
  26. Published dataset DOI NSF award numbers BCO-DMO dataset URI (published out as Linked Data)
  27. http://www.bco-dmo.org/dataset/644840 LOD URI: http://lod.bco-dmo.org/id/dataset/644840 Data published at WHOAS: doi: 10.1575/1912/bco-dmo.651474 Linked from BCO-DMO dataset landing page to: http://dx.doi.org/10.1016/j.pocean.2015.07.001
  28. Linking data sets with publications using DOIs
  29. http://www.bco-dmo.org/dataset/644840 LOD URI: http://lod.bco-dmo.org/id/dataset/644840 BCO-DMO Dataset landing page with new citation button CC by 4.0 license with suggested Citation text Based on DOI Data published at WHOAS: doi: 10.1575/1912/bco-dmo.651474 Using the DOI citation formatter service: http://crosscite.org/citeproc/ Doesn’t work yet for our DOIs (cause we don’t have enough DOI metadata), but when it does, it will return something like this Cite as: Twining, B. (2016). “Element Quotas of Individual Synechococcus Cells Collected During Bermuda Atlantic Time-Series Study (BATS) Cruises Aboard the R/V Atlantic Explorer Between Dates 2012-07-11 and 2013-10-13”. Version 05/06/2016. Biological and Chemical Oceanography Data Management Office (BCO-DMO) Dataset. doi:10.1575/1912/bco-dmo.651474 [access date]
  30. More information from RDA site URL: https://rd-alliance.org/groups/data-citation-wg.html In addition to BCO-DMO, there are already 8 other pilot studies that have expressed interest in adopting the DC WG recommendations: ARGO Austrian Centre for Digital Humanities BCO-DMO Center for Biomedical Informatics (CBMI), Washington University, St. Louis Climate Change Centre Austria ENVRIplus Natural History Museum London Ocean Networks Canada Virtual Atomic and Molecular Data Centre
  31. Altman and Crosas. 2013. “Evolution of Data Citation http://iassistdata.org/sites/default/files/iq/iqvol371_4_altman.pdf Altman, M., & Crosas, M. (2013). The evolution of data citation: from principles to implementation. IAssist Quarterly, 37(1-4), 62-70. CODATA-ICSTI 2013 “Out of cite, out of mind” https://www.jstage.jst.go.jp/article/dsj/12/0/12_OSOM13-043/_article Data Science Journal Vol. 12 (2013) p. CIDCR1-CIDCR75 http://doi.org/10.2481/dsj.OSOM13-043 Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data R. E. Duerr, et al. 2011 “On the utility of identification schemes for digital earth science data” http://link.springer.com/article/10.1007/s12145-011-0083-6 DOI: 10.1007/s12145-011-0083-6 (online version, open access) Duerr, R. E., Downs, R. R., Tilmes, C., Barkstrom, B., Lenhardt, W. C., Glassy, J., ... & Slaughter, P. (2011). On the utility of identification schemes for digital earth science data: an assessment and recommendations. Earth Science Informatics, 4(3), 139-160.
  32. This gives us a unique advantage to house data that has nowhere else to go, and the staff and technical capacity to bring in that data This brings its own complexity in terms of data diversity