Photo gallery - SIGMA-GIZ Academies on QM - Stage 1.pdf
Research Data Management, Challenges and Tools - Per Öster
1. CSC – Suomalainen tutkimuksen, koulutuksen, kulttuurin ja julkishallinnon ICT-osaamiskeskus
Research Data Management,
Challenges andTools
Per Öster, CSC – IT Center for Science Ltd
2. • Drivers
o Number of devices
o Number of communicating apps
o Number of users
2.6.20172
10% of UK
power
consumption
due to ICT
1/3
network
1/3
devices
1/3
datacentres
Amount of data
3. 6/2/173
Analysis
Publication
ReviewConceptualisation
Data
gathering
Open
access
Scientific
blogs Collaborative
bibliographies
Alternative
Reputation
systems
Citizens
science
Open
code
Open
workflows
Open
annotation
Open
data
Pre-
print
Data-
intensive
2!
Sci-
starter.com
Runmycode.
org
ArXiv
Roar.eprints.
org
Impact Story
Altmetric.com
Mendeley.com
Academia.edu
Researchgate.com
Openannotation.org
Datadryad.org
Myexperiment.org
Figshare.com
An#emerging#
ecosystem#of#
services#and#
standards#
It's real!
The DCC Curation
Lifecycle Model
Description and
Representation Information
Preservation Planning
Community Watch and
Participation
Curate and Preserve
Conceptualise
Create or Receive
Appraise and Select
Ingest
Preservation Action
Store
Access, Use and Reuse
Transform
Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand
and render both the digital material and the associated metadata.
Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.
Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.
Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.
Conceive and plan the creation of data, including capture method and storage options.
Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.
Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.
Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.
Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.
Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning,
validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.
Store the data in a secure manner adhering to relevant standards.
Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.
Create new data from the original, for example
- By migration into a different format.
- By creating a subset, by selection or query, to create newly derived results, perhaps for publication.
www.dcc.ac.uk
info@dcc.ac.uk
The Curation Lifecycle
The DCC Curation Lifecycle Model provides a graphical high level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model can be used to plan activities within an organisation or consortium to
ensure that all necessary stages are undertaken, each in the correct sequence. The model enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement. It can help with
the process of identifying additional steps which may be required, or actions which are not required by certain situations or disciplines, and ensuring that processes and policies are adequately documented.
Data, any information in binary digital form, is at the centre of the Curation Lifecycle. This includes:
- Simple Digital Objects are discrete digital items; such as textual files, images or sound files, along with their related identifiers and metadata.
- Complex Digital Objects are discrete digital objects, made by combining a number of other digital objects, such as websites.
Structured collections of records or data stored in a computer system.
Full Lifecycle Actions
Sequential Actions
Data (Digital Objects or Databases)
Occasional Actions
Dispose
Reappraise
Migrate
Dispose of data, which has not been selected for long-term curation and preservation in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or
other custodian. In some instances data is destroyed. The data’s nature may, for legal reasons, necessitate secure destruction.
Return data which fails validation procedures for further appraisal and reselection.
Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data’s immunity from hardware or software obsolescence.
Digital Objects
Databases
5. Secure Compute Clouds
Supporting sample
logistics
• Federated Authentication
• Authorization
• Dataset registry
• Data transfer hub
• Policy and Legal Framework
Services and
Coordination
High speed encrypted
data transfer
GridFTP/Globus/Aspera
Secure data access remote API
( GA4GH )
Sequencing centers
Data
Users
EGA
at
Data Archiving
Bringing users
to data
Data Generation
Managing Access
Data Owner
Data Access Agreement
Data Access Committee
Data Request
Authorization Management Tools
( EGA and CSC REMS )
5
6. From field measurements to open data
2.6.20176
Questions:
Sensitive data?
Requirements on Authentication and authorization?
7. Instrument
Measuring PCs:
Raw data
at the stations
File servers at
stations:
Raw data and
field diaries,
cal documents
File servers
in Helsinki:
Raw and
intermediate
data,
documents,
scripts
SMEAR database:
Processed data
in Helsinki
ICOS, EBAS,...
databases:
Near real time
and processed data
outside UH
Routine data processing =
(- unit conversion)
- calibration correction
- quality check, gapfilling
- averaging over space or time
SMEAR
data flow
A/D conversion
unit conversion
IDA (CSC data
service):
Raw data &
document archive,
database datasets
Field
documentation
Researchers,
Data processing
server
Feedback on
data quality
Metadata
Metadata
Metadata
2.6.20177
https://avaa.tdata.fi/web/smart/smear
8. Support in All Phases of Research Process
20168
Plan
Customer Portal
Experts
Guides
Websites
Training
Service Desk
Produce
& Collect
Data
International
resources
Modelling
Software
Supercomputers
Analyse
Cloud Services
Training
Data science
Computing
Software
Store
B2SAFE
B2SHARE
HPC Archive
IDA
Databases
Research long-
term preservation
(LTP)
Share &
Publish
AVAA
B2DROP
B2SHARE
Databank
Etsin
Funet FileSender
11. ARTICLE 2: DISCLAIMER
1. The Service is provided "as is" and the Provider disclaims any and all
representations and warranties, whether express or implied, including;- but
not limited to;- implied warranties of title, merchantability, fitness for any
particular purpose or non-infringement. The Provider does not promise any
specific results, effects or outcome from the use of the Service.
2. …
3. The Provider reserves the right to change, reduce, interrupt or discontinue
the Service or parts of it at any time.
4. No one has a right to use the Service; the Provider reserves the right to
exclude certain Users.
12. Are the commercial services sufficient?
• Nice complement but can not serve as the fundamental infrastructure for research
data of national and international interest
• Need for publicly funded and operated infrastructure
2.6.201712
e-Science Data Factory
13. EUDAT CDI
“The EUDAT Collaborative Data
Infrastructure is a defined data model
and a set of technical standards and
policies adopted by European research
data centres and community data
repositories to create a single European
e-infrastructure of interoperable data
services.”
“To date, over 20 major European research organizations,
data and computing centres have signed an agreement to
sustain the EUDAT – pan European collaborative data
infrastructure for the next 10 years giving the birth to
the EUDAT Collaborative Data Infrastructure”
www.eudat.eu
15. Findable
– assign persistent IDs, provide rich metadata, register in a
searchable resource...
Accessible
– Retrievable by their ID using a standard protocol, metadata
remain accessible even if data aren’t...
Interoperable
– Use formal, broadly applicable languages, use standard
vocabularies, qualified references...
Reusable
– Rich, accurate metadata, clear licences, provenance, use of
community standards...
www.force11.org/group/fairgroup/fairprinciples
EUDAT is FAIR
17. B2FIND: multi-disciplinary metadata catalogue
now: common metadata catalogue, harvesting across
all CDI data, single point for data discoverability;
in development: aim to improve with agreed basic
metadata for all data objects.
B2HANDLE: policy-based prefix & PID management
now: common PID mechanism across all CDI data;
in development: aim to improve with agreed common
schema and behaviour.
B2SHARE: research data repository
now: full, tailored metadata support for data deposits.
B2SAFE: policy-driven data management
in development: aim to introduce a common data
model promoting metadata extraction and processing.
EUDAT & Findable
F
18. B2STAGE: data staging service
B2SHARE: research data repository
B2SAFE: policy-driven data management
now: EUDAT presents data through common Internet protocols and
APIs, http and gridftp;
in development: aim to improve with a single http API for all services
and data.
EUDAT & Accessible
A
19. B2HANDLE: policy-based prefix & PID management
now: common PID mechanism across all CDI data;
in development: aim to improve with agreed common
schema and behaviour.
B2STAGE: data staging service
B2SHARE: research data repository
B2SAFE: policy-driven data management
in development: single http API for all services and data
(interoperability of data services, if not data!).
B2FIND: multi-disciplinary metadata catalogue
in development: agreed basic metadata for all data
objects (a degree of metadata interop).
EUDAT & Interoperable
I
20. B2SHARE: research data repository
B2SAFE: policy-driven data management
now: encourage use of CC BY v 3 as common open data licence;
encourage open formats where we have any influence.
EUDAT & Reusable
R
21. Acknowledgment
• Tommi Nyrönen, ELIXIR-Finland Head of Node
• Mikael Linden, CSC
• TimmoVesala, INAR RI, Helsinki University
• Damien Lecarpentier, EUDAT Project Director
2.6.201721