Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Research Data Management, Challenges and Tools - Per Öster


Published on

This presentation was delivered at the LEARN Final Conference in London on 5 May 2017.

Published in: Government & Nonprofit
  • Be the first to comment

Research Data Management, Challenges and Tools - Per Öster

  1. 1. CSC – Suomalainen tutkimuksen, koulutuksen, kulttuurin ja julkishallinnon ICT-osaamiskeskus Research Data Management, Challenges andTools Per Öster, CSC – IT Center for Science Ltd
  2. 2. • Drivers o Number of devices o Number of communicating apps o Number of users 2.6.20172 10% of UK power consumption due to ICT 1/3 network 1/3 devices 1/3 datacentres Amount of data
  3. 3. 6/2/173 Analysis Publication ReviewConceptualisation Data gathering Open access Scientific blogs Collaborative bibliographies Alternative Reputation systems Citizens science Open code Open workflows Open annotation Open data Pre- print Data- intensive 2! Sci- Runmycode. org ArXiv Roar.eprints. org Impact Story An#emerging# ecosystem#of# services#and# standards# It's real! The DCC Curation Lifecycle Model Description and Representation Information Preservation Planning Community Watch and Participation Curate and Preserve Conceptualise Create or Receive Appraise and Select Ingest Preservation Action Store Access, Use and Reuse Transform Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand and render both the digital material and the associated metadata. Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions. Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software. Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle. Conceive and plan the creation of data, including capture method and storage options. Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation. Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata. Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements. Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements. Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats. Store the data in a secure manner adhering to relevant standards. Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable. Create new data from the original, for example - By migration into a different format. - By creating a subset, by selection or query, to create newly derived results, perhaps for publication. The Curation Lifecycle The DCC Curation Lifecycle Model provides a graphical high level overview of the stages required for successful curation and preservation of data from initial conceptualisation or receipt. The model can be used to plan activities within an organisation or consortium to ensure that all necessary stages are undertaken, each in the correct sequence. The model enables granular functionality to be mapped against it; to define roles and responsibilities, and build a framework of standards and technologies to implement. It can help with the process of identifying additional steps which may be required, or actions which are not required by certain situations or disciplines, and ensuring that processes and policies are adequately documented. Data, any information in binary digital form, is at the centre of the Curation Lifecycle. This includes: - Simple Digital Objects are discrete digital items; such as textual files, images or sound files, along with their related identifiers and metadata. - Complex Digital Objects are discrete digital objects, made by combining a number of other digital objects, such as websites. Structured collections of records or data stored in a computer system. Full Lifecycle Actions Sequential Actions Data (Digital Objects or Databases) Occasional Actions Dispose Reappraise Migrate Dispose of data, which has not been selected for long-term curation and preservation in accordance with documented policies, guidance or legal requirements. Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data’s nature may, for legal reasons, necessitate secure destruction. Return data which fails validation procedures for further appraisal and reselection. Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data’s immunity from hardware or software obsolescence. Digital Objects Databases
  4. 4. 2.6.20174
  5. 5. Secure Compute Clouds Supporting sample logistics • Federated Authentication • Authorization • Dataset registry • Data transfer hub • Policy and Legal Framework Services and Coordination High speed encrypted data transfer GridFTP/Globus/Aspera Secure data access remote API ( GA4GH ) Sequencing centers Data Users EGA at Data Archiving Bringing users to data Data Generation Managing Access Data Owner Data Access Agreement Data Access Committee Data Request Authorization Management Tools ( EGA and CSC REMS ) 5
  6. 6. From field measurements to open data 2.6.20176 Questions: Sensitive data? Requirements on Authentication and authorization?
  7. 7. Instrument Measuring PCs: Raw data at the stations File servers at stations: Raw data and field diaries, cal documents File servers in Helsinki: Raw and intermediate data, documents, scripts SMEAR database: Processed data in Helsinki ICOS, EBAS,... databases: Near real time and processed data outside UH Routine data processing = (- unit conversion) - calibration correction - quality check, gapfilling - averaging over space or time SMEAR data flow A/D conversion unit conversion IDA (CSC data service): Raw data & document archive, database datasets Field documentation Researchers, Data processing server Feedback on data quality Metadata Metadata Metadata 2.6.20177
  8. 8. Support in All Phases of Research Process 20168 Plan Customer Portal Experts Guides Websites Training Service Desk Produce & Collect Data International resources Modelling Software Supercomputers Analyse Cloud Services Training Data science Computing Software Store B2SAFE B2SHARE HPC Archive IDA Databases Research long- term preservation (LTP) Share & Publish AVAA B2DROP B2SHARE Databank Etsin Funet FileSender
  9. 9. Termination. Company may terminate your access to all or any part of the Service at any time, with or without cause, with or without notice, effective immediately, which may result in the forfeiture and destruction of all information associated with your account, including User Submissions. If you wish to terminate your account, you may do so by following instructions available on the Site. Any fees paid hereunder are non-refundable. All provisions of the Terms of Use which by their nature should survive termination shall survive termination, including, without limitation, ownership provisions, warranty disclaimers, indemnity and limitations of liability.
  10. 10. ARTICLE 2: DISCLAIMER 1. The Service is provided "as is" and the Provider disclaims any and all representations and warranties, whether express or implied, including;- but not limited to;- implied warranties of title, merchantability, fitness for any particular purpose or non-infringement. The Provider does not promise any specific results, effects or outcome from the use of the Service. 2. … 3. The Provider reserves the right to change, reduce, interrupt or discontinue the Service or parts of it at any time. 4. No one has a right to use the Service; the Provider reserves the right to exclude certain Users.
  11. 11. Are the commercial services sufficient? • Nice complement but can not serve as the fundamental infrastructure for research data of national and international interest • Need for publicly funded and operated infrastructure 2.6.201712 e-Science Data Factory
  12. 12. EUDAT CDI “The EUDAT Collaborative Data Infrastructure is a defined data model and a set of technical standards and policies adopted by European research data centres and community data repositories to create a single European e-infrastructure of interoperable data services.” “To date, over 20 major European research organizations, data and computing centres have signed an agreement to sustain the EUDAT – pan European collaborative data infrastructure for the next 10 years giving the birth to the EUDAT Collaborative Data Infrastructure”
  13. 13. Findable – assign persistent IDs, provide rich metadata, register in a searchable resource... Accessible – Retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t... Interoperable – Use formal, broadly applicable languages, use standard vocabularies, qualified references... Reusable – Rich, accurate metadata, clear licences, provenance, use of community standards... EUDAT is FAIR
  14. 14. What is the EUDAT Service offer?
  15. 15. B2FIND: multi-disciplinary metadata catalogue now: common metadata catalogue, harvesting across all CDI data, single point for data discoverability; in development: aim to improve with agreed basic metadata for all data objects. B2HANDLE: policy-based prefix & PID management now: common PID mechanism across all CDI data; in development: aim to improve with agreed common schema and behaviour. B2SHARE: research data repository now: full, tailored metadata support for data deposits. B2SAFE: policy-driven data management in development: aim to introduce a common data model promoting metadata extraction and processing. EUDAT & Findable F
  16. 16. B2STAGE: data staging service B2SHARE: research data repository B2SAFE: policy-driven data management now: EUDAT presents data through common Internet protocols and APIs, http and gridftp; in development: aim to improve with a single http API for all services and data. EUDAT & Accessible A
  17. 17. B2HANDLE: policy-based prefix & PID management now: common PID mechanism across all CDI data; in development: aim to improve with agreed common schema and behaviour. B2STAGE: data staging service B2SHARE: research data repository B2SAFE: policy-driven data management in development: single http API for all services and data (interoperability of data services, if not data!). B2FIND: multi-disciplinary metadata catalogue in development: agreed basic metadata for all data objects (a degree of metadata interop). EUDAT & Interoperable I
  18. 18. B2SHARE: research data repository B2SAFE: policy-driven data management now: encourage use of CC BY v 3 as common open data licence; encourage open formats where we have any influence. EUDAT & Reusable R
  19. 19. Acknowledgment • Tommi Nyrönen, ELIXIR-Finland Head of Node • Mikael Linden, CSC • TimmoVesala, INAR RI, Helsinki University • Damien Lecarpentier, EUDAT Project Director 2.6.201721
  20. 20. Kuvat CSC:n arkisto ja Thinkstock CSC – IT Center for Science Ltd Per Öster Director, Research Infrastructures