Data management plans – EUDAT Best practices and case study | www.eudat.eu

Presentation given by Stéphane Coutin during the PRACE 2017 Spring School, a joint training event with the EU H2020 VI-SEEM project (https://vi-seem.eu/) organised by CaSToRC at The Cyprus Institute. Science, and more specifically projects using HPC, is facing a digital data explosion. Instruments and simulations are producing ever larger volumes; data can be shared, mined, cited, preserved… They are a great asset, but they face risks: we can run out of storage, we can lose them, they can be misused… To start this session, we will review why it is important to manage research data and how to do so by maintaining a Data Management Plan (DMP). This will be based on best practices from the EUDAT H2020 project and European Commission recommendations. During the second part, we will interactively draft a DMP for a given use case.


  1. 1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu Data Management Plans – EUDAT best practices and case study Stéphane COUTIN (CINES) 26 April 2017 This work is licensed under the Creative Commons CC-BY 4.0 licence
  2. 2. Objectives: high-level presentation of research data management and the H2020 context; present a simple approach and draft a DMP for a given case; overview of EUDAT services.
  3. 3. CINES
  4. 4. Centre Informatique National de l’Enseignement Supérieur. Based in Montpellier, France; approx. 60 people (engineers, techs, admin). Created in 1999, formerly CNUSC (Centre National Universitaire Sud de Calcul), itself created in 1980. Administered and funded by the Ministry of Higher Education & Research (MESR). Provides the French public research community with computing resources, services and expertise. 3 main mandates / activities: high performance computing, digital preservation, hosting.
  5. 5. Stéphane COUTIN (coutin@cines.fr). Research engineer at CINES since 2013. Initially in the digital preservation dept, now working in the HPC dept. Involved in EU projects: leading the PRACE collaboration task with other e-Infrastructures; working on EUDAT for collaboration with PRACE. Background in Information Systems project and programme management.
  6. 6. EUDAT – www.eudat.eu Image CC-BY-NC ‘Data centre’ by Bob Mical www.flickr.com/photos/small_realm/15995555571
  7. 7. A pan-European e-Infrastructure solution for pan-European RI data challenges. All RIs are facing data challenges: Where to store the growing amount of data? How to find it? How to make the most of it? Solutions are needed at pan-European level. We need to promote synergies: some services are common to many communities; costs and investments can be optimised; better integration of e-infras and research infrastructures can be achieved.
  8. 8. e-Science Data Factory EUDAT2020 - 35 Partners
  9. 9. A truly pan-European Infrastructure EUDAT offers common data services, supporting multiple research communities as well as individuals, through a geographically distributed, resilient network of 35 European organisations The EUDAT vision is to enable European researchers and practitioners from any research discipline to preserve, find, access, and process data in a trusted environment, as part of a Collaborative Data Infrastructure
  10. 10. PRACE – EUDAT collaboration. Joint open calls for proposals: EUDAT offers data services and resources through regular PRACE calls; the review process is transparent to users. Joint training activities. Continuous technical discussion and development of new components: definition of the EUDAT Workspace area; synchronization of authentication credentials for single sign-on.
  11. 11. Quick question: Think of your ongoing or upcoming HPC project. What are your data-related requirements? What is the budget for them?
  12. 12. THE CHANGING DATA LANDSCAPE Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox www.flickr.com/photos/rh2ox/9990016123
  13. 13. Data explosion More and more data is being created Issue is not creating data, but being able to navigate and use it Data management is critical to make sure data are well-organised, understandable and reusable
  14. 14. Data loss. Digital data are fragile and susceptible to loss for a wide variety of reasons: natural disaster; facilities infrastructure failure; storage failure; server hardware/software failure; application software failure; format obsolescence; human error; malicious attack; loss of staffing competencies; loss of institutional commitment; loss of financial stability; changes in user expectations. Image CC-BY ‘Hard Drive 016’ by Jon Ross www.flickr.com/photos/jon_a_ross/1482849745
  15. 15. Data persistency issues. Link rot: more 404 errors generated over time. Reference rot*: link rot plus content drift, i.e. webpages evolving and no longer reflecting the original content cited. * Term coined by Hiberlink (http://hiberlink.org). Jonathan D. Wren, Bioinformatics 2008;24:1381-1385
  16. 16. CINES
  17. 17. Why manage research data? To make your research easier! To stop yourself drowning in irrelevant stuff In case you need the data later To avoid accusations of fraud or bad science To share your data for others to use and learn from To get credit for producing it Because funders or your organisation require it Well-managed data opens up opportunities for re-use, integration and new science
  18. 18. H2020 open research data pilot • Already expanded from a select pilot to all work areas • All need to consider which data can be made open • Mantra = “As open as possible as closed as necessary” • Underlying driver is good (FAIR) data management Image CC-BY-SA by SangyaPundir
  19. 19. Key requirements of the open data pilot. Beneficiaries participating in the Pilot will: deposit data in a research data repository of their choice; take measures to make it possible for others to access, mine, exploit, reproduce and disseminate the data free of charge; provide information about tools and instruments necessary for validating the results (where possible, provide the tools and instruments themselves). http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
  20. 20. Suggested DMP creation process
  21. 21. Simple diagram focusing on data dynamics: the DFD (Data Flow Diagram). You can use other diagram types.
  22. 22. Exercise – Phase 1. You and your team are submitting a proposal for a project in the domain of smart cities. The City has deployed a large set of sensors measuring traffic. The data are collected in the City datacenter. You want to develop an application able to forecast the traffic and how it will be impacted by events such as planned roadworks. This application would run on a PRACE site, not located in the City. On the PRACE site your storage space is limited to 10 TB. The application uses the following inputs: (1) Historical sensor data over the last 12 months: sensors produce 1 TB of data a day. You implement a preprocessing module that translates those data into a reduced data set (10 MB per day), based on a format you have defined to describe the traffic. (2) The results provided by the simulation, which enable comparison between forecast and actual traffic in order to ‘train’ the application. (3) Weather data (historical and forecast) provided by the national meteorological agency, in SYNOP format; the volume is negligible. Results will be accessible to city council employees. Tasks: create the project data flow diagram and fill in the data summary chapter using a table. What would you need in order to use the weather data efficiently?
  23. 23. Data summary table. Columns: Dataset; Description; Origin? Existing?; Format; Size; Who could use it?
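The storage constraint in this exercise can be checked with a few lines of arithmetic. The sketch below uses only the volumes stated on the slide; the decimal unit convention (1 TB = 10^12 bytes) and the reading of "12 months" as 365 days are assumptions made for illustration.

```python
# Volumes come from the exercise statement.
# Assumptions: decimal units, and "12 months" taken as 365 days.
TB = 10**12
GB = 10**9
MB = 10**6

RAW_PER_DAY = 1 * TB        # raw sensor output per day
REDUCED_PER_DAY = 10 * MB   # preprocessed data set per day
QUOTA = 10 * TB             # storage available on the PRACE site
HISTORY_DAYS = 365

raw_total = RAW_PER_DAY * HISTORY_DAYS
reduced_total = REDUCED_PER_DAY * HISTORY_DAYS
days_of_raw_in_quota = QUOTA // RAW_PER_DAY

print(f"Raw history:     {raw_total / TB:.0f} TB")
print(f"Reduced history: {reduced_total / GB:.2f} GB")
print(f"Quota holds {days_of_raw_in_quota} days of raw data")
```

Under these assumptions the raw history is far larger than the quota while the reduced history is a small fraction of it, which suggests running the preprocessing close to the sensors and transferring only the reduced data set.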
  24. 24. Proposed data flow diagram Sensors collection area PRACE HPC Site Simulations PRACE Storage Output files extractor Input files Raw sensor data Data Preprocessing Reduced sensor data Weather data City council employees Data transfer
  25. 25. Data summary table
      Dataset | Description | Origin? Existing? | Format | Size | Who could use it?
      Raw sensor data | | Available, collected from sensors | Various | 1 TB per day |
      Reduced sensor data | Actual traffic, … | Extracted from raw sensor data | Binary (specific) | 10 MB a day | Our simulation
      Weather data | Actual and forecast | Existing. Meteo open data platform | SYNOP | 1 MB a week | Our simulation; citizens, scientists, ..
      Simulation results | Forecasted traffic | Results of our simulation | Binary (specific) | 10 MB a day | City council employees, our application
  26. 26. Research data lifecycle. CREATING DATA: designing research, DMPs, planning consent, locating existing data, data collection and management, capturing and creating metadata. PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, managing and storing data. ANALYSING DATA: interpreting and deriving data, producing outputs, authoring publications, preparing for sharing. PRESERVING DATA: data storage, back-up and archiving, migrating to best format and medium, creating metadata and documentation. GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data. RE-USING DATA: follow-up research, new research, undertaking research reviews, scrutinising findings, teaching and learning. Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
  27. 27. Making data FAIR. Findable: assign persistent IDs, provide rich metadata, register in a searchable resource,... Accessible: retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t... Interoperable: use formal, broadly applicable languages, use standard vocabularies, qualified references... Reusable: rich, accurate metadata, clear licences, provenance, use of community standards... FAIR for machines as well as people. www.force11.org/group/fairgroup/fairprinciples
  28. 28. What is a digital object? Bitstream, persistent identifier, metadata. Digital objects can be aggregated into digital collections.
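A data summary table is easier to keep up to date when it is held as structured records rather than as free text. The following is a minimal sketch that stores three of the rows from the slide and exports them as CSV; the class name `DatasetEntry`, the field names and the helper `to_csv` are illustrative choices, not part of any EUDAT or H2020 template.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DatasetEntry:
    """One row of the data summary table (field names are illustrative)."""
    dataset: str
    description: str
    origin: str
    data_format: str
    size: str
    users: str


# Values copied from the slide; only the rows with every column filled in.
ENTRIES = [
    DatasetEntry("Reduced sensor data", "Actual traffic",
                 "Extracted from raw sensor data", "Binary (specific)",
                 "10 MB a day", "Our simulation"),
    DatasetEntry("Weather data", "Actual and forecast",
                 "Existing. Meteo open data platform", "SYNOP",
                 "1 MB a week", "Our simulation; citizens, scientists"),
    DatasetEntry("Simulation results", "Forecasted traffic",
                 "Results of our simulation", "Binary (specific)",
                 "10 MB a day", "City council employees, our application"),
]


def to_csv(entries):
    """Serialise the entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DatasetEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


print(to_csv(ENTRIES))
```

Keeping the table in a machine-readable form like this makes it straightforward to regenerate the DMP chapter whenever a dataset is added or its size estimate changes.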
  29. 29. Digital object example
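The three parts of a digital object named on the previous slide (bitstream, persistent identifier, metadata) map naturally onto a small data structure. This is a minimal sketch only: the class `DigitalObject` and the helper `mint_local_pid` are hypothetical names, and the locally generated identifier merely stands in for a real PID (such as a Handle or DOI), which would be issued by a registration service.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalObject:
    """A bitstream plus its identifier and descriptive metadata."""
    bitstream: bytes
    pid: str
    metadata: dict

    @property
    def checksum(self) -> str:
        # SHA-256 of the bitstream, usable later for integrity checks.
        return hashlib.sha256(self.bitstream).hexdigest()


def mint_local_pid() -> str:
    """Hypothetical placeholder: real PIDs come from a registration service."""
    return f"local:{uuid.uuid4()}"


# A digital collection is simply an aggregation of digital objects.
collection = [
    DigitalObject(b"forecast day 1", mint_local_pid(), {"title": "Forecast, day 1"}),
    DigitalObject(b"forecast day 2", mint_local_pid(), {"title": "Forecast, day 2"}),
]

for obj in collection:
    print(obj.pid, obj.metadata["title"], obj.checksum[:12])
```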
  30. 30. Metadata and documentation. Metadata and documentation are needed to locate and understand research data. Think about what others would need in order to find, evaluate, understand, and reuse your data. Get others to check the metadata to improve quality. Use standards to enable interoperability.
  31. 31. What makes metadata good? Use of standards; controlled vocabularies for unambiguous keywords; simple, complete and consistent information; appropriate description; explanation of limitations to support reuse. Avoid special characters, e.g. !@<~ etc. Provide persistent identifiers such as DOIs.
  32. 32. The good and the bad. More precise and standardised: Metres / seconds; 2015-09-10T15:00:01+01:00; Longitudinal wind speed; PDF 1.7; 2008 US Population statistics; Barcelona, Venezuela. Ambiguous: Furlongs and fortnight; 10th Sept. 2015 15:00:01; U; PDF; Population statistics; Barcelona.
  33. 33. Digital preservation context. Main risks concern: comprehension, integrity, exploitation, valorization. Quality assurance procedures to be set up for: metadata, file formats, representation information, storage, access, technology watch.
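Several of these criteria can be checked automatically before a record is published. The sketch below is one possible check, not an established validator: the required field names and the small controlled vocabulary are hypothetical, and the set of special characters is limited to the four shown on the slide.

```python
import re

# Hypothetical choices for illustration; a real project would take these
# from its community metadata standard.
REQUIRED_FIELDS = ("title", "description", "identifier", "keywords")
VOCABULARY = {"traffic", "weather", "simulation"}
SPECIAL_CHARS = re.compile(r"[!@<~]")  # the characters listed on the slide


def check_metadata(record):
    """Return a list of human-readable problems found in a metadata record."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            problems.append(f"missing field: {name}")
    for name, value in record.items():
        if isinstance(value, str) and SPECIAL_CHARS.search(value):
            problems.append(f"special character in: {name}")
    for keyword in record.get("keywords") or []:
        if keyword not in VOCABULARY:
            problems.append(f"keyword not in vocabulary: {keyword}")
    return problems


good = {
    "title": "Forecast traffic, city centre",
    "description": "Daily traffic forecast produced by the simulation",
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "keywords": ["traffic", "simulation"],
}
bad = {
    "title": "Traffic <draft>",
    "description": "",
    "identifier": "x",
    "keywords": ["cars"],
}

print(check_metadata(good))
print(check_metadata(bad))
```

A check like this does not replace review by a colleague, but it catches missing fields and uncontrolled keywords early.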
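The precise timestamp on this slide follows ISO 8601, which records the UTC offset alongside the date and time. As a small illustration using only the Python standard library, the snippet below produces that exact string and shows that it can be parsed back and converted to UTC without guesswork, which the ambiguous form does not allow.

```python
from datetime import datetime, timedelta, timezone

# 10 September 2015, 15:00:01, in a zone one hour ahead of UTC.
local_time = datetime(2015, 9, 10, 15, 0, 1, tzinfo=timezone(timedelta(hours=1)))

stamp = local_time.isoformat()
print(stamp)  # 2015-09-10T15:00:01+01:00

# The string round-trips and converts to UTC unambiguously.
parsed = datetime.fromisoformat(stamp)
utc_time = parsed.astimezone(timezone.utc)
print(utc_time.isoformat())
```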
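The integrity risk is commonly addressed with fixity checks: a checksum is recorded when a file is deposited and recomputed later to detect silent changes. The following is a minimal sketch of that idea with SHA-256 from the standard library; the function names `file_sha256` and `verify_fixity` are illustrative, and this is not a description of how any particular archive implements it.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at deposit."""
    return file_sha256(path) == recorded_checksum


# Demonstration on a temporary file: record a checksum, then alter one byte.
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "result.bin"
    sample.write_bytes(b"forecast traffic")
    recorded = file_sha256(sample)
    intact_before = verify_fixity(sample, recorded)

    sample.write_bytes(b"forecast traffiC")
    intact_after = verify_fixity(sample, recorded)

print(intact_before, intact_after)
```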
  40. 40. Exercise phase 2 updated DFD
  41. 41. EUDAT services
  42. 42. CDI Data Domain. EUDAT Data Domain modelled on the ANDS [1] Data Curation Continuum. [1] Australian National Data Service – www.ands.org.au
  43. 43. B2DROP – personal cloud (b2drop.eudat.eu): Sync and Share Research Data. An ideal solution for researchers and scientists to: store and exchange data with colleagues and team members, including research data not finalized for publishing; share data with fine-grained access controls; synchronize multiple versions of data across different devices. Features: 20 GB storage per user; living objects, so no PIDs; versioning and offline use; desktop synchronisation.
  44. 44. B2SHARE – repository (b2share.eudat.eu): Store and Publish Research Data. A winning solution for researchers, scientists and communities to: store data safely at a trusted and certified data centre; preserve data to guarantee long-term persistence; control access and share data with colleagues and the world. Features: metadata management; permanent PIDs; Open Access support.
  45. 45. B2SAFE – preservation (eudat.eu/b2safe): Replicate Research Data Safely. The ideal solution for communities with no facility for archival to: replicate research data into secure data stores; archive and preserve research data in the long term; bring data close to powerful compute resources; co-locate data with different communities; benefit from economies of scale. Features: large-scale storage; robust and highly available; permanent PIDs.
  46. 46. B2STAGE – transfer (eudat.eu/b2stage): Get Data to Computation. Facilitating communities to: move large amounts of data between data stores and high-performance compute resources; re-ingest computational results back into EUDAT; deposit large data sets onto EUDAT resources for long-term preservation. Features: high-speed transfer; reliable and light-weight; manages permanent PIDs.
  47. 47. B2FIND – catalogue (b2find.eudat.eu): Find Research Data. A metadata catalogue service to: seek data objects and collections using powerful metadata searches; catalogue community data by means of selected metadata; browse through multi-disciplinary data collections filtered by content, provenance and temporal keywords. Features: simple to use; standards-based; comprehensive catalogue.
  48. 48. Update the dataflow diagram with EUDAT services you could use for preservation and metadata publication. Exercise phase 3
  49. 49. DFD with other EUDAT services Simulations PRACE Storage Output files extractor Input files EUDAT B2SHARE B2SHARE storage Web front end Or API Results Traffic data Data extractor (uses API) Publication (uses API) Actual traffic Forecast traffic Citizens, researchers, companies, ... Search and retrieve data EUDAT Site EUDAT B2SAFE storage EUDAT B2SAFE Data and metadata EUDAT B2FIND Metadata Metadata search Replication
  50. 50. www.eudat.eu Thanks – any questions? Acknowledgements: Thanks to Mark van de Sanden, Marjan Grootveld, Sarah Jones and Giuseppe Fiameni for some of the slides