
Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)


EUDAT and PRACE have joined forces to help research communities gain access to high-quality managed e-Infrastructures whose resources can be connected to enable cross-utilization use cases, and to make them accessible without any technical barrier. The capability to couple data and compute resources is considered one of the key factors in accelerating scientific innovation and advancing research frontiers. The goal of this session was to present the EUDAT services and the results of the collaboration activity achieved so far, and to deliver a hands-on session on how to write a Data Management Plan (DMP). The DMP is a useful instrument for researchers to reflect on and communicate about the way they will deal with their data. It prompts them to think about how they will generate, analyse and share data during their research project and afterwards.


  1. EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures, contract no. 654065. Linking HPC to Data Management. Stéphane COUTIN (CINES), Giuseppe Fiameni (CINECA). This work is licensed under the Creative Commons CC-BY 4.0 licence.
  2. Objectives: give a high-level presentation of research data management and the H2020 context; present a simple approach and draft a DMP for a given case.
  3. THE CHANGING DATA LANDSCAPE. Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox
  4. Data explosion. More and more data is being created. The issue is not creating data, but being able to navigate and use it. Data management is critical to make sure data are well-organised, understandable and reusable.
  5. Data loss. Digital data are fragile and susceptible to loss for a wide variety of reasons: natural disaster, facilities infrastructure failure, storage failure, server hardware/software failure, application software failure, format obsolescence, human error, malicious attack, loss of staffing competencies, loss of institutional commitment, loss of financial stability, changes in user expectations. Image CC-BY ‘Hard Drive 016’ by Jon Ross
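Several of these loss scenarios (storage failure, human error, malicious attack) are routinely detected by fixity checks against recorded checksums. A minimal sketch in Python; the function names are illustrative, not from any particular preservation tool:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks
    so arbitrarily large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """Return True if the file still matches the checksum recorded
    when it was deposited -- i.e. the bitstream is intact."""
    return sha256_of(path) == recorded_digest
```

In practice a repository stores the digest in the object's administrative metadata at ingest and re-verifies it on a schedule.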
  6. Data persistency issues. Link rot – more 404 errors generated over time. Reference rot* – link rot plus content drift, i.e. webpages evolving and no longer reflecting the original content cited. * Term coined by Hiberlink. Jonathan D. Wren, Bioinformatics 2008;24:1381-1385
  8. Why manage research data? To make your research easier! To stop yourself drowning in irrelevant stuff. In case you need the data later. To avoid accusations of fraud or bad science. To share your data for others to use and learn from. To get credit for producing it. Because funders or your organisation require it. Well-managed data opens up opportunities for re-use, integration and new science.
  9. H2020 open research data pilot • Already expanded from a select pilot to all work areas • All need to consider which data can be made open • Mantra: “As open as possible, as closed as necessary” • Underlying driver is good (FAIR) data management. Image CC-BY-SA by SangyaPundir
  10. Key requirements of the open data pilot. Beneficiaries participating in the Pilot will: deposit data in a research data repository of their choice; take measures to make it possible for others to access, mine, exploit, reproduce and disseminate the data free of charge; provide information about the tools and instruments necessary for validating the results (where possible, provide the tools and instruments themselves). /oa_pilot/h2020-hi-oa-data-mgt_en.pdf
  11. Suggested DMP creation process: analyse your project information system (suggestion: a Data Flow Diagram); apply FAIR principles; include data life cycle and time dimensions; estimate costs; iterate; get funders’ support; keep the DMP up to date.
  12. A simple diagram focusing on data dynamics; you can use other diagram types. DFD: Data Flow Diagram. Elements: data processing, data store, external interaction, data flow.
  13. Exercise – Phase 1. You and your team are submitting a proposal for a project in the domain of smart cities. The City has implemented a large set of sensors measuring traffic, and the data are collected in the City datacenter. You want to develop an application able to forecast the traffic, and also how it will be impacted by events like planned roadworks. This application would run on a PRACE site, not located in the City. On the PRACE site your storage space is limited to 10 TB. The application uses the following inputs:
     • Sensor historical data over the last 12 months: the sensors produce 1 TB of data a day. You implement a preprocessing module translating those data into a reduced data set (10 MB per day), based on a format you have defined to describe the traffic.
     • The results provided by the simulation. This enables comparison between forecasted and actual traffic in order to ‘train’ the application.
     • Weather data (historical and forecast) provided by the national meteorological agency. They use the SYNOP format. The volume is negligible.
     Results will be accessible by the city council employees. Create the project data flow diagram and fill in the data summary chapter using a table. What would you need in order to use the weather data efficiently?
  14. Data summary table, with columns: Dataset | Description | Origin? | Existing? | Format | Size | Who could use it?
  15. Proposed data flow diagram. Elements: sensors collection area, raw sensor data, data preprocessing, reduced sensor data, data transfer, PRACE HPC site (simulations, PRACE storage, input files, output files extractor), weather data, city council employees.
  16. Data summary table
     Dataset | Description | Origin / Existing? | Format | Size | Who could use it?
     Raw sensor data | | Available, collected from sensors | Various | 1 TB per day | |
     Reduced sensor data | Actual traffic, … | Extracted from raw sensor data | Binary (specific) | 10 MB a day | Our simulation
     Weather data | Actual and forecast | Existing; meteo open data platform | SYNOP | 1 MB a week | Our simulation; citizens, scientists, …
     Simulation results | Forecasted traffic | Results of our simulation | Binary (specific) | 10 MB a day | City council employees, our application
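The preprocessing module in the exercise, which reduces roughly 1 TB of raw sensor readings a day to a roughly 10 MB summary, amounts to an aggregation step. A hedged sketch: the (sensor_id, hour, vehicle_count) record shape and the hourly-mean summary are assumptions, since the exercise deliberately leaves the reduced format for you to define:

```python
from collections import defaultdict
from statistics import mean

def reduce_sensor_day(readings):
    """Collapse one day of raw (sensor_id, hour, vehicle_count) readings
    into per-sensor hourly averages -- the kind of reduction that turns
    ~1 TB of raw data into a ~10 MB daily summary.

    The record shape and summary layout are illustrative assumptions."""
    buckets = defaultdict(list)
    for sensor_id, hour, count in readings:
        buckets[(sensor_id, hour)].append(count)
    return {key: mean(counts) for key, counts in buckets.items()}
```

Only the reduced summary then needs to cross to the PRACE site, which is what keeps the 10 TB storage limit workable.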
  17. Research data lifecycle: creating data → processing data → analysing data → preserving data → giving access to data → re-using data.
     CREATING DATA: designing research, DMPs, planning consent, locating existing data, data collection and management, capturing and creating metadata.
     PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, managing and storing data.
     ANALYSING DATA: interpreting and deriving data, producing outputs, authoring publications, preparing for sharing.
     PRESERVING DATA: data storage, back-up and archiving, migrating to the best format and medium, creating metadata and documentation.
     ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data.
     RE-USING DATA: follow-up research, new research, undertaking research reviews, scrutinising findings, teaching and learning.
     Ref: UK Data Archive
  18. What is a digital object? A bitstream, a persistent identifier and metadata. Digital objects can be aggregated into digital collections.
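The slide's decomposition of a digital object (bitstream plus persistent identifier plus metadata, aggregated into collections) can be modelled directly. A minimal sketch; the class names and the Handle-style example PID are illustrative, not part of any EUDAT API:

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """A digital object: a bitstream plus a persistent identifier and metadata."""
    pid: str                    # persistent identifier, e.g. a Handle or DOI
    bitstream: bytes            # the raw content
    metadata: dict = field(default_factory=dict)

@dataclass
class Collection:
    """Digital objects aggregated into a collection, itself identified by a PID."""
    pid: str
    members: list = field(default_factory=list)

# Usage with hypothetical identifiers:
obj = DigitalObject(pid="hdl:11234/abc", bitstream=b"...",
                    metadata={"title": "Traffic data"})
coll = Collection(pid="hdl:11234/coll", members=[obj])
```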
  19. CDI Data Model
  20. Digital object example
  21. Data formats. A file format is a convention on how data are represented on a medium. It can be:
     Specified: a description of the convention exists and is sufficiently detailed to allow a complete implementation of it.
     Open: the convention is available without any restrictions on access or implementation.
     Standardized: the convention has been adopted by standardization agencies (ISO, W3C). Example: PDF/A. Wide adoption of a format can also lead it to be considered a standard, even if there is no official standard for it. Example: PDF.
     Proprietary: these formats depend on the existence of an owner. They can be published. Example: Word.
     The durability of a format depends on these criteria.
  22. FACILE: a format validation tool. Through a web interface, this tool verifies a file, in particular whether it is well-formed and valid against the specifications of its declared format, to determine whether it can be archived. You just have to upload the file you want to test; the file is then analysed by the tool, which automatically returns the result. If the file is not well-formed or not valid, tutorials to help correct the file are available to the user. If the problem is not resolved, the user can contact the CINES experts by e-mail. The list of the file formats accepted in PAC (the CINES Archiving Platform) is available on FACILE.
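The first step of the kind of check FACILE performs, identifying the format, can be illustrated with magic-byte signatures, though real validators go far beyond this to check well-formedness and validity against the full specification. A toy sketch, not FACILE's actual implementation; the signature table is a small illustrative subset of well-known magic bytes:

```python
# Well-known leading byte signatures ("magic bytes") for a few formats.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"PK\x03\x04": "ZIP container (e.g. DOCX, ODF)",
    b"\x89HDF\r\n\x1a\n": "HDF5",
}

def identify_format(data: bytes) -> str:
    """Return the name of the first signature the data starts with,
    or 'unknown'. Identification only -- says nothing about validity."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"
```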
  23. HPC data formats. Complexity and diversity of file formats: a few ‘pivot’ formats (HDF, NetCDF) and a lot of specific binary formats. Need to document the format: store or reference the documentation in the digital object; store or reference the code.
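One way to "store the documentation in the digital object" when a pivot format like HDF or NetCDF is not an option is to make the specific binary format self-describing, as those pivot formats are. A sketch of an illustrative convention (a length-prefixed JSON header describing the payload, followed by packed little-endian float64 values); the layout is an assumption for demonstration, not a standard:

```python
import json
import struct

def write_described_binary(path, values, description):
    """Write a specific binary format that carries its own documentation:
    4-byte little-endian header length, JSON header (doc, dtype, count),
    then the values packed as little-endian float64."""
    header = json.dumps({"doc": description, "dtype": "<f8",
                         "count": len(values)}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))
        f.write(header)
        f.write(struct.pack(f"<{len(values)}d", *values))

def read_described_binary(path):
    """Read the file back using only information stored inside it."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(hlen))
        values = list(struct.unpack(f"<{header['count']}d",
                                    f.read(8 * header["count"])))
    return header, values
```

A future reader can recover the payload from the header alone, which is exactly the durability property the slide asks for.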
  24. Licensing research data • Horizon 2020 guidelines point to CC-BY or CC-0 • The EUDAT licensing wizard helps you pick a licence for data & software (available in B2SHARE) • The DCC how-to guide helps you license data
  25. What is metadata? Commonly defined as ‘data about data’, metadata helps to make data findable and understandable. Metadata can be: descriptive: information about the content and context of the data; structural: information about the structure of the data; administrative: information about the file type, rights management and preservation processes.
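The three categories can be illustrated for a single hypothetical dataset; the field names below are examples chosen for this sketch, not a metadata standard:

```python
# Descriptive, structural and administrative metadata for one
# (hypothetical) reduced traffic dataset, as a plain dictionary.
metadata = {
    "descriptive": {     # content and context
        "title": "Reduced traffic sensor data",
        "creator": "Smart City project",
        "coverage": "2015-01-01/2015-12-31",
    },
    "structural": {      # how the data are organised
        "records": "one aggregate per sensor per hour",
        "fields": ["sensor_id", "hour", "mean_vehicle_count"],
    },
    "administrative": {  # file type, rights, preservation
        "format": "application/json",
        "license": "CC-BY-4.0",
        "checksum_algorithm": "SHA-256",
    },
}
```

In a real deposit these fields would map onto a community schema (e.g. Dublin Core for the descriptive part) rather than ad-hoc keys.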
  26. Why use metadata? Comprehensive metadata will: facilitate data discovery; help users determine the applicability of the data; enable interpretation and reuse; allow any limitations to be understood; clarify ownership and restrictions on reuse; offer permanence, as it transcends people and time; provide interoperability.
  27. The good and the bad
     More precise and standardised | Ambiguous
     Metres / second | Furlongs and fortnights
     2015-09-10T15:00:01+01:00 | 10th Sept. 2015 15:00:01
     Longitudinal wind speed | U
     PDF 1.7 | PDF
     2008 US population statistics | Population statistics
     Barcelona, Venezuela | Barcelona
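The timestamp pair shows why the standardised column wins: an ISO 8601 string with an explicit UTC offset parses unambiguously with standard tools, while the other form needs extra conventions to interpret. A small Python illustration:

```python
from datetime import datetime, timedelta

# "2015-09-10T15:00:01+01:00" is unambiguous: ISO 8601 with an explicit
# UTC offset, so any parser recovers the exact instant in time.
precise = datetime.fromisoformat("2015-09-10T15:00:01+01:00")
assert precise.utcoffset() == timedelta(hours=1)

# "10th Sept. 2015 15:00:01" carries no offset and uses a locale-dependent
# layout: a parser cannot interpret it without out-of-band rules about
# day/month order, abbreviations and the intended time zone.
```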
  28. Digital preservation context. Main risks deal with: comprehension, integrity, exploitation, valorization. Quality assurance procedures to be set up for: metadata, file formats, representation information, storage, access, technology watching.
  29. Digital preservation challenges. Set up quality assurance procedures to mitigate the impact of the four main identified risks when they occur.
     Challenge | Solutions
     Loss of content knowledge | Metadata; persistent, unique identifiers
     File format obsolescence | Handling of a limited set of durable formats; file format identification and validation; logical migration (format conversion)
     Storage media failure | Management of media ageing; physical migration
     Software or hardware disappearance | Technology watching, anticipation, proactivity
     More details at
  30. Certifications. Certification can help in selecting a repository. Certification focuses on: organizational infrastructure, digital object management, technology. It usually refers to the OAIS model.
  31. OAIS (Open Archival Information System) model: a framework for an archive, now ISO 14721. Defines a functional model and an information model.
  32. Repository certification: Data Seal of Approval. 16 quality guidelines for researchers and institutions that create digital research files, organizations that archive research files, and users of research data. The objectives of the Data Seal of Approval are to safeguard data, to ensure high quality and to guide reliable management of research data for the future, without requiring the implementation of new standards, regulations or high costs. The DSA: gives researchers and research sponsors the assurance that their research results will be stored in a reliable manner and can be reused; allows data repositories to archive and distribute research data efficiently; is part of a European Framework for Audit and Certification of Trusted Repositories. The process: online application and self-assessment against the 16 guidelines by the repository, followed by review by a member of the DSA Board.
  33. Formal certification: ISO 16363. ISO 16363, “Audit and certification of trustworthy digital repositories”, provides evaluation criteria for an auditor to judge whether a repository is trustworthy. Published in 2012; strongly based on the OAIS reference model. ISO 16919:2014, “Requirements for bodies providing audit and certification of candidate trustworthy digital repositories”, specifies requirements for bodies providing ISO 16363 audit and certification, including the detailed competences that auditors need.
  34. Thanks – any questions? Acknowledgements: thanks to Mark van de Sanden, Marjan Grootveld, Sarah Jones and Giuseppe Fiameni for some of the slides.