EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065 www.eudat.eu
Linking HPC to Data Management
Stéphane COUTIN (CINES)
Giuseppe Fiameni (CINECA)
This work is licensed under the Creative
Commons CC-BY 4.0 licence
Objectives
High level presentation of research
data management and H2020 context
Present a simple approach and draft a
DMP for a given case.
THE CHANGING DATA LANDSCAPE
Image CC-BY-SA ‘data.path Ryoji.Ikeda - 3’ by r2hox www.flickr.com/photos/rh2ox/9990016123
Data explosion
More and more data is
being created
Issue is not creating
data, but being able to
navigate and use it
Data management is
critical to make sure
data are well-organised,
understandable and
reusable
Digital data are fragile and susceptible to loss for a wide variety of reasons
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
Format obsolescence
Human error
Malicious attack
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations
Data loss
Image CC-BY ‘Hard Drive 016’ by Jon Ross www.flickr.com/photos/jon_a_ross/1482849745
Link rot – more 404 errors
generated over time
Reference rot* – link rot
plus content drift i.e.
webpages evolving and
no longer reflecting
original content cited
* Term coined by Hiberlink http://hiberlink.org
Data persistency issues
Jonathan D. Wren Bioinformatics 2008;24:1381-1385
MANAGING & SHARING DATA
Why manage research data?
To make your research easier!
To stop yourself drowning in irrelevant stuff
In case you need the data later
To avoid accusations of fraud or bad science
To share your data for others to use and learn from
To get credit for producing it
Because funders or your organisation require it
Well-managed data opens up opportunities
for re-use, integration and new science
H2020 open research data pilot
• Already expanded from a select pilot to all work
areas
• All projects need to consider which data can be made
open
• Mantra = “As open as possible, as closed as
necessary”
• Underlying driver is good (FAIR) data
management
Image CC-BY-SA by SangyaPundir
Key requirements of the open data pilot
Beneficiaries participating in the Pilot will:
Deposit data in a research data repository of
their choice
Take measures to make it possible for others to
access, mine, exploit, reproduce and
disseminate the data free of charge
Provide information about tools and instruments
necessary for validating the results (where
possible, provide the tools and instruments
themselves)
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Suggested DMP creation process
Analyse your project Information System
Suggested tool: Data Flow Diagram
Apply FAIR principles
Include data life cycle and time dimensions
Estimate costs
Iterate
Get funders' support
Keep the DMP up to date
Simple diagram focusing on data dynamics
You can use other diagram types
DFD : Data Flow Diagram
[Figure legend: Data processing; Data store; External interaction; Data flow]
You and your team are submitting a proposal for a project in the domain of smart cities.
The City has deployed a large set of sensors measuring traffic. The data are collected
in the City datacenter.
You want to develop an application able to forecast the traffic and how it will
be impacted by events such as planned roadworks. This application would run on a PRACE
site, not located in the City. On the PRACE site your storage space is limited to 10 TB.
The application uses the following inputs:
• Historical sensor data over the last 12 months: the sensors produce 1 TB of data a day.
You implement a preprocessing module that translates those data into a reduced data set
(10 MB per day), based on a format you have defined to describe the traffic.
• The results provided by the simulation. These enable comparison between forecasted
and actual traffic in order to ‘train’ the application.
• Weather data (historical and forecast) provided by the national meteorological agency,
in the SYNOP format. The volume is negligible.
Results will be accessible to the city council employees.
Create the project data flow diagram and fill in the data summary chapter using a
table.
What would you need in order to use the weather data efficiently?
Exercise – Phase 1
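The volumes in the exercise can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes), a 365-day year, and variable names that are not part of the original exercise.

```python
# Rough storage estimate for the exercise (assumptions: decimal units, 365 days).
TB = 10**12
GB = 10**9
MB = 10**6
DAYS = 365                      # "last 12 months" approximated as 365 days

raw_per_day = 1 * TB            # sensors produce 1 TB a day
reduced_per_day = 10 * MB       # preprocessing output: 10 MB a day
quota = 10 * TB                 # storage available on the PRACE site

raw_total = raw_per_day * DAYS
reduced_total = reduced_per_day * DAYS

fits_raw = raw_total <= quota
fits_reduced = reduced_total <= quota

print(f"raw data over 12 months:     {raw_total / TB:.0f} TB, fits quota: {fits_raw}")
print(f"reduced data over 12 months: {reduced_total / GB:.2f} GB, fits quota: {fits_reduced}")
```

Under these assumptions the raw data (365 TB) cannot be stored on the PRACE site, while the reduced data set (3.65 GB) easily fits, which is why preprocessing has to happen before the transfer.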
Data summary table
| Dataset | Description | Origin? Existing? | Format | Size | Who could use it? |
Proposed data flow diagram
[Figure: data flow diagram. Labels: Sensors collection area; Raw sensor data; Data Preprocessing; Reduced sensor data; Data transfer; PRACE HPC Site; Input files; Simulations; PRACE Storage; Output files extractor; Weather data; City council employees]
Data summary table
| Dataset | Description | Origin? Existing? | Format | Size | Who could use it? |
| --- | --- | --- | --- | --- | --- |
| Raw sensor data | | Available, collected from sensors | Various | 1 TB per day | |
| Reduced sensor data | Actual traffic, … | Extracted from raw sensor data | Binary (specific) | 10 MB a day | Our simulation |
| Weather data | Actual and forecast | Existing. Meteo open data platform | SYNOP | 1 MB a week | Our simulation; citizens, scientists, .. |
| Simulation results | Forecasted traffic | Results of our simulation | Binary (specific) | 10 MB a day | City council employees, our application |
Research data lifecycle
CREATING DATA: designing research, DMPs, planning consent, locate existing data, data collection and management, capturing and creating metadata
PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, manage and store data
ANALYSING DATA: interpreting & deriving data, producing outputs, authoring publications, preparing for sharing
PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
RE-USING DATA: follow-up research, new research, undertake research reviews, scrutinising findings, teaching & learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
Bitstream
Persistent Identifier
Metadata
Digital objects can be
aggregated to digital
collections
What is a digital object?
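The three parts of a digital object and their aggregation into collections can be sketched as a small data structure. This is an illustration only, not the EUDAT CDI data model itself: the class names, the field names and the PID strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A digital object: a bitstream, a persistent identifier and metadata."""
    pid: str          # persistent identifier (hypothetical value below)
    bitstream: bytes  # the raw content
    metadata: dict    # descriptive information about the content


@dataclass
class DigitalCollection:
    """A collection aggregates digital objects by referencing their PIDs."""
    pid: str
    member_pids: list = field(default_factory=list)

    def add(self, obj: DigitalObject) -> None:
        # Store only the PID: the collection references its members.
        self.member_pids.append(obj.pid)


# Hypothetical example values, for illustration only.
day1 = DigitalObject("hdl:example/traffic-day-001", b"\x00\x01", {"title": "Reduced sensor data, day 1"})
day2 = DigitalObject("hdl:example/traffic-day-002", b"\x02\x03", {"title": "Reduced sensor data, day 2"})

collection = DigitalCollection("hdl:example/traffic-collection")
collection.add(day1)
collection.add(day2)
print(collection.member_pids)
```

Referencing members by PID rather than embedding them keeps the collection valid even if the objects are later moved to another storage location.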
CDI Data Model
Digital object example
A file format is a convention for how data are represented on a medium. It can be:
Specified: a description of the convention exists and is detailed enough to allow a complete
implementation of it;
Open: the convention is available without any
restrictions on access or implementation;
Standardized: the convention has been adopted
by standardization bodies (ISO, W3C). Example:
PDF/A.
Wide use of a format can also make it a de facto standard,
even if there is no official standard for
it. Example: PDF before it was standardized as ISO 32000 in 2008.
Proprietary: the format depends on the existence
of an owner. It can still be published. Example: Word.
The durability of a format depends on these
criteria.
Data formats
Through a web interface, this tool checks whether a file
is well-formed and valid against the specification of its declared
format, to determine whether it can be archived.
You just have to upload the file you want to test. The
file is then analyzed by the tool, which automatically
returns the result.
If the file is not well-formed or not valid, tutorials to help
correct the file are available to the user. If the
problem is not resolved, the user can contact the CINES
experts by e-mail.
The list of the file formats accepted in PAC (CINES
Archiving Platform) is available on FACILE
(https://facile.cines.fr/ )
FACILE : a format validation tool
Complexity and diversity of file formats
A few ‘pivot’ formats
HDF
NetCDF
Many application-specific binary formats
Need to document the format
Store or reference documentation in the digital
object
Store or reference code
HPC data formats
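A first step in format identification is to look at the leading bytes of a file. The sketch below is a minimal illustration and not how FACILE works: the function name and the signature table are assumptions, it only inspects offset 0 (HDF5 files with a user block carry their signature further into the file), and a matching signature does not prove that the file is valid.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also the container of NetCDF-4)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"%PDF-", "PDF"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


print(sniff_format(b"CDF\x01" + b"\x00" * 12))
print(sniff_format(b"my-own-binary-layout"))
```

An application-specific binary format will typically come back as "unknown", which illustrates why its documentation or reading code needs to be stored or referenced with the digital object.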
Licensing research data
• Horizon 2020 guidelines point to CC-BY or CC-0
• The EUDAT licensing wizard helps you pick a licence for data & software
(available in B2SHARE)
• DCC How-to guide helps you to license data
www.dcc.ac.uk/resources/how-guides/license-research-data
Commonly defined as ‘data about data’, metadata
helps to make data findable and understandable
Metadata can be:
Descriptive: information about the content and
context of the data
Structural: information about the structure of the
data
Administrative: information about the file type, rights
management and preservation processes
What is metadata?
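The three kinds of metadata can be pictured as sections of a single record. The sketch below is purely illustrative: the field names and values are hypothetical, loosely based on the reduced sensor data of the exercise, and do not follow any particular metadata standard.

```python
REQUIRED_SECTIONS = ("descriptive", "structural", "administrative")

# Hypothetical record, for illustration only.
record = {
    "descriptive": {
        "title": "Reduced traffic sensor data",
        "description": "Daily traffic summary extracted from raw sensor data",
    },
    "structural": {
        "layout": "one file per day",
        "format_documentation": "stored alongside the data",
    },
    "administrative": {
        "file_type": "binary (project-specific)",
        "rights": "to be agreed with the City",
    },
}


def missing_sections(candidate: dict) -> list:
    """Return the metadata sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not candidate.get(name)]


print(missing_sections(record))
print(missing_sections({"descriptive": {"title": "Weather data"}}))
```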
Comprehensive metadata will:
Facilitate data discovery
Help users determine the applicability of the data
Enable interpretation and reuse
Allow any limitations to be understood
Clarify ownership and restrictions on reuse
Offer permanence as it transcends people and time
Provide interoperability
Why use metadata?
The good and the bad
| More precise and standardised | Ambiguous |
| --- | --- |
| Metres / seconds | Furlongs and fortnight |
| 2015-09-10T15:00:01+01:00 | 10th Sept. 2015 15:00:01 |
| Longitudinal wind speed | U |
| PDF 1.7 | PDF |
| 2008 US Population statistics | Population statistics |
| Barcelona, Venezuela | Barcelona |
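The timestamp example can be reproduced with the Python standard library. The sketch below assumes the ambiguous date was recorded at UTC+01:00, as in the precise version; without that knowledge the offset could not be recovered, which is exactly the problem with the ambiguous form.

```python
from datetime import datetime, timedelta, timezone

# The ambiguous form carries no time zone: "10th Sept. 2015 15:00:01".
# Assumption: the value was recorded at UTC+01:00.
local_offset = timezone(timedelta(hours=1))
stamp = datetime(2015, 9, 10, 15, 0, 1, tzinfo=local_offset)

iso_text = stamp.isoformat()
print(iso_text)                      # 2015-09-10T15:00:01+01:00

# The ISO 8601 string can be parsed back without any guessing,
# and converted to UTC for comparison with other data sets.
parsed = datetime.fromisoformat(iso_text)
utc_text = parsed.astimezone(timezone.utc).isoformat()
print(utc_text)                      # 2015-09-10T14:00:01+00:00
```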
Digital preservation context
Main risks deal with:
• Comprehension
• Integrity
• Exploitation
• Valorization
Quality assurance
procedures to be set up for:
• Metadata
• File formats
• Representation information
• Storage
• Access
• Technology watch
Digital preservation challenges
Set up quality assurance procedures to mitigate the
impact of the four main identified risks when they
occur
| Challenge | Solutions |
| --- | --- |
| Loss of content knowledge | Metadata; persistent, unique identifiers |
| File format obsolescence | Handling of a limited set of durable formats; file format identification and validation; logical migration (format conversion) |
| Storage media failure | Management of media ageing; physical migration |
| Software or hardware disappearance | Technology watch, anticipation, proactivity |
More details at https://www.cines.fr/en/long-term-preservation/
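One common way to detect integrity problems, for example after a physical migration, is to store a checksum with each file and recompute it later. The sketch below is a minimal illustration using SHA-256 from the Python standard library; the function names are assumptions and this is not a description of the CINES procedures.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use constant for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: str, expected_digest: str) -> bool:
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == expected_digest
```

In practice the digest would be recorded in the metadata at deposit time and checked again after every copy or migration.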
Certifications
Certification can help in selecting a repository
Certification focuses on:
Organizational infrastructure
Digital object management
Technology
Usually refers to OAIS model
OAIS (Open Archival Information System) model
Framework for an archive, now ISO 14721
Defines a functional model and an information model
Repository certification: Data Seal of Approval
16 quality guidelines for researchers and institutions that create
digital research files, organizations that archive research files, and
users of research data.
The objectives of the Data Seal of Approval are to safeguard
data, to ensure high quality and to guide reliable management
of research data for the future without requiring the
implementation of new standards, regulations or high costs.
The DSA
Gives researchers and research sponsors the assurance that their
research results will be stored in a reliable manner and can be
reused
Allows data repositories to archive and distribute research
data efficiently
Is part of a European Framework for Audit and Certification of
Trusted Repositories
Online application and self-assessment of the 16 guidelines by the
repository
Review by a member of the DSA Board
Formal certification: ISO 16363
ISO 16363 – « Audit and certification of trustworthy
digital repositories »
Evaluation criteria for an auditor to judge whether a
repository is trustworthy
Published in 2012
Strongly based on OAIS reference model
ISO 16919:2014 – « Requirements for bodies
providing audit and certification of candidate
trustworthy digital repositories »
Specifies requirements for bodies providing ISO
16363 audit and certification, and details the
competences that auditors need
www.eudat.eu
Thanks – any questions?
Acknowledgements:
Thanks to Mark van de Sanden, Marjan Grootveld, Sarah Jones
and Giuseppe Fiameni for some of the slides

Linking HPC to Data Management - EUDAT Summer School (Giuseppe Fiameni, CINECA)
