Data discovery through federated dataset catalogs (Valeria Pesce)
The document discusses data discovery through federated dataset catalogs. It notes that there are many institutional and thematic catalogs that need to be searched. Federated metadata catalogs and secondary catalogs can help by allowing searches across multiple primary catalogs. Good metadata practices at both the primary catalog and secondary catalog levels are important for discovery, including using open, interoperable standards and vocabularies. The document examines current practices for agricultural data discovery and identifies opportunities for improving metadata quality and standards to enhance discovery across institutional boundaries.
Semantic challenges in sharing dataset metadata and creating federated dataset catalogs (Valeria Pesce)
This document discusses the semantic challenges involved in sharing metadata about datasets and creating federated catalogs of datasets. It outlines the types of semantics needed to fully describe datasets, including vocabularies for topics, file formats, data structures, and geographic coverage. It provides examples of how different dataset catalogs use semantics inconsistently. The CIARD RING project aims to address these challenges by developing a common metadata application profile and linking dataset metadata to controlled vocabularies and ontologies using Linked Open Data principles. This allows queries across datasets from different sources to leverage the semantic mappings.
Dataset description: DCAT and other vocabularies (Valeria Pesce)
This document discusses metadata needed to describe datasets for applications to find and understand them when stored in data catalogs or repositories. It examines existing dataset description vocabularies like DCAT and their limitations in fully capturing necessary metadata.
Key points made:
- Machine-readable metadata is important for datasets to be discoverable and usable by applications when stored across repositories.
- Metadata should describe the dataset, distributions, dimensions, semantics, protocols/APIs, subsets etc.
- Vocabularies like DCAT provide some metadata but don't fully cover dimensions, semantics, protocols/APIs or subsets.
- No single vocabulary or data catalog solution currently provides all necessary metadata for full semantic interoperability.
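The gap described above can be made concrete with a minimal sketch of a DCAT-style dataset description expressed as a JSON-LD dictionary in Python. All values are hypothetical, and the comments mark which kinds of metadata core DCAT does and does not cover.

```python
# Minimal DCAT-style dataset description as a JSON-LD dict (hypothetical values).
# Core DCAT handles the catalog/dataset/distribution level well; the gaps noted
# above (dimensions, column-level semantics, APIs, subsets) are not covered.
import json

dcat_record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:title": "Example crop-yield dataset",       # discovery metadata: covered
    "dct:description": "Annual yields, 2000-2015",   # covered
    "dcat:keyword": ["agriculture", "yield"],        # covered
    "dcat:distribution": [{                          # access metadata: covered
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "http://example.org/yields.csv",
        "dct:format": "text/csv",
    }],
    # Not expressible in core DCAT: per-column semantics, dataset dimensions,
    # query protocols/APIs, or addressable subsets -- other vocabularies needed.
}

print(json.dumps(dcat_record, indent=2))
```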
How to describe a dataset. Interoperability issues (Valeria Pesce)
Presented by Valeria Pesce during the pre-meeting of the Agricultural Data Interoperability Interest Group (IGAD) of the Research Data Alliance (RDA), held on 21 and 22 September 2015 in Paris at INRA.
This document summarizes a session from the Force 11 Scholarly Communications Institute Summer School on data discovery. The session covered metadata, including what it is, types of metadata, and standards. It discussed how people search for and find data through various sources. The session also explored the FAIR data principles of findable, accessible, interoperable and reusable data and had breakout groups discuss applying these principles in practice.
An introduction to the FAIR principles and a discussion of key issues that must be addressed to ensure data is findable, accessible, interoperable and re-usable. The session explored the role of the CDISC and DDI standards for addressing these issues.
Presented by Gareth Knight at the ADMIT Network conference, organised by the Association for Data Management in the Tropics, in Antwerp, Belgium on December 1st 2015.
A presentation of the Dutch Techcentre for Life Sciences FAIR Data ecosystem, given at the BlueBridge workshop, a pre-event of the Research Data Alliance's 9th Plenary.
This document discusses the development of the DATS (Data Tag Suite), which is needed for DataMed to index data sources in a scalable way, similar to how JATS indexes literature for PubMed. The DATS model was developed through a community-driven process involving use cases and existing metadata schemas. It includes core and extended elements to describe datasets and other digital research objects. The model is designed around the dataset entity and serialized in JSON and JSON-LD mapped to schema.org to increase visibility, accessibility, and searchability. Efforts are ongoing to further align DATS with schema.org and integrate it with related metadata standards and tools.
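The core/extended split described above can be sketched in a few lines: a record must carry a small set of core elements, while extended elements stay optional. The element names below are illustrative only, not the official DATS specification.

```python
# Sketch of a core/extended element check in the spirit of DATS.
# The core set here is hypothetical, not the official DATS schema.
CORE = {"title", "types", "identifier"}

def missing_core_elements(record: dict) -> set:
    """Return which (hypothetical) core elements a record lacks."""
    return CORE - record.keys()

record = {"title": "Example transcriptomics study", "types": ["RNA-seq"]}
print(missing_core_elements(record))   # the record lacks an identifier
```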
URM concept for sharing information within communities (Karel Charvat)
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
BioPharma and FAIR Data, a Collaborative Advantage (Tom Plasterer)
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
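The faceted searching mentioned above boils down to counting how many resources fall under each value of a facet, so the interface can show filter counts. A minimal sketch, with hypothetical records and facet names:

```python
# Sketch of facet counting for a faceted-search UI (hypothetical records).
from collections import Counter

RESOURCES = [
    {"name": "antibody registry",      "type": "database", "domain": "diabetes"},
    {"name": "mouse phenotype portal", "type": "database", "domain": "kidney"},
    {"name": "NGS analysis toolkit",   "type": "software", "domain": "kidney"},
]

def facet_counts(records, facet):
    """Count records per value of the given facet field."""
    return Counter(r[facet] for r in records)

print(facet_counts(RESOURCES, "type"))   # Counter({'database': 2, 'software': 1})
```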
This document provides an overview of the FAIR data principles and the FAIR data ecosystem. It explains that FAIR data aims to support communities in publishing and utilizing scientific data and knowledge in a findable, accessible, interoperable, and reusable manner. It then describes the different levels of the FAIR data ecosystem: normative principles, standards in the FAIR data protocol, FAIR data resources that comply with these standards, and systems and tools that use FAIR data. It provides examples of converting raw data into FAIR data resources and potential applications of a FAIR data ecosystem.
This document discusses metadata, which is defined in multiple ways such as "data about data" or "information that describes content". It describes the main types of metadata including descriptive, administrative, preservation, technical, and use metadata. Descriptive metadata facilitates discovery, identification, and selection of resources, while administrative metadata manages access and use. Preservation and technical metadata document how digital resources are created and maintained over time.
The Nuclear Receptor Signaling Atlas (NURSA) is partnering with dkNET (the NIDDK Information Network) to host a dataset challenge, and we invite you to join! Everyone is talking about Big Data. How can we ensure that the impact of individual scientists, working on a myriad of small and focused studies that discover and probe new phenomena, is not lost in the Big Data world? In fact, there is more than one way to generate big data, and we would like your help in creating and expanding "big data" for NIDDK! In this 30-minute webinar, the dkNET team will give an overview of the challenge task, show how to use dkNET to find research resources, and share top tips!
Metadata provides consistency, clarity, and data lineage. It defines structured information about documents and content like author, title, and keywords. There are three main types - descriptive, structural, and administrative. Metadata is used to identify, manage, retrieve, and track usage of content. It provides consistency of definitions between different terminology, clarity of relationships between entities, and clarity of where data originated from and how it has changed over time.
This document provides an overview of metadata standards, including their purpose and types. It describes the MARC 21 and Dublin Core metadata standards in detail. MARC 21 is the predominant bibliographic standard, with formats for bibliographic data, holdings, and authority data. It exists in both MARC 21 and MARCXML syntaxes. Dublin Core is a simpler standard for resource discovery with 15 basic elements. It includes both simple and qualified versions with controlled vocabularies. The document lists several metadata standards and development organizations.
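The 15 elements of simple (unqualified) Dublin Core mentioned above are small enough to list in full. A minimal sketch that checks a hypothetical record against them:

```python
# The 15 elements of simple Dublin Core, with a hypothetical record checked
# against them. Unqualified DC allows any element to repeat or be absent.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def is_simple_dc(record: dict) -> bool:
    """True if every field in the record is a simple Dublin Core element."""
    return set(record) <= DC_ELEMENTS

record = {"title": "Metadata standards overview",
          "creator": "Example Author",
          "date": "2015"}
print(is_simple_dc(record))
```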
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely promoted as a basis for a shared research data infrastructure. Nevertheless, researchers involved in next-generation sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover, the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository is highly desirable to support NGS-related research. We have selected iRODS (the integrated Rule-Oriented Data System) [3] as a basis for implementing a sequencing data repository because it allows storing data and metadata together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that cover complementary phases of the research data life cycle in our organization (the Academic Medical Center of the University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS and to manage a triplestore for linked data. The metadata in the iCAT (iRODS' metadata catalogue) and the ontology in Virtuoso are kept synchronized by enforcing strict data manipulation policies. We have implemented a prototype to preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes: Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal investigator, data curator, repository administrator), each with different access rights. New data is ingested by copying raw sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS rule is triggered by the sample sheet file; it extracts the metadata and registers it in the iCAT as AVU (Attribute, Value, Unit) triples. Ontology files are registered in Virtuoso. The sequence files are copied to the persistent collection and are made uniquely identifiable based on metadata. All steps are recorded in a report file that enables monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the researchers' workflow well.
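The metadata-extraction step of the ingestion workflow above can be sketched as parsing a sample sheet into (Attribute, Value, Unit) triples ready for registration in the iCAT. The sample-sheet layout and unit mapping here are hypothetical, and real ingestion runs inside an iRODS rule, not plain Python.

```python
# Sketch: turn a (hypothetical) CSV sample sheet into AVU triples for the iCAT.
import csv
import io

SAMPLE_SHEET = """sample_id,organism,read_length
S001,Homo sapiens,150
S002,Homo sapiens,100
"""

UNITS = {"read_length": "bp"}   # hypothetical attribute-to-unit mapping

def sheet_to_avus(text: str):
    """Yield (attribute, value, unit) triples for every non-key cell."""
    for row in csv.DictReader(io.StringIO(text)):
        sample = row["sample_id"]
        for attr, value in row.items():
            if attr != "sample_id":
                yield (f"{sample}.{attr}", value, UNITS.get(attr, ""))

avus = list(sheet_to_avus(SAMPLE_SHEET))
print(avus[0])   # ('S001.organism', 'Homo sapiens', '')
```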
FAIR Data Management and FAIR Data Sharing (Merce Crosas)
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Dataverse, Cloud Dataverse, and DataTags (Merce Crosas)
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
Applied semantic technology and linked data (William Smith)
Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain. Due to the size of this dataset, a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research and present these new models to an online community. The resulting application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
dkNET is a portal that provides a single point of entry for discovering NIDDK-relevant research resources and data to help researchers make efficient decisions. It allows searching across community databases, literature, and over 200 biomedical databases. Key features include personalized search capabilities, the ability to save searches and set up alerts, and creating collections of search results. The portal aims to interconnect research communities by providing access to large pools of interconnected data and resources.
DataTags, The Tags Toolset, and Dataverse Integration (Michael Bar-Sinai)
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
Are you a researcher, citizen scientist, institution or community looking for data storage and value-added services? Do you want access to tools to make your research data more FAIR (findable, accessible, interoperable, and reusable)? Interested in seeing how the future European Open Science Cloud could support research data and practically foster cross-border, cross-disciplinary collaboration? Then this webinar is for you!
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata - descriptive, processing, administrative, and semantic. Descriptive metadata supports retrieval of information, processing metadata supports processing of information, and administrative metadata supports management of information.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
FAIR data has flown up the hype curve without a clear sense of the return on the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world's data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
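The science-knowledge-graph idea above can be sketched minimally: FAIR-style records become subject-predicate-object triples, and a "novel question" is simply a graph query across them. The entities and predicates below are hypothetical.

```python
# Minimal knowledge-graph sketch: records as triples, questions as graph queries.
# Entities and predicates are hypothetical examples.
TRIPLES = [
    ("dataset:42", "about",      "gene:TP53"),
    ("dataset:42", "producedBy", "lab:A"),
    ("paper:7",    "cites",      "dataset:42"),
    ("gene:TP53",  "linkedTo",   "disease:cancer"),
]

def objects(subject: str, predicate: str):
    """All objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

# Which datasets does paper:7 cite, and what are they about?
cited = objects("paper:7", "cites")
print(cited)                              # ['dataset:42']
print(objects(cited[0], "about"))         # ['gene:TP53']
```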
Urm concept for sharing information inside of communitiesKarel Charvat
The document describes the Uniform Resource Management (URM) concept for sharing information within communities. URM provides a framework for standardized description of information using metadata schemes and controlled vocabularies to improve discovery. It is implemented through various portals and tools that allow users to manage and discover knowledge according to context. Initial implementations included portals for nature, sustainability and rural information in the Czech Republic and Latvia. URM supports collaborative knowledge sharing through interoperable systems based on open standards.
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
This document provides an overview of FAIR data principles and the FAIR data ecosystem. It discusses what FAIR data is, including that FAIR data aims to support communities in publishing and utilizing scientific data and knowledge in a findable, accessible, interoperable, and reusable manner. It then describes the different levels of the FAIR data ecosystem, including normative principles, standards in the FAIR data protocol, FAIR data resources that comply with these standards, and systems/tools that use FAIR data. It provides examples of converting raw data into FAIR data resources and the potential applications of a FAIR data ecosystem.
This document discusses metadata, which is defined in multiple ways such as "data about data" or "information that describes content". It describes the main types of metadata including descriptive, administrative, preservation, technical, and use metadata. Descriptive metadata facilitates discovery, identification, and selection of resources, while administrative metadata manages access and use. Preservation and technical metadata document how digital resources are created and maintained over time.
The Nuclear Receptor Signaling Atlas (NURSA) is partnering with dkNET (NIDDK Information Network) to host a dataset challenge, and we invite you to join! Everyone is talking about Big Data. How can we ensure that the impact of individual scientists working on a myriad of small and focused studies that discover and probe new phenomena - is not lost in the Big Data world. In fact, there is more than one way to generate big data and we would like your help in creating and expanding “big data” for NIDDK! In this 30-minute webinar, dkNET team will give a presentation about the overview of challenge task, how to use dkNET to find research resources, and top tips!
Metadata provides consistency, clarity, and data lineage. It defines structured information about documents and content like author, title, and keywords. There are three main types - descriptive, structural, and administrative. Metadata is used to identify, manage, retrieve, and track usage of content. It provides consistency of definitions between different terminology, clarity of relationships between entities, and clarity of where data originated from and how it has changed over time.
This document provides an overview of metadata standards, including their purpose and types. It describes the MARC 21 and Dublin Core metadata standards in detail. MARC 21 is the predominant bibliographic standard, with formats for bibliographic data, holdings, and authority data. It exists in both MARC 21 and MARCXML syntaxes. Dublin Core is a simpler standard for resource discovery with 15 basic elements. It includes both simple and qualified versions with controlled vocabularies. The document lists several metadata standards and development organizations.
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely
promoted as basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation
sequencing (NGS) still lack adequate RDM solutions. The NGS metadata is generally not stored together with the raw
NGS data, but kept by individual researchers in separate files. This situation complicates RDM practice. Moreover,
the (meta)data does often not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository
is highly desirable to support NGS related research. We have selected iRODS (Rule-Oriented Data management
systems) [3] as a basis for implementing a sequencing data repository because it allows storing both data and metadata
together. iRODS serves as scalable middleware to access different storage facilities in a centralized and virtualized
way, and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that
cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the
University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS to enable the management
of a triplestore for linked data. The metadata in the iCat (iRODS’ metadata catalogue) and the ontology in Virtuoso
are kept synchronized by enforcement of strict data manipulation policies. We have implemented a prototype to
preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes:
Davrods [4] for data and metadata ingestion, data retrieval; Metalnx-web [7] for administration, data curation, and
repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles are defined (principal
investigator, data curator, repository administrator), with different access rights. New data is ingested by copying raw
sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS
rule is triggered by the sample sheet file, which extracts the metadata and registers it to the iCAT as AVU (Attribute,
Value and Unit). Ontology files are registered into Virtuoso. The sequence files are copied to the persistent collection
and are made uniquely identifiable based on metadata. All the steps are recorded into a report file that enables
monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype,
and discuss the first assessment results. Initial results indicate that the proposed solution is acceptable and fits the
researchers workflow well.
FAIR Data Management and FAIR Data SharingMerce Crosas
Presentation at the Critical Perspective on the Practice of Digiral Archeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
Dataverse, Cloud Dataverse, and DataTagsMerce Crosas
Talk given at Two Sigma:
The Dataverse project, developed at Harvard's Institute for Quantitative Social Science since 2006, is a widely used software platform to share and archive data for research. There are currently more than 20 Dataverse repository installations worldwide, with the Harvard Dataverse repository alone hosting more than 60,000 datasets. Dataverse provides incentives to researchers to share their data, giving them credit through data citation and control over terms of use and access. In this talk, I'll discuss the Dataverse project, as well as related projects such as DataTags to share sensitive data and Cloud Dataverse to share Big Data.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
Applied semantic technology and linked dataWilliam Smith
Mapping a human brain generates petabytes of gene listings and the corresponding locations of these genes throughout the human brain. Due to the large dataset a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research, and present these new models to an online community. The resulting application presents a large set of gene to location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
dkNET is a portal that provides a single point of entry for discovering NIDDK-relevant research resources and data to help researchers make efficient decisions. It allows searching across community databases, literature, and over 200 biomedical databases. Key features include personalized search capabilities, the ability to save searches and set up alerts, and creating collections of search results. The portal aims to interconnect research communities by providing access to large pools of interconnected data and resources.
DataTags, The Tags Toolset, and Dataverse IntegrationMichael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Doing for data what PubMed did for literature: DATS, a model for dataset description, dataset indexing and data discovery.
Googleslides [https://goo.gl/cd5KKa] or Slideshare [https://goo.gl/c8DH5N]
Are you a researcher, citizen scientist, institution or community looking for data storage and value-added services? Do you want access to tools to make your research data more FAIR (findable, accessible, interoperable, and reusable)? Interested in seeing how the future European Open Science Cloud could support research data and practically foster cross-border, cross-disciplinary collaboration? Then this webinar is for you!
This document provides an overview of metadata, including:
1) Definitions of metadata from various sources, describing it as data that describes other data or information resources.
2) The main types of metadata: descriptive, processing, administrative, and semantic. Descriptive metadata supports retrieval of information, processing metadata supports processing of information, and administrative metadata supports management of information.
3) How metadata can be created automatically by tools or manually by people. Metadata schemes provide a formal structure to identify a discipline's knowledge and link it to information resources.
The document provides an introduction to PREMIS (Preservation Metadata: Implementation Strategies) and its application in audiovisual archives. It discusses the challenges of digital preservation and the need for preservation metadata to ensure long-term access. It then summarizes the key aspects of PREMIS, including the PREMIS Data Dictionary, its relationship to the OAIS reference model, the five interacting entities in the PREMIS data model, and issues around implementing PREMIS in archives.
FAIR data has flown up the hype curve without a clear sense of return from the required data stewardship investment. The killer use case for FAIR data is a science knowledge graph. It enables you to richly address novel questions of your and the world’s data. We started with data catalogues (findability) which exploited linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability) to make this happen.
This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.
Essentials 4 Data Support: a fine course in FAIR Data SupportEllen Verbakel
The document summarizes the Essentials 4 Data Support (E4DS) course, which teaches people how to support researchers in storing, managing, archiving, and sharing research data according to FAIR principles. The course covers topics like data documentation, identifiers, formats, metadata, and licensing. It is offered online or in a blended format over 6 weeks. The goal is to educate data supporters so that researchers can find, access, interoperate with, and reuse each other's data in a FAIR manner.
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
OpenAIRE Interoperability Workshop (8 Feb. 2013).
DataCite – Bridging the gap and helping to find, access and reuse data – Herbert Gruttemeier, INIST-CNRS
Tools and Techniques for Creating, Maintaining, and Distributing Shareable Me...Jenn Riley
This document discusses tools and techniques for creating, maintaining, and distributing shareable metadata. It emphasizes that metadata should be structured to be understandable outside of local contexts and useful for other institutions. Key aspects of shareable metadata include using appropriate content and vocabularies, ensuring records are coherent, providing useful context, and conforming to standards. The document also outlines example workflows and considerations for making metadata shareable.
FAIR Data Knowledge Graphs–from Theory to PracticeTom Plasterer
This document discusses building FAIR data knowledge graphs from theory to practice. It begins by outlining what R&D researchers want to do with data, such as understanding disease mechanisms and using patient data, but that currently data is fragmented across systems. It then introduces the FAIR data principles and describes building a knowledge graph that incorporates data from multiple sources using standards like the Data Catalog vocabulary. The key challenges discussed are determining canonical representations for entities and linking data to public vocabularies through an enrichment process.
The webinar discussed FAIRDOM services that can help applicants to the ERACoBioTech call with their data management plans and requirements. FAIRDOM offers webinars on developing data management plans, and their platform and tools can help with organizing, storing, sharing, and publishing research data and models in a FAIR manner by utilizing metadata standards. Different levels of support are available, from general community resources through their hub, to premium customized support for individual projects. Consortia can include FAIRDOM as a subcontractor within the guidelines of the ERACoBioTech call.
RDAP 15: Beyond Metadata: Leveraging the “README” to support disciplinary Doc...ASIS&T
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23, 2015
Part of “Beyond metadata: Supporting non-standardized documentation to facilitate data reuse”
This document discusses metadata, which is data that describes other data. It explains that metadata captures information about data content, time, location, how it was collected and why. Good metadata allows data to be discovered, accessed, and understood. Key elements in a metadata record are identified and standards like FGDC and ISO are examined. The value of metadata for data users, providers and organizations is outlined. Tips for writing clear, complete and computer-readable metadata are provided.
FAIRDOM data management support for ERACoBioTech ProposalsFAIRDOM
This document provides information about a webinar from the FAIRDOM Consortium on data management for ERACoBioTech full proposals. It includes:
- Details on how to budget for and include a data management plan in proposals
- A checklist for developing a data management plan covering topics like the types and volumes of data, data sharing and reuse, and making data FAIR
- An overview of the FAIRDOM services and software platform that can help with project data management and stewardship
The document discusses recommendations from a workshop on peer review of research data. It focuses on three key areas:
1. Connecting data review with data management planning by requiring data sharing plans, ensuring adequate funding for data management, and refusing publication without clear data access.
2. Connecting scientific and technical review with data curation by linking articles and data with versioning, avoiding duplicate review efforts, and addressing issues found in data.
3. Connecting data review with article review by requiring methods/software information, providing review checklists, ensuring data access for reviewers, and permanent dataset identifiers from repositories.
This document discusses the agINFRA project's efforts to enhance interoperability between agricultural data sources by developing a linked data framework for germplasm data. The agINFRA Germplasm Working Group aims to identify relevant standards, analyze existing schemas and vocabularies, and propose recommendations for exposing germplasm resources as linked open data. Key outcomes include a dossier of germplasm information and engagement with stakeholders. The proposed methodology involves defining a base schema, publishing local classifications as linked data, and linking data from different sources using common vocabularies. Implementation plans include publishing germplasm vocabularies and phenotypic data in 2014.
NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific DataSusanna-Assunta Sansone
1) The document discusses Susanna-Assunta Sansone's roles and work related to promoting FAIR data standards and practices.
2) It highlights some of her leadership positions with organizations like BioSharing that work to map and promote standards.
3) The document also discusses Scientific Data, a peer-reviewed journal launched by Nature Publishing Group to publish detailed descriptions of scientifically valuable datasets to facilitate reuse.
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteve Androulakis
Dr. McEachern is Director of the Australian Data Archive at the Australian National University, and has research interests in data management and archiving, community and social attitude surveys, new data collection methods, and reproducible research methods.
This talk was given for the Monthly Tech Talks event hosted by Australian data infrastructure groups ANDS, NeCTAR, RDS and others.
(1) The document discusses challenges of managing large and complex datasets for interdisciplinary research projects. It presents Hadoop and the Etosha data catalog as solutions.
(2) Etosha aims to publish and link metadata about datasets to enable discovery and sharing across distributed research clusters. It focuses on descriptive, structural and administrative metadata rather than just technical metadata.
(3) Etosha's architecture includes a distributed metadata service and context browser that can query metadata from different Hadoop clusters to support federated querying and subquery delegation.
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports the findings of a survey of metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized them into 9 categories. The highest counts of elements occurred in the descriptive category, and many of them overlapped with Dublin Core elements. The same pattern appeared among elements that co-occurred in different standards: a small number of semantically general elements appeared across the largest number of standards, while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, and points out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
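The survey method described above — tallying how many standards each element appears in, and separating a small head of general elements from a long tail of specific ones — can be sketched with a simple counter. The standards and element lists below are invented for illustration; the actual paper surveyed 16 standards and 4400+ unique elements.

```python
from collections import Counter

# Illustrative (invented) element lists for a few metadata standards.
standards = {
    "DC":       {"title", "creator", "subject", "date", "identifier"},
    "DCAT":     {"title", "description", "keyword", "identifier", "distribution"},
    "ISO19115": {"title", "abstract", "extent", "lineage", "identifier"},
}

# Count in how many standards each element appears.
co_occurrence = Counter(
    element for elements in standards.values() for element in elements
)

# A small head of semantically general elements spans all standards ...
head = [e for e, n in co_occurrence.items() if n == len(standards)]
# ... while the rest form a long tail of standard-specific semantics.
tail = [e for e, n in co_occurrence.items() if n == 1]

print(sorted(head))  # → ['identifier', 'title']
print(len(tail))     # → 9
```

Even on this toy input the paper's pattern shows up: the shared head is tiny and generic, the tail dominates.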
Similar to eROSA Stakeholder WS1: Data discovery through federated dataset catalogues (20)
"Building Capacities for Open Science" - The example of AGINFRA+ and e-ROSA. Presented during the AGRIRESEARCH Conference, organised by DG AGRI in Brussels.
Community and Governance Recommendations for the Future State of an e-infrast...e-ROSA
This document provides recommendations for developing an e-infrastructure to support open science in agri-food systems. It identifies key societal challenges around feeding the growing population, climate change, unhealthy diets, and environmental pressures. Three major trends are digital agriculture, new genetic techniques, and adopting a systems perspective. Recommendations focus on sharing data and models, connecting diverse data sources through standards, and facilitating collaboration across disciplines and sectors. Specific recommendations include establishing sustainable funding, aligning with the European Open Science Cloud, promoting open innovation, and developing large public-private partnerships for data-driven research. The overarching goal is to support evidence-based policymaking and address challenges through open, international cooperation.
Technical Recommendations for the Future State of an e-infrastructure in Agri...e-ROSA
This document outlines recommendations for the future technical state of an e-infrastructure for agri-food sciences. It describes the past state as isolated research silos, the present as basic shared services but disjoint complex services, and envisions the future as:
- Extending shared horizontal services to include mature technologies from different communities
- Optimizing shared infrastructure usage for each community/task
- Easily customizing horizontal services for specific community needs
- Seamlessly incorporating new services
It recommends:
- Developing large-scale common data/service semantics and standards
- Incorporating infrastructure under a federation layer for optimized usage/sharing
- Making cross-community services available via semantic descriptions to automate their integration
Odile Hologne's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
FACCE JPI agenda on big data and digitization of agriculturee-ROSA
Paul Wiley's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
ICT-AGRI agenda on digitization of agriculturee-ROSA
This document discusses trends in precision farming and an overview of research and innovation activities related to digitizing agriculture. It outlines key trends such as the increasing use of sensors, drones, robotics, and network connectivity in agriculture. It also discusses trends in software including big data, open data standards, apps for farm management, and integrating data along the farm to fork supply chain. The document concludes by noting the growth of startups in this area and opportunities for the ICT-AGRI initiative to contribute to an open agrifood science cloud.
D4Science experience: VREs for increasing the sharing and collaboration in th...e-ROSA
Donatella Castelli's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The state-of-play of the general EOSC policy worke-ROSA
Corina Pascu's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The Vision and the Grand Challenges of the Agri-Food Communitye-ROSA
The document discusses the vision and grand challenges of the agri-food community. It identifies three main trends: adopting a systems perspective, new genetic techniques, and digital agriculture. It outlines the food system challenges of feeding 9 billion people while addressing climate change, unhealthy diets, and planetary boundaries. The food system is divided into three components: smart farming and food security, gene-based approaches, and food safety, nutrition and health. Each component lists societal and scientific expectations as well as obstacles to open science approaches. The overall challenges are interconnectedness and developing inclusive, sustainable solutions through increased sharing, connecting and collaborating across the agri-food community.
Why the food sector needs a research infrastructure on Food and Health Consum...e-ROSA
Bent Egberg Mikkelsen and Karin Zimmermann's presentation at the eROSA Workshop “Towards Open Science in Agriculture & Food”, a side event to High Level conference on FOOD 2030, Plovdiv, Bulgaria (13/6/2018)
The document summarizes a vision for food systems in 2030 presented at an eROSA stakeholder workshop. The vision is for food systems that produce healthy, nutritious foods through efficient and environmentally sustainable methods. These food systems would operate as collaborative networks constantly improving their economic, environmental, and social performance for all actors. The food systems would contribute to achieving sustainability development goals and mitigate/adapt to climate change impacts.
Technical Implementation Agenda for a pan-European Scientific e-infrastructur...e-ROSA
This document outlines a vision for a pan-European e-infrastructure for agri-food research. It describes the current fragmented state of individual research organizations and isolated data silos. The vision is to build common semantic specifications and standards to incorporate physical infrastructure and make cross-community services available via semantically enriched descriptions. This would automate the integration of existing and new services to optimize resource sharing and data integration across communities. The priorities are establishing standards and semantics, designing common horizontal services, and specifying community-specific services to work towards the goal of mission-driven research enabled by a unified e-infrastructure.
E-Infrastructure for open agri-food sciences - The landscapee-ROSA
eROSA has received funding from the European Union to map out the technical ecosystem for open agriculture and food science data. The mapping is based on analyzing various eROSA, RDA, and other project activities and identifies organizations, initiatives, data sources, and research infrastructures. The landscape analysis found that while there is massive data production, data is often siloed and difficult to find or access due to immature practices around data management, sharing, and analysis. Challenges include technical issues like long-term preservation and semantics standardization as well as cultural challenges engaging communities and developing sustainable governance models.
This document summarizes an OpenAIRE stakeholder workshop that took place in Athens on May 21-22, 2018. OpenAIRE supports open science by monitoring research outputs, accelerating interoperability and exchange, and supporting researchers and infrastructure providers through services like an open science helpdesk and research data management support. The workshop discussed OpenAIRE's network of National Open Access Desks, services to support open policies, infrastructure, open research data and open access publications, and efforts to build an open scholarly communication graph and research information system. OpenAIRE also presented services for content providers like the PROVIDE Dashboard for validation, enrichment and usage statistics of metadata.
The document describes the D4Science infrastructure, which provides services and environments to support cross-disciplinary research activities. It offers data discovery, access, processing and publishing services across multiple domains like marine science, social mining, and the humanities. The infrastructure leverages existing resources through federation and APIs, and provides virtual research environments and workspaces in a flexible, scalable manner to support over 5,100 users in 44 countries.
EOSC-Hub - Services for the European Open Science Cloude-ROSA
The document summarizes the objectives and services of EOSC-hub, which is implementing and operating access channels for the European Open Science Cloud (EOSC). EOSC-hub aims to (1) aggregate services from local/national providers and demands from researchers through the EOSC, (2) define engagement rules with EOSCpilot and develop a service framework, and (3) operate and integrate an initial set of baseline, thematic, and federation services. The services support the full research data lifecycle from discovery to reuse. EOSC-hub involves 74 partners from 23 countries and receives €30 million in Horizon 2020 funding over 3 years to develop and advance EOSC.
Grand Challenges and Open Science for the Food Systeme-ROSA
The document discusses open science approaches for addressing challenges in the global food system. It identifies three key components of the food system - smart farming, food security and the environment; gene-based approaches from omics to landscape; and food safety, nutrition and health. For each component, it outlines societal and scientific challenges, as well as obstacles and expectations for developing open science solutions. An example case study on global agricultural monitoring is also provided. The document argues that developing open science for food systems requires efforts to share data and resources, connect through standards and best practices, and enable broader collaboration across disciplines and sectors.
This document summarizes a presentation about the eROSA project, which received Horizon 2020 funding. It discusses eROSA's vision for an open e-science infrastructure for agriculture. Some key points include:
- eROSA aims to provide shared semantics, data discovery services, and sustainable storage through resources like data portals and virtual research environments.
- It compares how organic agriculture aligns with the UN's Sustainable Development Goals around issues like increasing productivity and resilience while reducing environmental impacts.
- The document outlines eROSA's status in implementing facets of openness, interoperability, and reuse within the agricultural domain. It closes with eROSA's vision for collaborative, region-specific food systems by 2030.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are top priorities for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences. (3) They are context-aware, encoding a different set of transformations for different use cases. (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
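The idea of auto-generating compliance-enforcing views from declarative annotations can be sketched as follows. This is a hypothetical toy, not LinkedIn's ViewShift implementation: the annotation names, policies, and SQL functions are assumptions for illustration.

```python
# Hypothetical sketch: render a compliance-enforcing SQL view from
# declarative per-column policy annotations.
ANNOTATIONS = {
    "member_id": "hash",    # pseudonymize identifiers
    "email":     "redact",  # drop direct contact details
    "country":   "pass",    # non-sensitive, expose as-is
}

TRANSFORMS = {
    "hash":   lambda col: f"sha2(cast({col} as string), 256) AS {col}",
    "redact": lambda col: f"NULL AS {col}",
    "pass":   lambda col: col,
}

def compliance_view(table: str, annotations: dict) -> str:
    """Render a CREATE VIEW statement applying each column's policy."""
    select_list = ",\n  ".join(
        TRANSFORMS[policy](col) for col, policy in annotations.items()
    )
    return (f"CREATE VIEW {table}_compliant AS\n"
            f"SELECT\n  {select_list}\nFROM {table};")

print(compliance_view("members", ANNOTATIONS))
```

A catalog layer could then resolve `members` to `members_compliant` transparently; context-awareness would amount to keeping one annotation set per use case.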
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey," our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
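The schema-metaprogramming idea above can be sketched minimally: derive downstream schemas programmatically from the upstream schema, so one upstream change propagates without hand-edited boilerplate in every job. The schema representation and field names here are invented for illustration, not the talk's actual tooling.

```python
# Hedged sketch of schema metaprogramming: downstream schemas are
# computed from the upstream one instead of being copied by hand.
upstream = {
    "user_id": "string",
    "ts": "timestamp",
    "purchase_amount": "double",
}

def derive(schema: dict, drop=(), add=None) -> dict:
    """Build a downstream schema from an upstream one by transformation."""
    out = {k: v for k, v in schema.items() if k not in drop}
    out.update(add or {})
    return out

# Downstream daily-aggregation job: drops the raw timestamp, adds a day bucket.
daily_schema = derive(upstream, drop=("ts",), add={"day": "date"})
print(sorted(daily_schema))  # → ['day', 'purchase_amount', 'user_id']
```

Adding a field upstream now flows into every derived schema automatically, while the explicit `drop`/`add` arguments keep the static-typing-style protection the talk contrasts with schema-on-read.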
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
eROSA Stakeholder WS1: Data discovery through federated dataset catalogues
1. Data discovery through federated dataset catalogs
Valeria Pesce
Secretariat of the Global Forum on Agricultural Research (GFAR)
Secretariat of the Global Open Data for Agriculture and Nutrition (GODAN) initiative
eROSA workshop, Montpellier, 6-7 July 2017
2. 1. Dataset discovery: how
• Data are in datasets, stored in some dataset repository
• Datasets can be made searchable through a dataset catalog
• Many institutional catalogs / geographically-scoped catalogs / thematic catalogs
• How many catalogs do I have to search?
>> General meta-catalogs? Different targeted catalogs?
>> Federated metadata catalogs / secondary catalogs
IDEAL
Good dataset metadata at the level of the local repository / catalog
Open, interoperable dataset metadata at the level of the primary repository / catalog
>>> Linked Data federated search engines; LOD-enabled primary catalog
(heavy requirements for the local primary catalog)
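The federation idea above can be sketched in a few lines: a secondary catalog harvests machine-readable metadata records from several primary catalogs and merges them into one searchable index. All catalog contents and field names below are hypothetical, invented for illustration.

```python
# Minimal sketch of a secondary (federated) catalog: merge metadata
# records harvested from several primary catalogs, then answer a single
# query across all of them. Records and field names are hypothetical.

primary_catalog_a = [
    {"title": "Wheat yield trials 2015", "theme": "crops"},
    {"title": "Soil moisture sensor readings", "theme": "soils"},
]
primary_catalog_b = [
    {"title": "Livestock census", "theme": "livestock"},
    {"title": "Crop calendar for maize", "theme": "crops"},
]

def federate(*catalogs):
    """Merge records from primary catalogs, tagging each with its source."""
    merged = []
    for name, catalog in catalogs:
        for record in catalog:
            merged.append({**record, "source": name})
    return merged

def search(index, theme):
    """One query across all federated records."""
    return [r["title"] for r in index if r["theme"] == theme]

index = federate(("catalog_a", primary_catalog_a),
                 ("catalog_b", primary_catalog_b))
print(search(index, "crops"))
# -> ['Wheat yield trials 2015', 'Crop calendar for maize']
```

The point of the sketch is the user-facing gain: one query instead of one per catalog, provided the primary catalogs expose their metadata in a harvestable form.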
3. 1. Dataset discovery: good metadata (1)
1. General metadata about the dataset “resource”:
a) identifier(s)
b) who is responsible for it
c) when and where the data were collected
d) relations to organizations, persons, publications, software, projects, funding…
e) the conditions for re-use (rights, licenses)
f) provenance, versions
g) the specific coverage of the dataset (type of data, thematic coverage, geographic
coverage)
Normally covered by generic vocabularies like Dublin Core or DCAT
IDEAL
Let’s look at existing good practices and standards
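As a concrete illustration of points (a)–(g), here is a minimal DCAT / Dublin Core-style dataset record serialized as JSON-LD with the Python standard library. The property names follow the DCAT and Dublin Core namespaces; the dataset values themselves are invented placeholders.

```python
import json

# A minimal DCAT/Dublin Core-style dataset description as JSON-LD.
# Property names follow DCAT and DCTERMS; all values are invented.
record = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:identifier": "example-dataset-001",            # a) identifier
    "dct:publisher": "Example Research Institute",      # b) responsible party
    "dct:temporal": "2015-01/2016-12",                  # c) when collected
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",  # e) re-use
    "dct:spatial": "Kenya",                             # g) geographic coverage
    "dcat:theme": "crop production",                    # g) thematic coverage
}

print(json.dumps(record, indent=2))
```

Because the record is plain JSON-LD, any harvester that understands the DCAT context can interpret it without catalog-specific code.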
4. 1. Dataset discovery: good metadata (2)
2. Metadata about the data structure!
a) The variables: the observed “dimensions” (e.g. time, geographic region, gender, elevation…) and the measured / observed phenomenon (e.g. life expectancy)
b) The specification of the dimensions (units of measure, time granularity, syntax, any scaling factors, and metadata such as the status of the observation, reference taxonomies…)
c) Possible time and space slices; subsets
Not always considered in generic dataset metadata vocabularies (DCAT), but traditionally included in research datasets (e.g. in formats like NetCDF) and covered by DataCube
IDEAL
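A DataCube-style structure description for dimensions and measures like those above could be sketched as follows. The shape loosely mirrors the RDF Data Cube (qb:) idea of a structure definition; the dataset, its dimensions and the observation values are all invented for illustration.

```python
# Sketch of a Data Cube-style structure definition: dimensions with their
# granularity/code lists, plus the measured phenomenon. Invented example.

structure = {
    "dimensions": [
        {"name": "refPeriod", "granularity": "year"},
        {"name": "refArea", "codeList": "ISO 3166 country codes"},
        {"name": "sex", "codeList": ["M", "F", "T"]},
    ],
    "measures": [
        {"name": "lifeExpectancy", "unit": "years"},
    ],
}

def validate_observation(obs, structure):
    """An observation is valid only if it carries every dimension and measure."""
    required = {d["name"] for d in structure["dimensions"]}
    required |= {m["name"] for m in structure["measures"]}
    return required <= set(obs)

obs = {"refPeriod": 2015, "refArea": "KE", "sex": "T", "lifeExpectancy": 66.0}
print(validate_observation(obs, structure))  # -> True
```

With such structure metadata exposed, a client can know what the columns of a dataset mean before downloading it, which is exactly what generic vocabularies like plain DCAT omit.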
5. 1. Dataset discovery: good metadata (3)
3. Metadata about the actual “serializations” or “distributions” of the dataset
1. Where to retrieve the dataset: URL (data dump, service…)
2. The necessary technical specifications to retrieve and parse a distribution of the dataset:
- format (file format, data format), vocabularies / data dictionaries
- protocol, API parameters…
Not always considered in generic dataset metadata vocabularies: DCAT covers data dumps and formats, VoID some services
IDEAL
Data will be processed by tools! Data formats and access protocols are important.
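Distribution metadata is what lets a tool pick a retrievable, parseable form of the dataset automatically. A minimal sketch, with hypothetical URLs, formats, and API parameters:

```python
# Sketch: a dataset with several "distributions" (serializations), each
# carrying the media type and access URL a client tool needs.
# All URLs and API parameters are hypothetical.

distributions = [
    {"mediaType": "text/csv",
     "accessURL": "https://example.org/data.csv"},
    {"mediaType": "application/json",
     "accessURL": "https://example.org/data.json",
     "api": {"protocol": "REST", "params": ["year", "region"]}},
]

def pick_distribution(dists, preferred):
    """Return the first distribution matching a preference-ordered
    list of media types, or None if nothing matches."""
    for media_type in preferred:
        for d in dists:
            if d["mediaType"] == media_type:
                return d
    return None

chosen = pick_distribution(distributions, ["application/json", "text/csv"])
print(chosen["accessURL"])  # -> https://example.org/data.json
```

Without the `mediaType` field the client would have to download and sniff each URL; with it, format negotiation is one dictionary lookup.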
6. 1. Dataset discovery: interoperable metadata
Secondary catalogs have to be able to retrieve metadata from the dataset catalog
IDEAL
Ideally, secondary catalogs would be able to retrieve only subsets of the
catalog (by type of data, by data format, by phenomenon observed?)
Data service / API with filtering parameters: catalogs as DaaS (Data-as-a-Service)
• All discovery-relevant metadata are exposed in machine-readable form
• Exposed metadata use shared semantics
• Standardization of the values, e.g. for “thematic coverage” or “dimensions” of
datasets, “format” or “protocol used” of distributions etc.
• The value should be standardized, possibly a URI
• The value should be part of an authority list / code list
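The “catalog as a data service” idea above — a secondary catalog retrieving only the subset it needs — reduces to a filtering endpoint over machine-readable records. A sketch, with hypothetical records and field names:

```python
# Sketch: a primary catalog exposes its records through a filtering
# query, so a secondary catalog can harvest only the subset it needs
# (by theme, by format, ...). Records and field names are hypothetical.

catalog = [
    {"title": "Rainfall grid", "theme": "climate", "format": "NetCDF"},
    {"title": "Market prices", "theme": "economics", "format": "CSV"},
    {"title": "Crop yields", "theme": "crops", "format": "CSV"},
]

def query(records, **filters):
    """Return records matching every filter, e.g. query(catalog, format='CSV')."""
    return [r for r in records
            if all(r.get(k) == v for k, v in filters.items())]

print([r["title"] for r in query(catalog, format="CSV")])
# -> ['Market prices', 'Crop yields']
```

Note that the filters only work because the values are standardized: if one record said `"csv"` and another `"Comma-separated"`, the subset retrieval above would silently miss records — which is exactly why the slide insists on shared, ideally URI-based, value vocabularies.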
7. 1. Dataset discovery: ideal architecture
Conclusions
• Dataset metadata ideally created by authors / curator at the local level,
catalog associated with repository
• High-quality metadata in catalogs allowing for answers to all possible queries
• Ownership, rights, temporal, spatial, thematic, data structure, access…
• Machine-readable metadata; agreed vocabularies; shared semantics; APIs for
querying
• General or specialized secondary catalogs federate metadata from primary
catalogs; multiply discoverability and cater for different audiences
• Secondary catalogs also expose good metadata and APIs
• There’s an inventory / registry of dataset repositories and all types of catalogs
IDEAL
8. 2. Dataset discovery: current situation
in Agriculture (1)
• Institutional data repositories are picking up (need for an inventory!)
CURRENT
• Use of standardized or semi-standardized data repository tools with
cataloguing functionalities and APIs is picking up (Dataverse, CKAN…)
• Some governmental metadata catalogs exist, often using standardized tools
(CKAN) and standard vocabularies (DCAT), that include agricultural datasets
• Some international data catalogs exist that include agricultural datasets
(re3data, OpenAIRE, DataHub…)
• Research-oriented data services also exist, like OPeNDAP or Unidata THREDDS
• Some secondary federated catalogs exist (? Need for an inventory!)
• General one for agriculture (usable as an inventory): the CIARD RING
9. 2. Dataset discovery: current situation
Example of CIARD RING secondary catalog
• Architecture:
• Datasets can be hosted anywhere, the RING only hosts the metadata
• Optionally, datasets can be uploaded and the RING can act as a subsidiary repository
• Datasets (metadata) can be federated from other catalogs
• It uses the dataset / distribution DCAT model
• Metadata quality:
• it uses a combination of the DCAT model + the VoID vocabulary and the DataCube
vocabulary + some extra properties (? a “RING DCAT profile” will be published)
• Shared semantics:
• it has a Linked Data layer, URIs for all entities; all categories are published as SKOS
concepts in SKOS concept schemes and are mapped to external concepts whenever
possible
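The shared-semantics layer described above amounts to mapping local category strings to shared concept URIs (SKOS-style), so that queries line up across catalogs. A sketch — all URIs here are hypothetical placeholders, not real AGROVOC or RING concepts:

```python
# Sketch: map local free-text category values to shared concept URIs,
# so records from different catalogs become comparable. The mapping
# table and all URIs are hypothetical placeholders.

concept_map = {
    "maize": "https://example.org/concepts/maize",
    "corn": "https://example.org/concepts/maize",   # synonym -> same concept
    "wheat": "https://example.org/concepts/wheat",
}

def normalize_theme(record):
    """Attach the shared concept URI for a local theme string, if mapped."""
    uri = concept_map.get(record["theme"].lower())
    return {**record, "themeURI": uri} if uri else record

a = normalize_theme({"title": "Corn yields", "theme": "Corn"})
b = normalize_theme({"title": "Maize trials", "theme": "maize"})
print(a["themeURI"] == b["themeURI"])  # -> True: same concept, different labels
```

This is the payoff of publishing categories as SKOS concepts with mappings: two catalogs that label the same topic differently still answer the same federated query.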
10. 2. Dataset discovery: current situation
in Agriculture (2)
• Metadata quality of most used primary data catalog tools is not high
• E.g. no metadata about data structure, no shared semantics for data types, topics,
formats, standards used
CURRENT
>> poor discovery services in secondary catalogs
• Metadata interoperability of most used primary data catalog tools is not high
• No full compliance with broadly recognized vocabularies (DCAT, DataCube…)
• No functionality to apply shared semantics for categorizations like topics, data types,
formats, dimensions (sometimes keywords from AGROVOC)
• Data not always accessible through the exposed metadata
• Scarce population >> lack of reputation / authority of secondary catalogs >>
lack of motivation to share
11. 3. Dataset discovery: infrastructural improvements (1)
Quality depends on the metadata coming from the primary repository /
catalog… how can a good infrastructure overcome this problem?
• Advocacy for better / improved tools?
• Promote improvement of existing tools?
• Dataset repository / catalog platforms in the cloud?
• Complementary / subsidiary role of secondary catalogs?
• Allow subsidiary use of secondary catalogs as primary catalog and even repository
for some datasets (small institutions, individuals)
• Cater for the improvement of metadata directly in the secondary catalogs
• Incentives to provide good metadata?
• E.g. offer mechanisms to a) measure reuse; b) enforce respect of usage rights.
12. 3. Dataset discovery: infrastructural improvements (2)
• Good agreed metadata standards and reference value vocabularies
• Combine existing standards (DCAT, DataCube, VoID…) in an application profile?
• Provide a reference framework of agreed value vocabularies with URIs?
Mapping from local values to agreed ones?
>> AgriSemantics – GACS, VEST/AgroPortal
• Avoid too much interdependence. Design a loosely coupled
infrastructure. (How?)
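An application profile combining DCAT, DataCube and VoID essentially fixes which properties a conforming record must carry. A minimal validation sketch — the required-property list is illustrative, not any published profile:

```python
# Sketch: validate a metadata record against a hypothetical application
# profile drawing properties from DCAT/DCTERMS, the RDF Data Cube
# vocabulary (qb:) and VoID. The required list is illustrative only.

PROFILE_REQUIRED = [
    "dct:identifier", "dct:license",   # general metadata (DCAT / Dublin Core)
    "qb:structure",                    # data structure (RDF Data Cube)
    "void:dataDump",                   # retrievable distribution (VoID)
]

def missing_properties(record, required=PROFILE_REQUIRED):
    """List the required profile properties the record does not carry."""
    return [p for p in required if p not in record]

record = {"dct:identifier": "ds-001",
          "dct:license": "CC-BY-4.0",
          "void:dataDump": "https://example.org/ds-001.nt"}
print(missing_properties(record))  # -> ['qb:structure']
```

A secondary catalog could run such a check at harvest time and report the gaps back to the primary catalog, turning the profile into a concrete quality feedback loop rather than just a document.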
13. Key questions
• Is it our task to aim at having better machine-readable metadata at the level
of the primary local repository / catalog?
• How can we influence this? Advocate for including metadata in researchers’ tools?
• Do we want to “drive” secondary catalogs or let them bloom? Or both?
• At least a global one for food&ag? How many? Who decides? Who manages them?
• How can other infrastructural components facilitate good catalogs?
• Subsidiary metadata in secondary catalogs? Good dataset catalog tools in the cloud?
• Good agreed metadata standards and reference value vocabularies?
• Mapping with local values
• How to design for resilience of the system? Loosely coupled components?
• How much of this is specific to food&ag and which aspects should be tackled
in a broader context? (EOSC?)
14. Data discovery
through dataset repositories
and catalogs
Thank you for your attention
eROSA workshop, Montpellier, 6-7 July 2017
15. Some recommendations from EC High Level
Expert Group on EOSC (1)
“An Internet of data and services where containers with software
applications are routed to relevant data and vice versa” (B. Mons)
- Develop and sustain core data assets for the EOSC and make them
available to the community under well-defined conditions. These may
include workflows, analytics programmes and notably existing datasets
with FAIR status (including metadata creation)
- Support the development of one or more publicly available data search
engine(s) that find FAIR metadata across trusted EOSC repositories
- Develop technologies and approaches to meaningfully measure re-use
and scientific impact of Research Objects after their initial publication
(e.g. metrics that matter and get recognised)
16. Some recommendations from EC High Level
Expert Group on EOSC (2)
- Start dedicated efforts to prepare data and research objects for
inclusion in the EOSC
- Combine single sign-on issues with the connection of social and
professional people oriented web applications resulting in a federated
identity and credentials for all people in the EOSC
- A repository of research vocabularies and a software application to
support wider access, reuse and development of vocabularies thereby
enhancing interoperability