Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data discovery through federated dataset catalogs

152 views

Published on

Presented at the eROSA workshop, Montpellier, 6-7 July 2017.

Published in: Data & Analytics
  • Be the first to comment

Data discovery through federated dataset catalogs

  1. 1. Data discovery through federated dataset catalogs Valeria Pesce Secretariat of the Global Forum on Agricultural Research (GFAR) Secretariat of the Global Open Data for Agriculture and Nutrition (GODAN) initiative eROSA workshop, Montpellier, 6-7 July 2017
  2. 2. • Many institutional catalogs / geographically-scoped catalogs / thematic catalogs • How many catalogs do I have to search? >> General meta-catalogs? Different targeted catalogs? >> Federated metadata catalogs / Secondary catalogs 1. Dataset discovery: how • Data are in datasets, stored in some dataset repository • Datasets can be made searchable through a dataset catalog Good dataset metadata at the level of the local repository / catalog Open interoperable dataset metadata at the level of the primary repository / catalog IDEAL >>> Linked Data federated search engines LOD-enabled primary catalog Heavy requirements for the local primary catalog
  3. 3. 1. Dataset discovery: good metadata (1) 1. General metadata about the dataset “resource”: a) identifier(s) b) who is responsible for it c) when and where the data were collected d) relations to organizations, persons, publications, software, projects, funding… e) the conditions for re-use (rights, licenses) f) provenance, versions g) the specific coverage of the dataset (type of data, thematic coverage, geographic coverage) Normally covered by generic vocabularies like Dublin Core or DCAT IDEAL Let’s look at existing good practices and standards
  4. 4. 1. Dataset discovery: good metadata (2) a) The variables: the observed “dimensions” (e.g. time, geographic region, gender, elevation…) and the measured / observed phenomenon (e.g. life expectancy) b) The specification of the dimensions (units of measure, time granularity, syntax, any scaling factors and metadata such as the status of the observation, reference taxonomies…) c) Possible time and space slices; subsets Not always considered in generic dataset metadata vocabularies (DCAT) but traditionally included in research datasets (e.g. in formats like NetCDF) and covered by DataCube IDEAL 2. Metadata about the data structure!
  5. 5. 1. Dataset discovery: good metadata (3) 1. Where to retrieve the dataset: URL (data dump, service…) 2. The necessary technical specifications to retrieve and parse a distribution of the dataset: - format (file format, data format), vocabularies / data dictionaries - protocol, API parameters… Not always considered in generic dataset metadata vocabularies: DCAT covers data dump and format, VOID some services IDEAL 3. Metadata about the actual “serializations” or “distributions” of the dataset. Data will be processed by tools! Data formats and access protocols are important.
  6. 6. 1. Dataset discovery: interoperable metadata Secondary catalogs have to be able to retrieve metadata from the dataset catalog IDEAL Ideally, secondary catalogs would be able to retrieve only subsets of the catalog (by type of data, by data format, by phenomenon observed?) Data service / API with filtering parameters Catalogs as DAAS - Data-as-a-Service • All discovery-relevant metadata are exposed in machine-readable form • Exposed metadata use shared semantics • Standardization of the values, e.g. for “thematic coverage” or “dimensions” of datasets, “format” or “protocol used” of distributions etc. • The value should be standardized, possibly a URI • The value should be part of an authority list / code list
  7. 7. 1. Dataset discovery: ideal architecture Conclusions • Dataset metadata ideally created by authors / curators at the local level, catalog associated with repository • High-quality metadata in catalogs allowing for answers to all possible queries • Ownership, rights, temporal, spatial, thematic, data structure, access… • Machine-readable metadata; agreed vocabularies; shared semantics; APIs for querying • General or specialized secondary catalogs federate metadata from primary catalogs; multiply discoverability and cater for different audiences • Also secondary catalogs expose good metadata and APIs • There’s an inventory / registry of dataset repositories and all types of catalogs IDEAL
  8. 8. 2. Dataset discovery: current situation in Agriculture (1) • Institutional data repositories are picking up (need for an inventory!) CURRENT • Use of standardized or semi-standardized data repository tools with cataloguing functionalities and APIs is picking up (Dataverse, CKAN…) • Some governmental metadata catalogs exist, often using standardized tools (CKAN) and standard vocabularies (DCAT), that include agricultural datasets • Some international data catalogs exist that include agricultural datasets (re3data, OpenAIRE, DataHub…) • Also research-oriented data services like OpenDAP or Unidata THREDDS • Some secondary federated catalogs exist (? Need for an inventory!) • General one for agriculture (usable as an inventory): the CIARD RING
  9. 9. 2. Dataset discovery: current situation Example of CIARD RING secondary catalog • Architecture: • Datasets can be hosted anywhere, the RING only hosts the metadata • Optionally, datasets can be uploaded and the RING can act as a subsidiary repository • Datasets (metadata) can be federated from other catalogs • It uses the dataset / distribution DCAT model • Metadata quality: • it uses a combination of the DCAT model + the VOID vocabulary and the DataCube vocabulary + some extra properties (? a “RING DCAT profile” will be published) • Shared semantics: • it has a Linked Data layer, URIs for all entities; all categories are published as SKOS concepts in SKOS concept schemes and are mapped to external concepts whenever possible
  10. 10. 2. Dataset discovery: current situation in Agriculture (2) • Metadata quality of most used primary data catalog tools is not high • E.g. no metadata about data structure, no shared semantics for data types, topics, formats, standards used CURRENT >> poor discovery services in secondary catalogs • Metadata interoperability of most used primary data catalog tools is not high • No full compliance with broadly recognized vocabularies (DCAT, DataCube…) • No functionality to apply shared semantics for categorizations like topics, data types, formats, dimensions (sometimes keywords from AGROVOC) • Data not always accessible through exposed metadataNo full compliance with broadly recognized vocabularies (DCAT, DataCube…) • Scarce population >> lack of reputation / authority of secondary catalogs.> lack of motivation to share
  11. 11. 3. Dataset discovery: infrastructural improvements (1) Quality depends on the metadata coming from the primary repository / catalog… how can a good infrastructure overcome this problem? • Advocacy for better / improved tools? • Promote improvement of existing tools? • Dataset repository / catalog platforms in the cloud? • Complementary / subsidiary role of secondary catalogs? • Allow subsidiary use of secondary catalogs as primary catalog and even repository for some datasets (small institutions, individuals) • Cater or the improvement of metadata directly in the secondary catalogs • Incentives to provide good metadata? • E.g. offer mechanisms to a) measure reuse; b) enforce respect of usage rights.
  12. 12. • Good agreed metadata standards and reference value vocabularies • Combine existing standards (DCAT, DataCube, VOID…) in an application profile? • Provide a reference framework of agreed value vocabularies with URIs? Mapping from local values to agreed ones? >> AgriSemantics – GACS, VEST/AgroPortal • Avoid too much interdependence. Design a loosely coupled infrastructure. (How?) 3. Dataset discovery: infrastructural improvements (2)
  13. 13. Key questions • Is it our task to aim at having better machine-readable metadata at the level of the primary local repository / catalog? • How can we influence this? Advocate for including metadata in researcher’s tools? • Do we want to “drive” secondary catalogs or let them bloom? Or both? • At least a global one for food&ag? How many? Who decides? Who manages them? • How can other infrastructural components facilitate good catalogs? • Subsidiary metadata in secondary catalogs? Good dataset catalog tools in the cloud? • Good agreed metadata standards and reference value vocabularies? • Mapping with local values • How to design for resilience of the system? Loosely coupled components? • How much of this is specific to food&ag and which aspects should be tackled in a broader context? (EOSC?)
  14. 14. Data discovery through dataset repositories and catalogs Thank you for your attention eROSA workshop, Montpellier, 6-7 July 2017

×