
Dataset description: DCAT and other vocabularies

Presented at the DRTC/ISI-ICSU/CODATA International Workshop on Open Data Repositories, 1 March 2017

Published in: Data & Analytics

  1. Dataset description: DCAT and other vocabularies
     Valeria Pesce, Secretariat of the Global Forum on Agricultural Research (GFAR) and Secretariat of the Global Open Data for Agriculture and Nutrition (GODAN) initiative
     DRTC/ISI-ICSU/CODATA International Workshop on Open Data Repositories
  2. Data and datasets
     • Data (Wikipedia): “Data are values of qualitative or quantitative variables, belonging to a set of items. Data as an abstract concept can be viewed as the lowest level of abstraction from which information and then knowledge are derived.”
     • Dataset (Wikipedia): “A dataset (or data set) is a collection of data.” Narrow definition: “…the contents of a single database table, or a single statistical data matrix, where each column of the table represents a particular variable, and each row corresponds to a given member of the dataset in question. […] Nontabular datasets can take the form of marked up strings of characters, such as an XML file.”
     • Dataset (W3C Government Linked Data Working Group, DCAT vocabulary, http://www.w3.org/TR/vocab-dcat/#class--dataset): “A collection of data, published or curated by a single source, and available for access or download in one or more formats.”
  3. Single dataset
     • If you consider a dataset on its own, you may not need structured metadata about it for further re-use, provided the dataset uses sector standards and is documented in some way
     • When managing a single dataset, or a few homogeneous datasets shared between colleagues, it may be enough to use sector-specific standards (e.g. Multi-crop descriptors and Darwin Core for germplasm, INSPIRE for soil/geo, “Minimum Set” recommendations for observations, sector code lists / vocabularies) or application-specific standards
  4. Datasets in repositories
     But what happens when you have a big data repository with heterogeneous datasets? Or when you have only a few datasets but want to contribute them to a huge data repository/catalog that holds many other datasets?
     • Users will want to find your dataset among many others, possibly together with datasets with similar data using the same standards / measures / syntax
     • They will use tools to search for datasets
     Then dataset metadata (a machine-readable dataset description) becomes important.
  5. Why machine-readable descriptions
     • Data will be re-used by applications: others will search and make use of your data through tools
       - Datasets have to be found by applications
       - Datasets have to be understood by applications
     • Datasets should be managed in data repositories / data catalogs
       - Data catalogs have to provide enough dataset metadata to applications to allow them to find and understand datasets
     • Data catalogs themselves are implemented as applications, so they need machine-readable dataset metadata
  6. [Diagram: levels at which data are organized]
     • Datum: a single value
     • Record (“member”, observation): a row of values, identified by a Ref / ID / URI and described along several dimensions
     • Dataset: a collection of records
     • File system / data repository: where datasets are stored
     • Data catalog (also a dataset): one record per dataset, holding the dataset reference, its address and its metadata (Metadata1, Metadata2, Metadata3, data type), with search and export functions
     The diagram is tabular only for the sake of simplification; the data could be triples or other data structures.
     We only focus on the data catalog level.
  7. Dataset metadata
     So, what metadata do applications need to find in data catalogs?
  8. Dataset metadata for applications
     General metadata about the dataset:
     1) identifier(s)
     2) who is responsible for it
     3) when and where the data were collected
     4) relations to organizations, persons, publications, software, projects, funding…
     5) the conditions for re-use (rights, licenses)
     6) provenance, versions
     7) the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)
     8) time and space slices; subsets
     9) the “dimensions” and “variables” covered by the dataset
     10) the semantics of the dimensions (units of measure, time granularity, syntax, reference taxonomies)
  9. Why dimensions and semantics
     Example of a search by a researcher: “I’m doing research on plant phenotypes: give me all datasets of crop phenotypic data that include the dimensions of time, geographic location and height, plus the units of measure used for time and height, where geographic location is expressed as coordinates (because my software only processes coordinates).”
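The researcher's query on slide 9 could be answered by a catalog roughly as sketched below. This is only an illustration under stated assumptions: the `Dimension` and `DatasetRecord` classes, the `find_datasets` function and the three sample records are hypothetical and do not come from DCAT or from any catalog tool. The sketch shows why the catalog must record, per dataset, which dimensions exist, their units and how their values are encoded.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    """One dimension of a dataset (hypothetical model, not a DCAT class)."""
    name: str            # e.g. "time", "height", "location"
    unit: str = ""       # unit of measure, empty if none is declared
    encoding: str = ""   # how values are expressed, e.g. "coordinates"


@dataclass
class DatasetRecord:
    """A catalog entry describing one dataset."""
    identifier: str
    data_type: str
    dimensions: list = field(default_factory=list)

    def dimension(self, name):
        """Return the dimension with the given name, or None."""
        return next((d for d in self.dimensions if d.name == name), None)


def find_datasets(catalog, data_type, required, units_for, encodings):
    """Return identifiers of records that satisfy every part of the query."""
    hits = []
    for record in catalog:
        if record.data_type != data_type:
            continue
        if any(record.dimension(name) is None for name in required):
            continue  # a required dimension is missing
        if any(not getattr(record.dimension(n), "unit", "") for n in units_for):
            continue  # a unit of measure is not declared
        if any(getattr(record.dimension(n), "encoding", None) != enc
               for n, enc in encodings.items()):
            continue  # e.g. location given as place names, not coordinates
        hits.append(record.identifier)
    return hits


# Illustrative catalog content (invented for this sketch).
catalog = [
    DatasetRecord("ds-1", "crop phenotypic data", [
        Dimension("time", unit="day"),
        Dimension("location", encoding="coordinates"),
        Dimension("height", unit="cm"),
    ]),
    DatasetRecord("ds-2", "crop phenotypic data", [
        Dimension("time", unit="day"),
        Dimension("location", encoding="place name"),
        Dimension("height", unit="cm"),
    ]),
    DatasetRecord("ds-3", "crop phenotypic data", [
        Dimension("time", unit="day"),
        Dimension("height", unit="cm"),
    ]),
]

matches = find_datasets(
    catalog,
    data_type="crop phenotypic data",
    required=["time", "location", "height"],
    units_for=["time", "height"],
    encodings={"location": "coordinates"},
)
print(matches)
```

Only the first record qualifies: the second expresses location as place names and the third has no location dimension at all. Without dimension-level metadata in the catalog, none of these checks could be automated.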
  10. Data serializations
      • The metadata above refer to the collection as a whole, but additional technical metadata refer to the different “serializations” of the data
      • In many dataset description models, the metadata about the data serialization is attached to the dataset
      • In other dataset description models, information about the data serializations is not considered inherent to the nature and content of the data collection, so it is not attached to the dataset but rather to related entities called “distributions”
  11. Distribution metadata for applications
      Applications have to find metadata about the actual “serializations” or “distributions” of the dataset to understand:
      1. where to retrieve it: URL (data dump, service…)
      2. the technical specifications needed to retrieve and parse a distribution of the dataset:
         - format (file format, data format)
         - protocol, API parameters…
      And, if they differ between distributions, again:
      3. the conditions for re-use (rights, licenses)
      4. the semantics of the dimensions (units of measure, time granularity, syntax, reference taxonomies)
  12. Why protocols and API parameters
      Example: some datasets are available behind a service. For instance, RDF datasets are often retrieved as subsets of an RDF store through SPARQL queries, and national research institutes provide access to datasets behind a web service that accepts parameters to filter the datasets.
      Use case: I have an application that can fetch from several SOAP web services (protocol) at once, automatically, if it knows the parameters required by each SOAP service and the required syntax and type of those parameters.
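A minimal sketch of the use case on slide 12, assuming the catalog exposes the protocol and parameter list of each distribution. The `Parameter` and `Distribution` classes, the helper functions and the `example.org` URL are hypothetical, introduced only for illustration; no vocabulary discussed here defines them in this form.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    """A parameter accepted by a data service (hypothetical model)."""
    name: str
    type: str             # expected syntax/type, e.g. "string", "gYear"
    required: bool = True


@dataclass
class Distribution:
    """One way of accessing a dataset."""
    access_url: str
    format: str                       # e.g. "CSV", "XML"
    protocol: str = "HTTP download"   # e.g. "SOAP", "SPARQL"
    parameters: list = field(default_factory=list)


def missing_parameters(distribution, supplied):
    """Names of required parameters the client has no value for."""
    return [p.name for p in distribution.parameters
            if p.required and p.name not in supplied]


def can_fetch(distribution, supported_protocols, supplied):
    """True if the client speaks the protocol and can fill every required parameter."""
    return (distribution.protocol in supported_protocols
            and not missing_parameters(distribution, supplied))


# Illustrative SOAP distribution (URL and parameters are invented).
soap = Distribution(
    access_url="http://example.org/soap/soil",
    format="XML",
    protocol="SOAP",
    parameters=[
        Parameter("country", "string"),
        Parameter("year", "gYear"),
        Parameter("lang", "string", required=False),
    ],
)

print(can_fetch(soap, {"SOAP"}, {"country": "IT", "year": "1990"}))
print(missing_parameters(soap, {"country": "IT"}))
```

The point of the sketch is that the decision is made from metadata alone, before any request is sent, which is what lets one application work across several services.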
  13. General issue with all metadata
      Standardization of the values, e.g. for “thematic coverage” or “dimensions” of datasets, or “format” or “protocol used” of distributions:
      - the value should be standardized, possibly a URI
      - the value should be part of an authority list / code list
      And… there is no authoritative “value vocabulary” or code list for many of these values.
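The standardization issue on slide 13 can be made concrete with a small sketch. It assumes a code list that maps agreed labels to URIs; the `FORMAT_CODE_LIST` and `ALIASES` tables and every `example.org` URI are invented for illustration and are not an existing authority list.

```python
# Hypothetical code list: agreed label -> URI (invented values).
FORMAT_CODE_LIST = {
    "csv": "http://example.org/formats/csv",
    "rdf/xml": "http://example.org/formats/rdf-xml",
    "netcdf": "http://example.org/formats/netcdf",
}

# Free-text variants that publishers might type, mapped to agreed labels.
ALIASES = {
    "text/csv": "csv",
    "comma-separated values": "csv",
    "rdf+xml": "rdf/xml",
}


def standardize(value, code_list, aliases=None):
    """Return the URI for a free-text value, or None if it is not in the code list."""
    key = value.strip().lower()
    key = (aliases or {}).get(key, key)
    return code_list.get(key)


print(standardize(" CSV ", FORMAT_CODE_LIST, ALIASES))
print(standardize("Comma-separated values", FORMAT_CODE_LIST, ALIASES))
print(standardize("Excel", FORMAT_CODE_LIST, ALIASES))
```

Two different spellings resolve to the same URI, so applications can compare them reliably, while a value outside the list returns `None` and can be flagged for review. The sketch only works because a code list exists; where none does, the mapping cannot be built.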
  14. No out-of-the-box solution
      • Do existing data catalog tools normally cover these metadata? NO
      • Do existing metadata vocabularies (RDF and not) cover all these metadata? Do they adopt the same model? NO
      BUT by using even basic metadata to describe datasets in data catalogs, “publishers increase discoverability and enable applications to consume metadata from multiple catalogs. This further enables decentralized publishing of catalogs and facilitates federated dataset search across sites.” (from the W3C page on “Best Practices for Publishing Linked Data”)
  15. Dataset description vocabularies
      Let’s see whether the main vocabularies for describing datasets provide the metadata we think are needed for full interoperability.
  16. Semantic interoperability
      In this presentation we cover only RDF vocabularies, with a special focus on semantic interoperability.
      Dataset metadata were managed in several ways before semantic technologies: see NetCDF or HDF5 structures and the various hierarchical array-based structures used especially for observation datasets, which place dataset metadata at the top and data arrays below.
  17. Dataset description vocabularies
      • DCAT vocabulary
        - RDF vocabulary for describing any dataset
        - Datasets can be standalone or part of a “catalog”
        - Metadata about the dataset (collection) and related distributions
      • DataCube vocabulary
        - RDF vocabulary for describing statistical datasets
        - Useful for attaching metadata about the “data structure” of a dataset
      • VOID vocabulary
        - RDF vocabulary for expressing metadata about RDF datasets
        - Useful especially for metadata related to RDF data services
  18. Definition of “dataset” in DCAT
      Definition given by the W3C Government Linked Data Working Group: a dataset is “a collection of data, published or curated by a single source, and available for access or download in one or more formats”.
      The “instances” of the dataset “available for access or download in one or more formats” are called “distributions”. A dataset can have many distributions. Examples of distributions include a downloadable CSV file, an API or an RSS feed.
  19. DCAT model
      Coverage of the metadata needs listed earlier (“NO” marks what DCAT does not cover):
      1) identifier(s)
      2) who is responsible for it
      3) when and where the data were collected
      4) relations to organizations, persons, publications, software, projects, funding…: NO
      5) the conditions for re-use (rights, licenses)
      6) provenance, version: NO
      7) coverage of the dataset
      8) dimensions and semantics: NO
      9) slices, subsets: NO
      10) URL
      11) format
      12) protocols, parameters: NO
  20. DCAT and DCAT-AP
      The DCAT Application Profile for data portals in Europe (DCAT-AP) is an extension of DCAT. It combines DCAT with the W3C Asset Description Metadata Schema (ADMS) vocabulary, plus classes and properties from Dublin Core, SKOS and vCard, in an application profile.
      The elaboration of DCAT-AP was a joint initiative of DG CONNECT, the EU Publications Office and the ISA Programme.
      A diagram of the full DCAT-AP specification is on the next slide.
  21. Full DCAT-AP
      [Diagram of the full DCAT-AP model; highlighted areas: versions, rights and provenance, standards, rights, format, relation]
      More than DCAT:
      1) who is responsible for it: MORE
      2) relations to organizations, projects, publications, funding: Partly
      3) the conditions for re-use (rights, licenses): MORE
      4) provenance, version: YES
      5) protocols, parameters: NO
      6) dimensions and semantics: NO
  22. Limitations of DCAT
      It doesn’t cover:
      • semantic relations to organisations, persons, software, projects, funding…
      • dimensions and variables, and the syntax / semantics of dimensions and variables
      • protocols and parameters for datasets available through APIs
      • time and space slices, subsets
  23. Combining DCAT with other vocabularies
      “Other, complementary vocabularies may be used together with DCAT to provide more detailed format-specific information. For example, properties from the VoID vocabulary can be used if that dataset is in RDF format.” (from the DCAT specification)
  24. DataCube: structure definition
      A cube is organized according to a set of dimensions, attributes and measures.
      • The dimension components serve to identify the observed dimensions (e.g. time, geographic region, gender, elevation…).
      • The measure components represent the phenomenon being observed (e.g. life expectancy).
      • The attribute components enable specification of the units of measure, any scaling factors and metadata such as the status of the observation (e.g. estimated, provisional).
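The three component kinds on slide 24 can be mirrored in a few lines of plain Python. This is a sketch of the idea only, not the DataCube RDF vocabulary itself: the `DataStructure` class, its `problems` method and the sample observation are assumptions made for illustration, reusing the slide's own example of life expectancy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """Plain-Python mirror of a cube's structure (illustrative only)."""
    dimensions: tuple        # identify an observation, e.g. time, region
    measures: tuple          # the phenomenon observed, e.g. life expectancy
    attributes: tuple = ()   # qualifiers such as unit or observation status

    def problems(self, observation):
        """List what is wrong with an observation relative to the structure."""
        issues = []
        for name in self.dimensions + self.measures:
            if name not in observation:
                issues.append(f"missing {name}")
        known = set(self.dimensions + self.measures + self.attributes)
        for name in observation:
            if name not in known:
                issues.append(f"unexpected {name}")
        return issues


structure = DataStructure(
    dimensions=("time", "region", "gender"),
    measures=("life_expectancy",),
    attributes=("unit", "status"),
)

complete = {
    "time": "2015", "region": "Kerala", "gender": "female",
    "life_expectancy": 77.9, "unit": "years", "status": "estimated",
}
incomplete = {"time": "2015", "region": "Kerala", "life_expectancy": 77.9}

print(structure.problems(complete))
print(structure.problems(incomplete))
```

Dimensions and measures are treated as mandatory and attributes as optional, which is a simplification chosen for this sketch. The sample values are invented, not real statistics.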
  25. DataCube model for dataset structure
      This part of the model could be re-used for describing the dimensions of any dataset, including non-statistical ones.
      More than DCAT-AP:
      1) dimensions and semantics: YES
      2) slices, subsets: YES
  26. VOID model
      Licensing properties shown: dct:license, wv:norms, wv:waiver
      More than DCAT-AP:
      3) URL: MORE
      4) the conditions for re-use (rights, licenses): MORE
      5) protocols, parameters: Partly
      6) dimensions and semantics of dimensions: Partly
      7) slices, subsets: YES
  27. Complementing DCAT
      • For dimensions, semantics of dimensions, slicing:
        - DataCube
        - DDI
      • For API aspects:
        - VOID (linked data)
        - Web service descriptions (Hydra, WSDL, WADL)
      • For relations to organizations, projects, publications, funding…:
        - CERIF for datasets
        - VIVO Datastar
  28. Many vocabularies
      Vocabularies related to DCAT or using the same model:
      • DCAT-AP and other extensions
      • W3C HCLS dataset descriptions
      • DataID: extension with capabilities to describe dataset hierarchies, fine-grained technical details of datasets, dataset permissions, dataset distributions and machine-readable licensing information
      • GeoDCAT-AP: geospatial extension of DCAT
      • DDI-RDF Discovery Vocabulary (mapped to Data Cube, DCAT and XKOS; DDI XML exportable from Dataverse)
      • Schema.org
      Other vocabularies:
      • DataCite and re3data
      • CERIF for datasets
      • VIVO Datastar
      • INSPIRE
  29. Examples of application of DCAT
      • CKAN data catalog tool (more in your workshop)
      • Data catalogs:
        - data.gov, data.gov.uk, data.gov.au
        - Africa Open Data, Indonesia Data Portal
        - EU Data Portal
        - More here: https://ckan.org/instances/
      • CIARD RING federated data catalog managed by GFAR
  30. CIARD RING
  31. Datasets in the RING dataset hub
      • Datasets can be registered as standalone sources or as part of a “collection” (DCAT model)
      • A dataset is identified by: uniform type of content, uniform data structure (dimensions / metadata set, encoding, reference value lists)
      • One dataset can be made available / accessible as different “distributions” (format, protocol, URL)
      • The RING uses a combination of the DCAT model, the VOID vocabulary and the DataCube vocabulary; a “RING DCAT profile” will be published
  32. Sample dataset record
  33. RDF of the record
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:dct="http://purl.org/dc/terms/"
               xmlns:dcat="http://www.w3.org/ns/dcat#"
               xmlns:void="http://rdfs.org/ns/void#"
               xmlns:schema="http://schema.org/">
        <rdf:Description rdf:about="http://ring.ciard.net/node/19517">
          <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
          <rdf:type rdf:resource="http://rdfs.org/ns/void#Dataset"/>
          <rdf:type rdf:resource="http://schema.org/Dataset"/>
          <rdf:type rdf:resource="http://www.w3.org/ns/adms#Asset"/>
          <dct:title>National Soil Database representative Soil Systems geography</dct:title>
          <schema:name>National Soil Database representative Soil Systems geography</schema:name>
          <dct:description>National Soil Database representative Soil Systems geography (1:500.000) ...</dct:description>
          <dcat:landingPage rdf:resource="http://soilmaps.entecra.it/ita/bancadati1.html"/>
          <schema:url rdf:resource="http://soilmaps.entecra.it/ita/bancadati1.html"/>
          <dct:publisher rdf:resource="http://ring.ciard.net/node/19510"/>
          <schema:publisher rdf:resource="http://ring.ciard.net/node/19510"/>
          <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#gYear">1990</dct:issued>
          <dct:spatial rdf:resource="http://ring.ciard.net/taxonomy_term/326"/>
          <dct:spatial>National</dct:spatial>
          <dcat:theme rdf:resource="http://ring.ciard.net/taxonomy_term/2052"/>
          <schema:about rdf:resource="http://ring.ciard.net/taxonomy_term/2052"/>
          <dct:conformsTo rdf:resource="http://ring.ciard.net/node/19239"/>
          <dct:identifier>http://ring.ciard.net/node/19517.rdf</dct:identifier>
          <schema:contentLocation rdf:resource="http://ring.ciard.net/taxonomy_term/326"/>
          <dct:type rdf:resource="http://ring.ciard.net/taxonomy_term/81"/>
          <dcat:catalog rdf:resource="http://ring.ciard.net/node/19436"/>
          <dcat:distribution rdf:resource="http://ring.ciard.net/field_collection_item/5055"/>
          <schema:distribution rdf:resource="http://ring.ciard.net/field_collection_item/5055"/>
          <void:dataDump rdf:resource="http://ring.ciard.net/field_collection_item/5055"/>
          <dct:rights rdf:resource="http://ring.ciard.net/taxonomy_term/2053"/>
        </rdf:Description>
      </rdf:RDF>
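As an illustration of how an application might consume such a record, the sketch below reads a shortened copy of it with Python's standard `xml.etree.ElementTree` module. The shortened `RECORD` string, the `rdf:RDF` wrapper with namespace declarations and the `summarize` function are assumptions added for this example. A production application would more likely use a dedicated RDF library, since the same RDF graph can be written as XML in several equivalent ways that a plain XML parser does not normalize.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the vocabularies used in the record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dct": "http://purl.org/dc/terms/",
    "dcat": "http://www.w3.org/ns/dcat#",
}

# Shortened copy of the RING record, wrapped so that it is well-formed XML.
RECORD = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
  <rdf:Description rdf:about="http://ring.ciard.net/node/19517">
    <rdf:type rdf:resource="http://www.w3.org/ns/dcat#Dataset"/>
    <dct:title>National Soil Database representative Soil Systems geography</dct:title>
    <dcat:landingPage rdf:resource="http://soilmaps.entecra.it/ita/bancadati1.html"/>
    <dcat:distribution rdf:resource="http://ring.ciard.net/field_collection_item/5055"/>
  </rdf:Description>
</rdf:RDF>"""

# ElementTree addresses namespaced attributes as "{namespace-uri}local-name".
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def summarize(rdf_xml):
    """Extract the dataset URI, title and distribution URIs from the record."""
    root = ET.fromstring(rdf_xml)
    node = root.find("rdf:Description", NS)
    return {
        "uri": node.get(RDF_ABOUT),
        "title": node.findtext("dct:title", namespaces=NS),
        "distributions": [d.get(RDF_RESOURCE)
                          for d in node.findall("dcat:distribution", NS)],
    }


summary = summarize(RECORD)
print(summary["title"])
print(summary["distributions"])
```

The distribution URI returned here is what an application would dereference next to learn the format and access URL of the data.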
  34. References
      1. National Research Council (U.S.), Space Science Board (1986). Issues and Recommendations Associated with Distributed Computation and Data Management Systems for the Space Sciences. National Academies, 111 pages. https://books.google.co.uk/books?id=h4krAAAAYAAJ
      2. Sacchi, S., Wickett, K. M., & Renear, A. H. (2010). Dataset definitions. Champaign, IL: Center for Informatics Research in Science and Scholarship. (Rep. No. CIRSS/DATACONS--2010/1/VER01+DCDC)
      3. Alexander, K., Cyganiak, R., Hausenblas, M., & Zhao, J. (2009). Describing Linked Datasets: On the Design and Usage of voiD, the “Vocabulary of Interlinked Datasets”. In Linked Data on the Web Workshop (LDOW 09), in conjunction with the 18th International World Wide Web Conference (WWW 09).
      4. Renear, A. H., Sacchi, S., & Wickett, K. M. (2010). Definitions of Dataset in the Scientific and Technical Literature. http://mail.asist.org/asist2010/proceedings/proceedings/ASIST_AM10/submissions/240_Final_Submission.pdf
      5. W3C Government Linked Data Working Group. http://www.w3.org/2011/gld/wiki/Main_Page
      6. UK Gov Linked Data Working Group: LD registry. https://github.com/der/ukl-registry-poc/wiki
  35. Vocabularies and catalogs
      • DCAT: http://www.w3.org/TR/vocab-dcat/
      • DCAT-AP: https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-application-profile-data-portals-europe-final
      • DataCube: http://purl.org/linked-data/cube#
      • VOID: http://rdfs.org/ns/void-guide
      • DDI-RDF Discovery Vocabulary: http://rdf-vocabulary.ddialliance.org/discovery.html
      • VIVO Datastar: http://sourceforge.net/projects/vivo/files/Datastar%20ontology/
      • CERIF for datasets: https://cerif4datasets.wordpress.com/c4d-deliverables/
      • CKAN: http://ckan.org/
      • Dataverse: http://dataverse.org/
      • Datahub: http://datahub.io/
      • DataCite: http://search.datacite.org/ui?q=subject%3Aagriculture
      • Re3data: http://www.re3data.org
      • OpenAIRE: https://www.openaire.eu/
      • CIARD RING: http://ring.ciard.info
  36. Dataset description and DCAT
      DRTC/ISI-ICSU/CODATA International Workshop on Open Data Repositories
      Thank you
      Valeria Pesce, valeria.pesce@fao.org
