How to describe a dataset. Interoperability issues

Valeria Pesce
Global Forum on Agricultural Research

Definition of “dataset”
The term “dataset” has been defined in several ways, all of which
further specify or extend the basic concept of “a collection of data”.
Definition given by the W3C Government Linked Data Working Group:
A dataset is “a collection of data, published or curated by a
single source, and available for access or download in one or
more formats”
The “instances” of the dataset “available for access or
download in one or more formats” are called
“distributions”. A dataset can have many distributions.
Examples of distributions include a downloadable CSV
file, an API or an RSS feed.
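The dataset/distribution relationship can be pictured as a simple one-to-many data structure. The following is only a sketch: the class names, field names and example URLs are assumptions made for illustration and are not taken from the W3C definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete way to access or download a dataset."""
    access_url: str   # where the distribution can be reached
    kind: str         # e.g. "CSV file", "API", "RSS feed"


@dataclass
class Dataset:
    """A collection of data published or curated by a single source."""
    title: str
    publisher: str
    distributions: List[Distribution] = field(default_factory=list)


# Hypothetical example: one dataset offered through three distributions.
rainfall = Dataset(
    title="Monthly rainfall observations",
    publisher="Example Agency",
    distributions=[
        Distribution("http://example.org/rainfall.csv", "CSV file"),
        Distribution("http://example.org/api/rainfall", "API"),
        Distribution("http://example.org/rainfall.rss", "RSS feed"),
    ],
)

kinds = [d.kind for d in rainfall.distributions]
print(kinds)
```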

Definition of “interoperability”
“Data interoperability is a feature of datasets -
and of information services that give access to
datasets - whereby data can easily be retrieved,
processed, re-used, and re-packaged
(“operated”) by other systems.”
Interim Proceedings of International Expert Consultation on “Building the CIARD
Framework for Data and Information Sharing”, CIARD (2011)
Here “other systems” means software applications, so
datasets have to be machine-readable.

What applications need
Besides information common to any type of resource (name, author /
owner, date…), applications have to find enough metadata about
datasets to understand:
1. the specific coverage of the dataset (type of data, thematic
coverage, geographic coverage)
2. the necessary technical specifications to retrieve and parse a
distribution of the dataset (format, protocol etc.)
3. the conditions for re-use (rights, licenses)
4. the “dimensions” covered by the dataset (e.g. temperature,
time, salinity, gene, coordinates)
5. the semantics of the dimensions (units of measure, time
granularity, syntax, reference taxonomies)
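The five needs above can be read as a checklist that an application runs against a metadata record. The sketch below assumes a flat dictionary with hypothetical key names (data_type, theme, spatial, and so on); real records would use properties from an actual vocabulary.

```python
# The five groups of metadata listed above, mapped to hypothetical key names.
REQUIRED_GROUPS = {
    "coverage": ["data_type", "theme", "spatial"],
    "technical": ["format", "protocol"],
    "reuse": ["license"],
    "dimensions": ["dimensions"],
    "semantics": ["dimension_semantics"],
}


def missing_groups(metadata: dict) -> list:
    """Return the groups for which at least one key is absent or empty."""
    missing = []
    for group, keys in REQUIRED_GROUPS.items():
        if not all(metadata.get(key) for key in keys):
            missing.append(group)
    return missing


# Hypothetical record: well described, except for its dimensions.
record = {
    "data_type": "observations",
    "theme": "soil salinity",
    "spatial": "Italy",
    "format": "text/csv",
    "protocol": "HTTP",
    "license": "http://creativecommons.org/licenses/by/4.0/",
}

print(missing_groups(record))
```

A record like this one, which covers the first three needs but says nothing about dimensions, is the typical case discussed in the rest of the presentation.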

Partial answers in existing vocabularies
• DCAT vocabulary
– RDF vocabulary for describing any dataset
– Datasets can be standalone or part of a “catalog”
– Datasets are accessible through several “distributions”
– “Other, complementary vocabularies may be used together with DCAT to provide
more detailed format-specific information. For example, properties from the VoID
vocabulary can be used if that dataset is in RDF format.”
• VoID vocabulary
– RDF vocabulary for expressing metadata about RDF datasets
• (SDMX) DataCube vocabulary
– RDF vocabulary for describing statistical datasets
– Useful for attaching metadata about the “data structure” to any dataset that
doesn’t follow a known published standard

Coverage of a dataset
• This can be handled by common Dublin Core properties like subject and
coverage.
• DCAT re-uses these DC properties.
Issue 1: No specific property for the type of data covered in a dataset
The values of these properties have to be understood by machines:
- The value should be standardized, possibly a URI
- The URI should be de-referenceable to a thing
- The thing should be part of an authority list / taxonomy
Issue 2: There is no authority vocabulary for types of data
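A first, purely syntactic check that coverage values are URIs rather than free text might look like the sketch below. It does not de-reference the URI or verify that it belongs to an authority list; the function name and the example values are assumptions for illustration.

```python
from urllib.parse import urlparse


def is_http_uri(value: str) -> bool:
    """True if the value is syntactically an absolute HTTP(S) URI."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical subject values: one taxonomy URI and one free-text keyword.
subjects = ["http://example.org/taxonomy/maize", "maize"]

for value in subjects:
    status = "URI" if is_http_uri(value) else "free text"
    print(f"{value}: {status}")
```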

Conditions for re-use
• DCAT re-uses the license DC property at the level of
distributions
• DCAT re-uses the rights DC property at both the level
of dataset and the level of distribution
dc:license > dc:LicenseDocument
dc:rights > dc:RightsStatement

Technical properties
The necessary technical specifications to retrieve and
parse a distribution of a dataset (format, protocol etc.)
• DCAT re-uses the DC format property;
Issue 1: No property for protocol
The values of these properties have to be understood by
machines, possibly as URIs.
Issue 2: No comprehensive RDF authority lists for these
values (partial: DC Types; non-RDF: IANA types)

VOID
VOID can help with the protocol metadata but only for
RDF datasets:
- Property for data dump: dataDump
- Property for SPARQL endpoint: sparqlEndpoint
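As an illustration of these two properties, the sketch below builds a small Turtle description by plain string formatting. The void:Dataset class and the void:dataDump and void:sparqlEndpoint properties come from the VoID vocabulary; the function name and all example.org URLs are assumptions.

```python
def void_description(dataset_uri, data_dump=None, sparql_endpoint=None):
    """Build a small Turtle description of an RDF dataset using VoID."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "",
        f"<{dataset_uri}> a void:Dataset",
    ]
    if data_dump:
        lines.append(f"    ; void:dataDump <{data_dump}>")
    if sparql_endpoint:
        lines.append(f"    ; void:sparqlEndpoint <{sparql_endpoint}>")
    lines.append("    .")
    return "\n".join(lines)


# Hypothetical RDF dataset with both a dump and a SPARQL endpoint.
ttl = void_description(
    "http://example.org/dataset/soils",
    data_dump="http://example.org/dumps/soils.nt",
    sparql_endpoint="http://example.org/sparql",
)
print(ttl)
```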

“Dimensions” and their semantics
DCAT does not describe the dimensions of a dataset,
except for a reference to a standard if the dataset
dimensions can be defined by a formalized standard
(e.g. an XML schema or an RDF vocabulary or an ISO
standard)
dc:conformsTo > dc:Standard
Statistical vocabularies can help
with the description of the dimensions

SDMX: data structure and dimensions
SDMX: Statistical Data and Metadata Exchange
The data structure definition is a description of all the metadata needed to
understand the data set structure.
This includes:
• identification of the dimensions (Dimension) according to standard
statistical terminology;
• the key structure (KeyDescriptor);
• the code lists (CodeList) that enumerate valid values for each dimension;
• the coded attributes (CodedAttribute): information about whether attributes
are required or optional, and coded or free text.
Given the metadata in the data structure definition, all of the data in the
data set becomes meaningful.
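How a data structure definition makes the data "meaningful" can be shown with a simplified model: once dimensions, code lists and attribute rules are declared, each observation can be checked against them. This is a sketch only, not the SDMX information model; the class names and sample values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Dimension:
    name: str
    code_list: List[str]                    # valid values for this dimension


@dataclass
class Attribute:
    name: str
    required: bool
    code_list: Optional[List[str]] = None   # None means free text


def validate(observation: Dict[str, str],
             dimensions: List[Dimension],
             attributes: List[Attribute]) -> List[str]:
    """Return a list of problems; an empty list means the observation fits."""
    problems = []
    for dim in dimensions:
        value = observation.get(dim.name)
        if value not in dim.code_list:
            problems.append(f"{dim.name}: {value!r} not in code list")
    for attr in attributes:
        value = observation.get(attr.name)
        if value is None:
            if attr.required:
                problems.append(f"{attr.name}: required attribute missing")
        elif attr.code_list is not None and value not in attr.code_list:
            problems.append(f"{attr.name}: {value!r} not in code list")
    return problems


# Hypothetical structure: two coded dimensions, one coded and one free-text attribute.
dims = [Dimension("area", ["IT", "FR"]), Dimension("year", ["2012", "2013"])]
attrs = [Attribute("unit", True, ["mm", "cm"]), Attribute("note", False)]

good = {"area": "IT", "year": "2013", "unit": "mm"}
bad = {"area": "DE", "year": "2013"}

print(validate(good, dims, attrs))
print(validate(bad, dims, attrs))
```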

DataCube: simplified SDMX in RDF
[Figure: DataCube example. Callouts: reference to a concept scheme; “semantic role” of the property]
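Since the slide's example is not reproduced here, the following sketch shows roughly what a DataCube description of dimensions could look like, generated as Turtle by plain string formatting. The qb: terms (DataStructureDefinition, component, dimension, measure, DimensionProperty, MeasureProperty, codeList) belong to the DataCube vocabulary; the ex: namespace, the property names and the helper functions are assumptions, and this is not the example from the original slide.

```python
PREFIXES = """\
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/ns#> .
"""


def component_property(name, role, code_list=None):
    """Declare one component; role is 'Dimension', 'Measure' or 'Attribute'."""
    lines = [f"ex:{name} a qb:{role}Property"]
    if code_list:
        # Reference to the concept scheme that lists the valid values.
        lines.append(f"    ; qb:codeList ex:{code_list}")
    lines.append("    .")
    return "\n".join(lines)


def structure_definition(name, components):
    """components: list of (property name, role) pairs."""
    lines = [f"ex:{name} a qb:DataStructureDefinition"]
    for prop, role in components:
        lines.append(f"    ; qb:component [ qb:{role.lower()} ex:{prop} ]")
    lines.append("    .")
    return "\n".join(lines)


# Hypothetical structure: two dimensions and one measure.
area = component_property("refArea", "Dimension", code_list="areaCodes")
temperature = component_property("temperature", "Measure")
dsd = structure_definition(
    "climateStructure",
    [("refArea", "Dimension"), ("refPeriod", "Dimension"),
     ("temperature", "Measure")],
)

print("\n".join([PREFIXES, area, temperature, dsd]))
```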

Combining different vocabularies
DATASET (DCAT model): Name, URL, Owner, Content type, Topic(s), Language, Metadata set(s), Data structure, Distribution(s), […]
– If one or more known published metadata sets are used, just fill “metadata set(s)”; otherwise link to a “data structure” with custom “dimensions”.
DISTRIBUTION (DCAT model): Name, Protocol, Endpoint URL, Media type, Format, Size
DATA STRUCTURE (DataCube model): Dimensions, Attributes, Measures, Value lists
RDF DATASET INFO (VOID properties), if the media type is RDF or a SPARQL response: Vocabulary(ies), SPARQL endpoint, Data dump, Serialization format, Number of triples
Catalog: the directory
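The two conditions noted in this combined model can be written as a small decision function. This is a sketch under assumptions: the dictionary keys are hypothetical, and the set of media types treated as "RDF or SPARQL response" is an illustrative selection rather than a complete list.

```python
# Media types treated here as RDF serializations or SPARQL results (not exhaustive).
RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
    "application/sparql-results+xml",
    "application/sparql-results+json",
}


def vocabularies_needed(dataset: dict) -> list:
    """Pick the vocabularies needed for one dataset record (hypothetical keys)."""
    needed = ["DCAT"]  # dataset and distribution descriptions
    # No known published metadata set: describe custom dimensions instead.
    if not dataset.get("metadata_sets"):
        needed.append("DataCube")
    # RDF or SPARQL distributions get the extra RDF dataset info.
    if any(d.get("media_type") in RDF_MEDIA_TYPES
           for d in dataset.get("distributions", [])):
        needed.append("VoID")
    return needed


# Hypothetical records.
csv_only = {"metadata_sets": [], "distributions": [{"media_type": "text/csv"}]}
rdf_known = {
    "metadata_sets": ["http://example.org/standards/some-schema"],
    "distributions": [{"media_type": "text/turtle"}],
}

print(vocabularies_needed(csv_only))
print(vocabularies_needed(rdf_known))
```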

Tools for managing dataset metadata
• CKAN maintained by the Open Knowledge Foundation
Uses most of DCAT. Doesn’t describe dimensions.
Also provides a global dataset hub called the Datahub
• Dataverse created by Harvard University
Uses a custom vocabulary. Doesn’t describe dimensions.
• Commercial solutions
• Repositories and catalogs:
OpenAIRE, DataCite (using re3data to search repositories) and Dryad
use their own vocabularies.
• CIARD RING
Uses full DCAT AP with some extended properties (protocol, data
type) and local taxonomies with URIs mapped when possible to
authorities.
Next steps: adding DataCube properties for dimensions.

Major outstanding issues
• Some missing properties in existing vocabularies:
→ approach vocabulary owners OR extend vocabularies
• Missing vocabularies for protocols, formats
→ approach standardizing bodies?
→ perhaps specific dataset formats?
• Need for more standardized semantics for
dimensions:
→ Joint discussions with the RDA Data Type Registries WG?
• Lack of interoperability metadata in existing tools

References
• W3C DCAT: http://www.w3.org/TR/vocab-dcat/
• DCAT AP: https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-application-profile-data-portals-europe-final
• DataCube: http://purl.org/linked-data/cube#
• VOID: http://rdfs.org/ns/void-guide
• VIVO Datastar: http://sourceforge.net/projects/vivo/files/Datastar%20ontology/
• CERIF for datasets: https://cerif4datasets.wordpress.com/c4d-deliverables/
• CKAN: http://ckan.org/
• Datahub: http://datahub.io/
• DataCite: http://search.datacite.org/ui?q=subject%3Aagriculture
• Re3data: http://www.re3data.org
• Dryad: http://datadryad.org/
• OpenAIRE: https://www.openaire.eu/

Thank you
Valeria Pesce
Global Forum on Agricultural Research
