The necessity of metadata for linked open data and its contribution to policy analyses #CeDEM12
1. The necessity of metadata for linked open data and
its contribution to policy analyses
Anneke Zuiderwijk*, Keith Jeffery**, Marijn Janssen*
*Delft University of Technology, The Netherlands
**Science and Technology Facilities Council, United Kingdom
CEDEM 2012, May 3-4
2. Open governmental data
0 "We are sending a strong signal to administrations today. Your
data is worth more if you give it away. So start releasing it
now." (European Commission Vice President Neelie Kroes, "Digital
Agenda: Turning government data into gold", December 12, 2011)
0 One of many examples showing that open governmental
data have gained considerable attention recently
3. The ENGAGE project
0 ENGAGE (FP7): An Infrastructure for Open, Linked
Governmental Data Provision towards Research Communities
and Citizens (http://www.engage-project.eu)
0 Main goal: the development and use of a data infrastructure,
incorporating distributed and diverse public sector information
(PSI) resources.
0 The ENGAGE platform will enable researchers and citizens to:
0 Discover and browse datasets across diverse and dispersed public
sector information resources (local, national and European) in their
own language
0 Download the datasets
0 Perform geospatial search of datasets
0 Visualize properly structured datasets in data tables, maps and charts
4. Open governmental data
0 Open governmental data can be defined as “all stored data of
the public sector which could be made accessible by
government in the public interest without any restrictions on
usage and distribution” (Geiger & Von Lucke, 2011, p. 185).
0 For example, public sector data can be:
0 Geographic data (e.g. cadastral information)
0 Legal data (e.g. courts decisions, legislation)
0 Meteorological data (e.g. climate data, weather forecasts)
0 Social data (e.g. population, public administration)
0 Transport data (e.g. traffic congestion, work on roads)
0 Business data (e.g. chamber of commerce, patents) (MEPSIR study,
Dekkers et al., 2006)
5. Linked open data (LOD)
0 Focus on turning public sector data into LOD
1. Public body produces data (and metadata)
2. Data become available on the Web of Data / Semantic Web
3. Open data can be reused
4. Open data can be linked to other data to show relationships
5. Data are both open and linked: Linked Open Data (LOD)
Figure 1: Process for creating Linked Open Data. Stages: (1) public sector (policy) data and metadata; (2) publication on the Semantic Web; (3) reusing open data; (4) linking data; (5) linked open data.
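The linking step (4) can be sketched with the example from the speaker notes: a demographic dataset and a crime dataset joined on postal code. This is a minimal sketch only; the records, field names and the helper `link_on_postal_code` are invented for illustration and do not come from any real dataset.

```python
# Illustrative only: both datasets and all values are invented.
demographics = [
    {"postal_code": "2611", "population": 12000},
    {"postal_code": "2628", "population": 8000},
]
crimes = [
    {"postal_code": "2611", "reported_crimes": 240},
    {"postal_code": "2628", "reported_crimes": 80},
    {"postal_code": "9999", "reported_crimes": 5},  # no demographic match
]


def link_on_postal_code(left, right):
    """Join two lists of records on their shared 'postal_code' key."""
    right_by_code = {row["postal_code"]: row for row in right}
    linked = []
    for row in left:
        match = right_by_code.get(row["postal_code"])
        if match is not None:
            # Merge the two records into one linked record.
            linked.append({**row, **match})
    return linked


linked = link_on_postal_code(demographics, crimes)
for row in linked:
    rate = 1000 * row["reported_crimes"] / row["population"]
    print(row["postal_code"], f"{rate:.1f} crimes per 1000 inhabitants")
```

Records without a counterpart in the other dataset are dropped here; whether to keep them is a design choice that depends on the analysis.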
6. Metadata
0 Metadata are part of the LOD-process
0 Metadata are needed to make sense of the open data (Berners-
Lee, 2009)
0 Metadata are defined as “structured information that
describes, explains, locates, or otherwise makes it easier to
retrieve, use, or manage an information resource.” (National
Information Standards Organization, 2004, p. 1).
0 Metadata provision in the ideal situation:
0 Discovery metadata, e.g. identifier, title, creator, keywords.
0 Contextual metadata, e.g. organizations, projects, funding.
0 Detailed metadata, e.g. quality and domain specific parameters.
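The three kinds of metadata listed above could be modelled as one record per dataset. The sketch below is an assumption for illustration: the class names, field names and sample values are hypothetical and do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryMetadata:
    """Flat metadata used to find a dataset."""
    identifier: str
    title: str
    creator: str
    keywords: list[str] = field(default_factory=list)


@dataclass
class ContextualMetadata:
    """Who and what surrounds the dataset."""
    organizations: list[str] = field(default_factory=list)
    projects: list[str] = field(default_factory=list)
    funding: list[str] = field(default_factory=list)


@dataclass
class DetailedMetadata:
    """Quality and domain-specific parameters."""
    quality: dict[str, str] = field(default_factory=dict)
    domain_parameters: dict[str, str] = field(default_factory=dict)


@dataclass
class DatasetMetadata:
    discovery: DiscoveryMetadata
    context: ContextualMetadata
    detail: DetailedMetadata


# Invented sample record.
record = DatasetMetadata(
    discovery=DiscoveryMetadata(
        identifier="dataset-001",
        title="Traffic congestion 2011",
        creator="Example Ministry of Transport",
        keywords=["traffic", "congestion"],
    ),
    context=ContextualMetadata(projects=["Example monitoring project"]),
    detail=DetailedMetadata(quality={"accuracy": "unverified"}),
)
print(record.discovery.title, record.context.projects)
```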
7. Why metadata are necessary in analyzing LOD
0 Metadata for LOD can be useful in the following situations.
Metadata:
0 create order within datasets;
0 improve storing and preservation of LOD;
0 make it easier to find LOD;
0 improve the accessibility of LOD;
0 may make it possible to assess and rank the quality of LOD;
0 make it easier to analyze, compare and reproduce LOD, and therefore
to find inconsistencies in LOD;
0 improve chances of a correct interpretation of LOD;
0 improve the possibilities to find patterns in LOD to generate new
hypotheses;
0 may improve visualizing LOD;
0 make it easier to link data;
0 avoid unnecessary duplication of LOD.
8. Problem statement
0 Discrepancies between the benefits that are described in
literature and the benefits that are obtained in reality
0 Current situation is a long way from the ideal situation:
0 there are usually few, and insufficient, ways of managing metadata
and interpreting LOD (for instance Hernández-Pérez et al., 2009;
Schuurman et al., 2008; Xiong et al., 2011);
0 adding metadata is often viewed as an additional activity that only
consumes resources.
0 Statements:
0 Merely linking data is not enough to make use of open data
0 Metadata are key enablers for the effective use of LOD in
policy-making
9. Requirements for a metadata architecture
0 The metadata should:
0 be easily discovered;
0 interconvert common metadata formats used in PSI;
0 provide a LOD representation of the metadata for browsing
or query;
0 maintain the capabilities of conventional information
systems with structured query including convenient
primitive operations.
10. Outline architecture
0 The requirements lead to the following architecture:
[Diagram: a portal server (portal, metadata), an application server (running software application) and PSI dataset servers (PSI datasets).]
Figure 2: An architecture of a portal server for the provision of metadata.
11. Metadata
0 Metadata should be used to implement this architecture
A 3-layer structure for metadata is used:
a) discovery (flat) metadata; for example:
0 Dublin Core (DC);
0 e-Government Metadata Standard (e-GMS);
0 Comprehensive Knowledge Archive Network (CKAN);
0 or similar ‘flat’ metadata
b) contextual metadata, using the Common European Research
Information Format (CERIF);
c) detailed metadata.
12. The Vision: Metadata for Data Model
[Diagram: three metadata layers: DISCOVERY (DC, eGMS…), CONTEXT (CERIF) and DETAIL (subject or topic specific). Arrow labels: "Generate" and "Point to". Side labels: "Linked open data" and "Formal Information Systems".]
13. Design
The presented structure provides the following improvements:
0 CERIF provides much richer metadata than the standards
commonly used with PSI datasets.
0 The representation of contextual metadata (CERIF) allows rich
semantics to be represented thus making the PSI datasets
understandable to the end user (or software) through the
metadata.
0 The Structured Query Language (SQL) has a simpler structure
than SPARQL and includes convenient primitive operations for
simple statistical calculations such as sum, count, average.
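The primitive operations mentioned above (sum, count, average) can be sketched with Python's built-in `sqlite3` module. This is an illustration only: the table `dataset`, its columns and all rows are invented and are not part of the proposed architecture.

```python
import sqlite3

# In-memory database with an invented catalogue of datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (title TEXT, sector TEXT, size_mb REAL)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        ("Road works 2011", "transport", 4.0),
        ("Traffic congestion 2011", "transport", 6.0),
        ("Weather forecasts 2011", "meteorological", 2.0),
    ],
)

# COUNT, SUM and AVG are built-in SQL aggregate functions.
rows = conn.execute(
    "SELECT sector, COUNT(*), SUM(size_mb), AVG(size_mb) "
    "FROM dataset GROUP BY sector ORDER BY sector"
).fetchall()
conn.close()

for sector, count, total, average in rows:
    print(sector, count, total, average)
```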
14. Benefits of architecture
0 Because CERIF offers powerful, expressive semantics over a formal
syntax, we can:
0 Generate discovery metadata from CERIF;
0 Interconvert common metadata formats used in PSI using CERIF as the
superset exchange mechanism;
0 Provide a semantic web / LOD representation of the metadata for
browsing or query using SPARQL;
0 While maintaining a conventional information systems capability with
structured query including convenient primitive operations.
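The idea of a richer record serving as the superset from which flat formats are derived can be sketched as follows. This is a hedged illustration, not the CERIF schema: `rich_record` is a hypothetical stand-in for a contextual record, and the output keys only loosely follow Dublin Core element names and CKAN-style dataset fields.

```python
# Hypothetical superset record; NOT the actual CERIF data model.
rich_record = {
    "identifier": "dataset-001",
    "title": "Traffic congestion 2011",
    "organization": "Example Ministry of Transport",
    "project": "Example monitoring project",
    "funding": "Example national programme",
    "keywords": ["traffic", "congestion"],
}


def to_dublin_core(record):
    """Derive flat, Dublin Core-like discovery metadata."""
    return {
        "dc:identifier": record["identifier"],
        "dc:title": record["title"],
        "dc:creator": record["organization"],
        "dc:subject": list(record["keywords"]),
    }


def to_ckan_style(record):
    """Derive a flat, CKAN-style dataset description (field names assumed)."""
    return {
        "name": record["identifier"],
        "title": record["title"],
        "author": record["organization"],
        "tags": [{"name": keyword} for keyword in record["keywords"]],
    }


dc = to_dublin_core(rich_record)
ckan = to_ckan_style(rich_record)
print(dc)
print(ckan)
```

Both flat views drop the project and funding context, which illustrates why conversion is done from the richer record rather than between the flat formats directly.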
15. Models for an infrastructure
0 The data model described, with its metadata, is only one of the
relevant models
0 The other models are:
0 User model
0 Processing model
0 Resource model
16. The Vision: The Models
[Diagram: User Model, Processing Model, Data Model and Resource Model, with the labels "Complete cohort of users" and "Complete ICT environment for PSI".]
17. Models – User model
0 User Model: controls the way in which the end-user interacts
with the e-infrastructure.
0 User profile, security certification, privacy;
0 Device and interaction mode preferences (keyboard/mouse through
voice and gesture to brain-connected), language preference;
0 Resource preferences (including contacts) with directories;
0 METADATA
18. Models – Processing model
0 Processing Model controls the way processes are
constructed and executed in the e-infrastructure
0 Services
0 Described for discovery, described for functional and non-functional
(security, privacy, performance) properties
0 Mobile (deployed in distributed / parallel execution environments)
0 Open source where possible
0 Service composition
0 Dynamically (re-) composable during execution
0 METADATA
19. Models – Data model
0 Data Model controls data representation and data (re-)use
0 Formal syntax (structure)
0 Even for text, images, streamed video
0 Declared semantics (meaning)
0 METADATA
20. Models – Resource model
0 Resource Model catalogs the available computing
resources in the e-infrastructure
0 This allows virtualisation so the user neither knows nor cares from
where the data comes, or where the processing is done, as long as
quality of service is maintained;
0 Requires updating by resource owners – together with conditions of
use
0 METADATA
21. Conclusions (1)
0 Metadata are needed to make sense of the open data
0 Merely linking data is not enough to make optimal use of open
data
0 Metadata are key enablers for policy-making
0 Adding metadata can yield considerable benefits, including:
0 creating order in datasets
0 improving findability, accessibility, storing and preservation of LOD
0 making it easier to analyze, compare and reproduce LOD and to find
inconsistencies
0 correct interpretation and visualization of LOD
0 finding patterns in LOD to generate new hypotheses
0 making linking of data easier
0 assessing and ranking the quality of LOD and avoiding unnecessary
duplication of LOD
22. Conclusions (2)
0 Architecture for metadata:
0 discovery metadata can be generated from CERIF
0 common metadata formats can use CERIF as the superset exchange
mechanism
0 a LOD representation of the metadata for browsing or query can be
made allowing the use of SPARQL
0 while a conventional information systems capability with structured
query including convenient primitive operations can be maintained
0 We recommend further implementation of the proposed metadata
architecture
Editor's Notes
Start with a quotation from Neelie Kroes (December 12, 2011). This example shows that open data have gained considerable attention recently. Another example is the ENGAGE project.
Framework Programme 7 (FP7) shows the attention of the European Commission for open data. ENGAGE is part of FP7. Main goal. The paper that we present here stems from the ENGAGE project.
What are open governmental data? Mention the definition by Geiger & Von Lucke. We adopt this definition as it excludes the publication of data which must remain confidential, are private or contain industrial secrets. Examples of open governmental data.
Linking data provides us with the benefits of open data; value is obtained by linking and showing relationships. How are LOD created? A public body produces anonymised (non-personally identifiable) data during the course of its ordinary business. The produced data become freely available to everyone on the Web of Data, also referred to as the Semantic Web. The public sector data are then referred to as open data and can be used, reused and redistributed by everyone, without restrictions from copyright, patents or other mechanisms of control. One way of reusing open data is to link them to other data to show relationships with these other data. The Linked Data that are the outcome of this linking are defined as “a collection of interrelated datasets on the Web”. Data which are both open and linked, referred to as LOD, are data that meet the requirements of open data and that also show relationships among the open data, thus providing information, which may be defined as structured data in context. After PSI is converted into LOD, this creates interesting possibilities for analyzing policies of public bodies. For example, take two datasets: one with demographic data and one with crime data. Linking them on the basis of postal codes will show relationships between demographic data and crime data.
We saw that publishing metadata is part of the LOD-process. Metadata are needed to make sense of the open data. Metadata are data about the data. We define metadata as “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.” In the ideal situation for LOD, different types of metadata are provided: discovery (flat) metadata (which are descriptive and navigational), contextual metadata (which are descriptive, restrictive and navigational) and detailed metadata (which cover schema metadata plus additional metadata to assure quality). These types of metadata describe, among other things, the following information about the LOD. Discovery (flat) metadata: identifier, title, creator, publisher, country, source, type, format, language, sector, subjects, keywords, relative information system, validity date (from – to), audience, legal framework, status, relevant resources and linked data sets. Contextual metadata: organizations, persons, projects, funding, facilities, equipment, services and pointers to detailed metadata. Detailed metadata: quality (accuracy, precision, calibration and other parameters (Charalabidis, Ntanos, & Lampathaki, 2011)) and domain- or dataset-specific parameters that are used by software accessing and processing the dataset.
Benefits of the metadata according to the literature overview.
There are discrepancies between the benefits that are described in literature and the benefits that are obtained in reality. The current situation is insufficient. Statements.
Based on the literature overview and two use cases we found that the basic capabilities that are created by adding metadata are as follows: The metadata should be easily discovered; The metadata should interconvert common metadata formats used in PSI; The metadata should provide a LOD representation of the metadata for browsing or query; The metadata should maintain the capabilities of conventional information systems with structured query including convenient primitive operations. To accomplish these capabilities we need discovery, contextual and detailed metadata.
The challenge is to design an architecture to allow (a) end-user ‘citizen’ and ‘researcher’ access via a portal supported by metadata to PSI datasets for download; (b) access - utilising metadata – to those same datasets via a service from a running program on another system to utilise the information in another context. This leads naturally to an architecture sketched in Figure 2.
- CERIF provides much richer metadata than the standards commonly used with PSI datasets and so greatly improves the experience of the end user (or the software) in processing the PSI datasets described by the enhanced metadata.
- The representation of contextual metadata (CERIF) allows rich semantics to be represented simply over a formal syntax, thus making the PSI datasets understandable to the end user (or software) through the enhanced metadata.
- The Structured Query Language (SQL), usually presented to the end user through an easy-to-use Query By Example (QBE) interface, has a simpler structure than SPARQL and includes convenient primitive operations for simple statistical calculations such as sum, count, average.