Measuring Open Data Portal User-Orientation: A Computational Approach

University of Southampton Research Repository
Copyright © and Moral Rights for this thesis and, where applicable, any accompanying data are
retained by the author and/or other copyright owners. A copy can be downloaded for personal non-
commercial research or study, without prior permission or charge. This thesis and the accompanying
data cannot be reproduced or quoted extensively from without first obtaining permission in writing
from the copyright holder/s. The content of the thesis and accompanying research data (where
applicable) must not be changed in any way or sold commercially in any format or medium without
the formal permission of the copyright holder/s.
When referring to this thesis and any accompanying data, full bibliographic details must be given,
e.g.
Thesis: Author (Year of Submission) "Full thesis title", University of Southampton, name of the
University Faculty or School or Department, Master’s Thesis, pagination.
Data: Author (Year) Title. URI [dataset]

University of Southampton
Faculty of Engineering and Physical Sciences
Electronics and Computer Science
Measuring Open Data Portal User-Orientation: A Computational Approach
by
Mark Dix
Thesis for the degree of MSc Web Science
September 2019

Abstract
If downstream value of open data is to be realised, open data portals, the web-based interfaces by
which public bodies make their data available for re-use, must be designed for data use as well as
provision. Hitherto, research has prioritised the latter, with work addressing portal user-orientation
comparatively scarce and also limited in scope by manual methods of data collection and analysis. This
study, grounded in the interdisciplinary field of Web Science, aims to add to the nascent body of
research on open data portal user-orientation and contribute to the field by trialling a scalable,
computational methodology. Building on existing work, the study asks two questions: How user-
oriented are open data portals? What are the limitations to measuring user-orientation
computationally?
Based on a review of literature from Computer Science, Information Science and Marketing, the
research automates and applies an open data portal evaluation framework, developed by Simperl and
Walker (2017, 2019), to analyse user-orientation across 12 European portals. Following the n = all
analysis of 157,749 datasets and 383,730 webpages, results indicate that, in line with existing
literature, portals are oriented toward data provision rather than use but that this differs for specific
features and particular portals. The computational approach is found to provide improved granularity
and, with further development, opportunity for such granular analysis at scale. The study is, however,
limited in its generalisability by its sample size (n = 12 portals) and convenience sampling approach,
and in its validity due to reductive translation of Simperl and Walker’s (2017, 2019) framework for the
purpose of computation.
Key recommendations for portal providers include improved provision of provenance and versioning
metadata, consistent deployment of third-party social media and web analytics platform technology,
and continued effort to meet basic W3C accessibility standards. Further research is required,
principally, to refine and test the computational reinterpretation of Simperl and Walker’s (2017, 2019)
framework, and to migrate the code used for the study to a general purpose programming language
and scalable architecture.

Table of Contents
Table of Tables
Table of Figures
List of Accompanying Materials
Introduction
Chapter 1 Literature Review
1.1 Open Data
1.2 Open Data Portals
1.2.1 Open Data Portals: The Supply-Side Perspective
1.2.2 Open Data Portals: The Demand-Side Perspective
1.2.2.1 Website User-Orientation
1.2.2.2 Open Data Portals: User-Orientation
1.3 Conceptualising Open Data Portal User-Orientation
Chapter 2 Methodology
2.1 Research Questions
2.2 Research Design
2.2.1 Philosophical Grounding
2.2.2 Data Collection and Analysis
2.2.2.1 Sampling
2.2.2.2 Analytics Pipeline
2.2.2.3 Metric Construction
2.2.3 Methodological Limitations
Chapter 3 Analysis
3.1 Scope of Results
3.2 Inter-Category Analysis
3.2.1 Mean and Standard Deviation Category Scores
3.2.2 Demand-Side Versus Supply-Side Oriented Category Means and Standard Deviations
3.3 Intra-Category Analysis
3.3.1 Organise for Use
3.3.2 Promote Use
3.3.3 Be Discoverable
3.3.4 Publish Metadata
3.3.5 Promote Standards
3.3.6 Co-Locate Documentation
3.3.7 Linked Data
3.3.8 Be Measurable
3.3.9 Be Accessible
Chapter 4 Discussion
4.1 Demand Versus Supply-Side Orientation
4.2 Category and Metric Performance
4.3 Limitations of the Computational Approach
Chapter 5 Conclusions
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
List of References
Bibliography

Table of Tables
Table 2-1: Organise for use metric table
Table 2-2: Promote use metric table
Table 2-3: Be discoverable metric table
Table 2-4: Publish metadata metric table
Table 2-5: Promote standards metric table
Table 2-6: Co-locate documentation metric table
Table 2-7: Linked data metric table
Table 2-8: Be measurable metric table
Table 2-9: Co-locate tools metric table
Table 2-10: Be accessible metric table
Table 3-1: Category level summary statistics
Table 3-2: Mean and standard deviation scores for demand and supply-side oriented categories
Table 3-3: Mean demand and supply-side scores for portals scoring above and below the demand and supply-side category grouping means
Table 3-4: ‘Organise for use’ category metrics summary statistics
Table 3-5: ‘Promote use’ category metrics summary statistics
Table 3-6: ‘Be discoverable’ category metrics summary statistics
Table 3-7: ‘Publish metadata’ category metrics summary statistics
Table 3-8: ‘Promote standards’ category metrics summary statistics
Table 3-9: ‘Co-locate documentation’ category summary statistics
Table 3-10: ‘Linked data’ category metrics summary statistics
Table 3-11: ‘Be accessible’ category metrics summary statistics
Table 3-12: ‘Visual and hearing impaired support’ metric component summary statistics

Table of Figures
Figure 1-1: Theoretical framework
Figure 2-1: ckanr dataset metadata extraction script
Figure 2-2: Direct CKAN API call dataset metadata extraction script
Figure 2-3: ‘m1’ function declaration for dataset package metadata transformation
Figure 2-4: ‘f’ and ‘r1’ function declaration for dataset resource metadata transformation
Figure 3-1: Number of pages analysed by portal
Figure 3-2: Number of datasets for which metadata was analysed by portal
Figure 3-3: ‘Be accessible’, ‘promote use’ and ‘promote standards’ category scores by portal
Figure 3-4: ‘Promote use’ metric scores by portal
Figure 3-5: Proportion of completed DCAT-AP mandatory and recommended metadata fields
Figure 3-6: ‘Promote standards’ metric scores by portal
Figure 3-7: Proportion of portal pages with site analytics deployed versus not deployed

List of Accompanying Materials
• Annotated demonstration R code and dependency files for user-orientation analysis of the open
data portal for Ireland: https://github.com/dxmrk/Open-Data-Portal-Analysis.
• A private Sharepoint has been configured for all project assessors where the raw, non-annotated
scripts and dependency files for this project can be accessed.

Introduction
The past decade has seen the open government data movement gain considerable traction. Driven by
demand for increased transparency and scrutiny, as well as a growing understanding that political,
economic and social value can be gleaned from repurposing of the data they generate, government
and public bodies now make an increasing proportion of their data available for public re-use under
open licence. In parallel with the rise of the open government data movement, fuelled by ubiquitous
public use, the web now represents an important channel for government-citizen communication.
Widespread use, coupled with access to low-cost, cloud-based storage and internet bandwidth, also
makes the web a logical publishing platform for open government data. As such, open data portals –
“web-based interfaces designed to make it easier to find re-usable information” (European
Commission, 2018) – are estimated to number more than 2,600 globally (OpenDataSoft, 2019), and
have become the dominant mode for public bodies to publish open data online.
Downstream use value of open government data is not guaranteed, however, simply by virtue of legal
‘openness’ and also depends on the provision of data in a form that permits and actively promotes re-
use (European Data Portal, 2018a). In this regard, since deployment of the first open data portals in
2009/10, portal providers and researchers have focussed on what might be considered the ‘supply-
side’ characteristics of open data, principally the concerns of Computer and Information Science
relating to information search and retrieval. Led by such disciplinary thinking, open data has become
closely associated with the idea of ‘linked data’, in particular, the leveraging of semantic web
standards that permit online search, retrieval and interoperability across otherwise disparate datasets.
Effort on behalf of ‘linked open government data’ practitioners and researchers has, therefore, tended
to skew toward topics related strictly to data supply, notably data and metadata quality.
However, as argued by Simperl and Walker (2019, p5) “it is not enough to just publish good quality
data to a portal, as it can create ‘virtuous data dumps’”. The implication here is that, while provision of
quality data and metadata is important, if this is done without consideration of end user needs then
data re-use will be limited and, ultimately, downstream use value restricted. To avoid a situation
where open data portals become ‘virtuous data dumps’, research is required to augment the
comparatively sparse body of work relating to ‘demand-side’ issues of data use, in particular, to gauge
the extent to which portals are currently orientated to meet user needs and to provide evidence-
based, actionable recommendations for development in this area. This is the primary motivation and
overall aim of the research and underpins two related research objectives: to measure the user-
orientation of open data portals with respect to (i) provision of features aligned to data use versus
data provision and (ii) provision of specific features that promote portal use and re-use.
In order to address these objectives, the research is located within the discipline of Web Science,
taking the web – in this case open data portals – as its main object of study, and employing an
interdisciplinary theoretical framework to extend the study of portals to incorporate demand-side
issues of data use, as well as supply-side issues of data provision. In this regard, a recognised body of
research from the Marketing discipline, which establishes user-oriented design principles as
determinants of website use and reuse behaviour, is integrated theoretically with work from
Computer and Information Science that explicates such principles in the context of open data portals.
Drawing directly on this theoretical approach, the methodological component of the study
operationalises an evaluation framework, proposed by Simperl and Walker (2017, 2019), comprised of
measurement items relating specifically to open data portal user-orientation.
Based on Simperl and Walker’s (2017, 2019) framework, the empirical contribution made by this
research is therefore an evaluation of user-orientation for 12 national level European open data
portals. Various measurement items are addressed that relate to portal data provision and use
characteristics, with analysis and discussion focussed on establishing an aggregated view of user-
orientation across the 12 portals. Rather than presenting data ‘portal-wise’, therefore, the study offers
a ‘category-wise’ view of user-orientation, with reference to the categories that make up the
evaluation framework. That being said, category and composite metric scores for each portal are
included in data tables provided in the appendices.
With the above in mind, the main contribution of this study is, perhaps, not empirical but
methodological. Identified in the review of literature is a shortcoming, inherent to the methodological
approach taken by researchers concerned with open data portal user-orientation, which is
characterised here as the ‘small n’ problem. Simply put, the nascent body of research in this area has
favoured a manual over an automated or computational approach to data extraction and analysis,
resulting in the drawing of conclusions and recommendations for individual portals, or portals
collectively, based on assessment of only a small proportion of datasets or webpages. Acknowledged
in the literature is considerable heterogeneity both between and within portals, in terms of features
relating to data provision and use, and it follows that generalisations based on the analysis of only a
small subset of these features may be susceptible to error.
To address the ‘small n’ problem, this study develops and trials a computational approach to the
evaluation of open data portal user-orientation. Composite metrics that make up Simperl and
Walker’s (2017, 2019) framework are therefore translated, as is feasible, to mathematical form, and
applied to analyse dataset metadata and HTML data extracted from the 12 portals via API connections
and web scraping respectively. In total, 157,749 datasets and 383,730 webpages are analysed using
this approach, which represents a population level dataset analysis (i.e. n = all), and analysis of all
webpages higher than or equal to the subfolder depth at which dataset links are provided by each
portal. As has been noted, for brevity, here data is aggregated and presented category-wise, but could
equally be represented portal-wise or at the level of individual datasets and webpages. While such an
approach has obvious strengths relating to actual and potential scale, efficiencies and granularity,
notable limitations also arise from both the particular methodology employed here and the
reductionism inherent in reinterpretation of Simperl and Walker’s (2017, 2019) portal evaluation
framework to mathematical form. Such potential limitations give rise to the third research objective of
this study: to empirically observe and critically discuss the limitations of a computational approach to
measuring open data portal user-orientation.
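As a rough illustration of the page-depth criterion described above, the following Python sketch keeps only pages whose URL path is no deeper than the level at which dataset links appear. It is not the code used in the study, which was written in R. The reading of "higher than or equal to" as "no deeper than", the function names, the example URLs and the depth value of 2 are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Return the number of non-empty path segments in a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def pages_in_scope(urls, dataset_depth):
    """Keep pages at or above (i.e. no deeper than) the dataset-link depth."""
    return [url for url in urls if path_depth(url) <= dataset_depth]


# Hypothetical portal URLs; a dataset-link depth of 2 is an assumed value.
example_urls = [
    "https://portal.example/",
    "https://portal.example/dataset",
    "https://portal.example/dataset/road-traffic",
    "https://portal.example/dataset/road-traffic/resource/abc123",
]
in_scope = pages_in_scope(example_urls, dataset_depth=2)
```

Under these assumptions the resource page at depth 4 falls outside the crawl, while the home page, the dataset listing and the dataset page are retained.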
In what follows, a review of literature locates the study, first, in the broad context of open data,
before reviewing existing research into data provision and use in the context of open data portals.
Relevant work related to website user-orientation is cited from the marketing literature, and an
interdisciplinary theoretical framework provided to address the research topic. A methodological
chapter provides philosophical grounding for the research approach, as well as detailed discussion of
computational data extraction, transformation and analysis methods. A dedicated analysis chapter
presents the results of the study prior to detailed critical discussion. Finally, conclusions are drawn
from the research and a number of practical recommendations are made for portal providers as well
as recommendations for further research.

Chapter 1 Literature Review
1.1 Open Data
Geiger and Von Lucke (2012) suggest that changes in society, politics and technology are altering
both the process of doing and administering politics and the relationship between politicians,
civil servants and citizens. Advancements in technology infrastructure and use (Chen et al., 2014;
Buchholtz et al., 2014), social demand for customised public services (Geiger and Von Lucke, 2012) and
a shift from ‘information’ to ‘communication’ in politics (Janssen et al., 2012; Cowan et al., 2014)
underpin the citizen participation movement: the idea that better political and service solutions can
be produced by collaboration between the political establishment and citizens.
Located within the broader Public Sector Information (PSI) movement, which seeks to make
information produced by public sector administrations ‘open by default’ (OECD, 2008 in Ubaldi, 2013;
Hansen et al., 2013; Reggi and Ricci, 2011), ‘open government’ requires transparency between citizens
and administrations, and greater accountability through provision of mechanisms for public scrutiny
and oversight (Zuiderwijk et al., 2014): “one requirement for releasing these central points is free
access to open data” (Geiger and Von Lucke, 2012; p265). Open data is defined as: “data that can be
freely used, re-used and distributed by anyone, only subject to (at most) the requirement that users
attribute the data and that they make their work available to be shared as well” (Ubaldi, 2013, p6;
European Data Portal, 2019).
As the web has become the dominant communication mechanism between citizens and governments,
and for widespread dissemination of data (Lnenicka, 2015), ‘open government data’ is thus often
equated with e-government and the extent to which the use of ICT, particularly the web, and data
permits a greater degree of transparency, participation and collaboration between state and citizen
(Geiger and Von Lucke, 2012; Zuiderwijk et al., 2014).
Two closely related technological concepts are central to understanding the state of open data today:
‘linked data’ and ‘big data’.
Berners-Lee (2006) details a set of principles that underpin linked data, providing a blueprint for
publishing structured data to the web. Berners-Lee (ibid) advocates the use of semantic web
technologies including identifiers, standard syntaxes, metadata and ontologies. When publishing data
to the web, adherence to semantic web design principles permits interaction, complex analysis or
application development across otherwise disparate data sources (Hyland and Wood, 2011). Joining
the concepts of open and linked data, Heath and Bizer (2011) suggest that ‘linked open data’ affords a
set of principles, standards and technologies – an architecture – for the open discovery, definition,
integration and re-use of data. In the context of open government data, the extent to which
administrations adhere to such principles, in part, determines the potential usefulness and value of
data published to the web.
Closely related to ‘linked open data’ is the concept of ‘big data’ which, building on a number of widely
cited definitions, Chen et al. (2014, p173) define according to four characteristics – the ‘four Vs’:
“Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge
value but very low density)”. Pace of technological change and the amount of digital data available
means that an increasing proportion of data released under an open licence is already considered to
be ‘big’, or is accumulating towards big data status (Lnenicka and Komarkova, 2019). It follows that, in
addition to adherence to open and linked data licensing and design principles, in order to supply open
data in a way that minimises misuse, misinterpretation and negative effects on the public body that
provides the data (Kucera and Chlapek, 2014), it may also be necessary to address big data
characteristics.
Bearing in mind these challenges, much historical research into open data is concentrated towards the
supply-side, focussing on operational and technical barriers to data provision. Research finds open
data provision restricted by fragmentation in the policies and licensing that predetermine the release
and re-use of data (Janssen et al., 2012; Kalampokis et al., 2011; Zuiderwijk et al., 2014), as well as
technical issues relating to ICT infrastructure and data and metadata quality (Braunschweig et al.,
2012). However, a complementary strand of research also addresses issues relating to data use as well
as provision, and there is growing acknowledgement of the importance of addressing such
demand-side issues in open data programmes (Dawes, 2010; Geiger and Von Lucke, 2012; Reggi and Ricci,
2011; Solar et al., 2012; Ubaldi, 2013). In their analysis of the open data operating strategies employed
across 434 European Union open data programmes, Reggi and Ricci (2011) find the presence of three
distinct strategies employed by administrations:
• Regulation centred: programmes that publish data according to minimum requirements, often
in non-machine readable, proprietary formats, and with no further detail or metadata;
• User centred: programmes that focus on data discoverability and presentation;
• Re-user centred: programmes that focus on data quality and validity.
Administrations are found to focus either on ‘data stewardship’ (point one) or ‘data use and
usefulness’ (points two and three), but rarely both (Janssen et al., 2012; Lee et al., 2012).
While the distinction between data supply and demand-side issues may seem clear cut, more recent
studies have stressed the interdependency of the two perspectives: failure to adhere to data provision
standards restricts downstream data use value, while failure to address issues of data use can lead to
repositories of open data that are no more than “virtuous data dumps” (Simperl and Walker, 2019; p5;
see also Alexopoulos et al., 2014; Charalabidis et al., 2014; Kalampokis et al., 2013).
1.2 Open Data Portals
In accordance with the studies cited above, Yang et al. (2015) argue that a broad range of factors
affect the accessibility and usefulness of open data for prospective users. The authors discuss the
provision of open data in the context of open data portals, which Lnenicka (2015; p593) describes as
the web based “interfaces between government data on one side and re-users on the other”. Portals
are web based systems, operated by data providers (government agencies or other organisations),
which collect and provide access to datasets in a variety of formats from different sources. The first
open data portals launched in 2009 in the United States and 2010 in the United Kingdom, and an
increasing number of public institutions now make data from a wide range of domains available to the
public via web based data portals. The 2018 Open Data Barometer Global Report (World Wide Web
Foundation, 2018) reports that more than half of the 92 countries surveyed had an open data
initiative, while OpenDataSoft reports more than 2,600 open data portals globally (OpenDataSoft,
2019).
1.2.1 Open Data Portals: The Supply-Side Perspective
Correa et al. (2018; see also Neumaier et al., 2016) report that many governments have launched open data
portals based on open data publishing software including Comprehensive Knowledge Archive Network
(CKAN), Drupal Knowledge Archive Network (DKAN), Socrata, ArcGIS Open Data, and OpenDataSoft.
Depending on the configuration of the portal, users are able to view, download or access data either
directly or via an Application Programming Interface (Jovanovik, 2012; Van der Waal et al., 2014), and
tagging of datasets allows users to search thematically. Datasets are not themselves typically part of
the catalogue record; rather, they are linked to via a download or webpage link from the portal. Each
dataset may comprise several resources or data files (Van der Waal et al., ibid).
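The relationship between a catalogue record, its tags and its linked resources can be sketched in Python as follows. The record is modelled loosely on the JSON returned by CKAN's `package_show` action; the sample values are invented, and the other platforms named above structure their records differently. This is an illustration under those assumptions rather than the extraction code used in the study.

```python
import json

# Invented sample shaped like a CKAN `package_show` response (assumption).
sample_response = json.loads("""
{
  "success": true,
  "result": {
    "name": "road-traffic-counts",
    "title": "Road traffic counts",
    "tags": [{"name": "transport"}, {"name": "roads"}],
    "resources": [
      {"url": "https://portal.example/files/traffic.csv", "format": "CSV"},
      {"url": "https://portal.example/files/traffic.pdf", "format": "PDF"}
    ]
  }
}
""")


def summarise_dataset(response: dict) -> dict:
    """Flatten a catalogue record into its name, tags and resource links."""
    record = response["result"]
    return {
        "name": record["name"],
        "tags": [tag["name"] for tag in record.get("tags", [])],
        # The data files sit outside the catalogue record; only links are held.
        "resources": [
            (res.get("format", ""), res["url"])
            for res in record.get("resources", [])
        ],
    }


summary = summarise_dataset(sample_response)
```

Here `summary` holds one dataset with two tags and two resources, mirroring the point that a single dataset may expose several data files.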
The design of portals is not prescribed and Verma and Gupta (2015) report a vast number of possible
dataset categorisation strategies to increase accessibility through improved dataset search and
retrieval including domain, provider and format categorisation. Datasets are typically retrieved via
keyword search and data may be represented in a variety of formats, which determines the extent to
which they are machine readable and interoperable, and so may be integrated with other datasets
(Verma and Gupta, ibid). Additional issues relating to dataset quality include validity, generalisability,
reliability and completeness, and it is important for both data providers and users to consider the
overall quality of the data with respect to these issues (Dawes and Helbig, 2010; Hansen et al., 2013;
Hyland and Wood, 2011).
Braunschweig et al. (2012) stress the importance of complete, accurate metadata, such as descriptions
or tags, to permit manual or automatic search. The metadata structure of an open data portal
provides common properties that describe its datasets. Metadata properties are numerous but, at
minimum, may include the dataset name, description and URL of the data resources i.e. files or end
points (Van der Waal et al., ibid). More comprehensively, data portals should facilitate interoperability
by providing well defined metadata semantics. The Data Catalogue Vocabulary (DCAT) is a World Wide
Web Consortium (W3C) RDF vocabulary that standardises dataset metadata and so makes possible
interoperability across multiple data catalogues (Pullmann et al., 2019). In work leading to the
development of the DCAT vocabulary, Maali et al. (2010) surveyed the metadata of seven open data
catalogues from five different countries to assess the number of datasets, data formats, metadata
structural characteristics including consistency, availability and categorisation, and dataset availability.
Only around half of datasets were found to be available in a machine readable format and, while
metadata fields were found to be relatively consistent across the portals, there was high variation in
the completeness of those fields: many portals were inconsistent in how they applied metadata
to different datasets.
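The kind of field-level completeness check described above can be sketched in a few lines. The following is a minimal illustration in Python, not the method used by Maali et al. (2010): the function name, the three fields (taken from the minimum properties listed above) and the sample records are all hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def field_completeness(records: List[Record], fields: List[str]) -> Dict[str, float]:
    """Return, for each field, the share of records holding a non-empty value."""
    if not records:
        return {field: 0.0 for field in fields}
    shares = {}
    for field in fields:
        # A value counts as present only if it is a non-blank string
        filled = sum(1 for record in records if (record.get(field) or "").strip())
        shares[field] = filled / len(records)
    return shares


# Hypothetical dataset records, invented purely for illustration
sample = [
    {"name": "air-quality", "description": "Hourly readings", "url": "https://example.org/a.csv"},
    {"name": "bus-stops", "description": "", "url": "https://example.org/b.csv"},
    {"name": "budget-2018", "description": None, "url": None},
    {"name": "schools", "description": "School locations", "url": "https://example.org/d.json"},
]
completeness = field_completeness(sample, ["name", "description", "url"])
```

In this invented sample the name field is always populated while description and URL are not, which is the pattern of consistent fields but variable completeness reported in the survey.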
While DCAT provides a basis for the standardisation of open data metadata information, Berners-Lee’s
(2006) set of principles for publishing structured or linked data to the web has now also been
formalised in the ‘5 ★ Open Data’ framework, which comprises five data format categories each
providing a successively greater potential for dataset linking or interoperability (Verma and Gupta,
2015). The base level requirement of this framework is for organisations to make data available under
an open licence, while the remaining levels specify data formats that allow for a progressively higher
degree of interoperability: (1) open licence data; (2) machine readable data formats; (3) non-
proprietary data formats; (4) Resource Description Framework (RDF) standard data formats; (5)
explicitly linked data. In their review of 35 open data portals, Sayogo et al. (2014) find almost all (88%)
portals to provide data in a granular rather than aggregated format, two-thirds (66%) in a machine
readable format but only half in a linked data format (e.g. RDF, JSON, XML or via API). Around 80% of
portals had adopted an open licence for the datasets surveyed. The use value of open data is
therefore not only restricted by incomplete dataset metadata information, but also by licensing and
format inconsistencies that prevent access, machine readability and interoperability.
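Because each level of the ‘5 ★ Open Data’ framework builds on the one before it, a dataset's rating can be read as the number of consecutive criteria it meets, starting from the open licence. A minimal Python sketch of that cumulative logic follows; the class and attribute names are hypothetical, and how each criterion would be judged for a real dataset is left open.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical yes/no judgements for one dataset, one per framework level."""
    open_licence: bool
    machine_readable: bool
    non_proprietary: bool
    rdf_standard: bool
    linked: bool


def star_level(profile: DatasetProfile) -> int:
    """Count consecutive criteria met, in framework order, stopping at the first gap."""
    criteria = [
        profile.open_licence,
        profile.machine_readable,
        profile.non_proprietary,
        profile.rdf_standard,
        profile.linked,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# An openly licensed, machine readable, non-proprietary file that is not RDF
csv_like = star_level(DatasetProfile(True, True, True, False, False))
# Without an open licence no stars are awarded, whatever the format
unlicensed = star_level(DatasetProfile(False, True, True, True, True))
```

The second example reflects the point made above: format quality does not compensate for a missing open licence, since the licence is the base level requirement.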
The remainder of this literature review takes its lead from Simperl and Walker’s (2017, p6)
observation that it is not enough just to publish good quality data to a portal; it is also necessary “to think about
moving to the next stage in publishing, managing and using data”. Such reorientation toward
assessment of open data portals with respect to portal ‘use’ and ‘users’ does not, however, preclude
the need to retain consideration of the data provision issues discussed above. It is important for portal
providers to address both perspectives - supply-side (provision) and demand-side (use) - if they are to
provide a high quality user experience and maximise open data use value. As is discussed in what
follows, failure to adhere to data provision standards restricts the ‘usefulness’ of open data, while
provision of data in such a way that is not ‘easy to use’ or oriented towards user requirements limits
the chances of use and re-use.
1.2.2 Open Data Portals: The Demand-Side Perspective
The following section locates open data portal ‘user-orientation’ within the broader body of Marketing
literature, where website usability and user-orientation are established topics. Core theories,
frameworks and research findings drawn from this discipline are discussed, before examination of the
nascent body of research focussing specifically on user-orientation in the context of open data portals.
1.2.2.1 Website User-Orientation
Loiacono et al. (2002) note that a critical concern in both Information Science and Marketing is the
measurement of website quality, specifically the identification of website features that predict
consumer use and re-use. The authors suggest that, while websites are primarily a form of information
system and so require proper consideration of information storage, display, processing and transfer
functionalities, using a website is also a form of marketing interaction: "information is passed,
consumer's questions are answered, and purchases are made" (p9).
In the Marketing literature user-orientation is predominantly theorised under two well-established
frameworks: ‘The Theory of Reasoned Action’ or ‘TRA’ (Ajzen and Fishbein, 1980) and ‘Technology
Acceptance Model’ or ‘TAM’ (Davis, 1989; Venkatesh, 2000). Ajzen and Fishbein's (1980) TRA posits
that individuals form attitudes about the consequences of their behaviour and that such attitudes,
along with subjective norms, determine behavioural intentions and action. While TRA frames the
sequential, causal link between the beliefs, attitudes, intentions and actions of individuals, it is
situation independent and makes no reference to particular contexts in which the model should be
applied. The TAM (Davis, 1989; Venkatesh, 2000) posits that, in the evaluation of any technology,
users form beliefs about two key factors: ‘ease of use’ and perceived ‘usefulness’. Synthesis of the
two models therefore suggests that the extent to which individuals believe a technological artefact to
be easy to use and useful affects both attitude to act (i.e. to use the artefact) and subsequent use and
re-use behaviour.
Using the TRA and TAM as theoretical grounding, Marketing and e-commerce researchers consistently
cite a number of broad principles that underpin the development of websites positively received by
users. With reference to a considerable body of literature relating to commerce websites, Huang and
Benyoucef (2012) argue that websites that are usable, built to high quality specifications, provide
quality information and support services, and that are aesthetically well-formed typically lead to
positive beliefs, attitudes and behaviours on the part of users. Each of these principles is now discussed
in more detail.
‘Usability’ is a broad term and refers to website ease of use (Huang and Benyoucef, ibid). In the
literature, usability is found to be consistently determined by an intuitive and simple website
structure and provision of functional navigation that affords users a high degree of control (Hasan and
Abuelrub, 2011; Helander and Khalid, 2000). Closely linked to usability is ‘accessibility’, which refers to
the extent to which websites are accessible for all users, particularly those with disabilities (Hasan and
Abuelrub, 2011; W3C, 2019). The Web Content Accessibility Guidelines (W3C, ibid) suggest a range of
practical measures to support accessibility which, although wide-ranging, begin with a basic adherence
to HTML and HTML5 mark-up standards, making possible the use of dictation and screen reader tools
for the visually and hearing impaired (Mozilla, 2019). Despite the relative simplicity of at least basic
adherence to such guidelines, the UK charity Ability Net (2018) finds fewer than 10% of sites accessible
under W3C guidelines.
Information quality, in the context of website user-orientation, refers to the information
characteristics of a website that help or hinder user task outcomes (Huang and Benyoucef, 2012).
Information relevance is a key characteristic of information quality cited by Susser and Ariga (2006;
Jaiswal et al., 2010; Liu and Arnett, 2000), and relates to co-location of information that supports user
tasks. In a survey of e-commerce website users ($n = 351$), Flavian et al. (2006) find perceptions
relating to information provision correlate positively with those of website usability ($\propto = 0.85$,
$p < 0.01$), and usability to positively predict user trust ($r^2 = 0.23$, $p < 0.01$) and satisfaction
($r^2 = 0.51$, $p < 0.01$) which, in turn, determine website loyalty (trust: $r^2 = 0.27$, $p < 0.01$;
satisfaction: $r^2 = 0.31$, $p < 0.01$). Information completeness, accuracy and recency are also
found to be key determinants of information quality, and to directly affect users’ ability to complete
tasks (Susser and Ariga, 2006; Hasan and Abuelrub, 2011; Jaiswal et al., 2010; Liu and Arnett, 2000).
System quality concerns technical website performance and, in particular, the effect of technical and
functional characteristics on task outcome (Huang and Benyoucef, 2012; Liu and Arnett, 2000; Lee and
Kozar, 2006). In their survey of $n = 119$ e-commerce webmasters, Liu and Arnett (2000) find
functionality most strongly related to perceptions of website quality ($\propto = 0.92$, $p < 0.01$). While
information provision ($\propto = 0.88$, $p < 0.01$) and aesthetics ($\propto = 0.83$, $p < 0.01$) are also important, if
the core functionality of a website is poor then beliefs relating to ease of use and usefulness are
fundamentally undermined. Liu and Arnett’s (ibid) finding that core functionality is so closely related
to the experience of users confirms that comprehensive assessment of website user-orientation must
consider issues on both the supply and demand-sides.
While, initially, serious consideration of aesthetics may appear relevant for only a limited range of
websites, Robins and Holmes (2008; Lee and Lee, 2003) find, in 90% of cases, aesthetic treatment to
improve perceptions of content credibility. Katerattanakul (2002) considers aesthetics in terms of the
extent to which rich media and visual applications may reduce the cognitive load of users versus text-
only consumption. Katerattanakul (ibid) finds quality website aesthetics, particularly appropriate uses
of rich media and visual applications, to promote experimentation, exploration, and to stimulate
cognitive curiosity, which is of clear relevance in the context of open data portals.
A final component of user-orientation frequently addressed in the Marketing literature is service
quality, which concerns the feedback, dialogue and issue resolution processes that support users. Liu
and Arnett (2000; Lee and Kozar, 2006; Parasuraman et al., 1994) propose that mechanisms should
exist to allow for service user and provider interaction, and that providers should be willing and able
to support users in a timely and reliable way. Jaiswal et al. (2010) find customer service provision, in
particular the availability of customer service information and personnel, to be significant
determinants of e-commerce website user satisfaction (p < 0.05).
The Marketing, and particularly e-commerce, literature therefore provides a strong general grounding
for what may or may not constitute a website that is user-oriented i.e. one that is both easy to use and
useful. However, Goodhue and Thompson (1995, p213) argue that “for an information technology to
have a positive impact on individual performance… [it] must be a good fit with the task it supports”.
This assertion suggests that, while there may be more general principles to guide website user-
orientation, particular categories or metrics of measurement are necessarily dependent on the
context of use. This idea has been formalised in the software engineering discipline by the
International Standards Organisation under ISO 25010:2011¹, which aligns ‘usability’ specifically to the
idea that technology should be ‘fit-for-purpose’. It is critical, therefore, that any assessment of open
data portal user-orientation goes beyond the more general principles detailed above, and considers
measurement categories and metrics that are specific to the context of use.
¹ ISO 25010:2011 defines usability as “the degree to which a product or system can be used by specific users to meet their
needs to achieve specific goals with effectiveness, efficiency, freedom from risk, and satisfaction in specific contexts of use”
(International Standard Organisation, 2011; in Weichbroth, 2018, p1008).
1.2.2.2 Open Data Portals: User-Orientation
In line with Simperl and Walker’s (2017) argument that simply supplying data is not enough to
guarantee downstream use value, a limited strand of recent research has sought to address user-
orientation specifically in the context of open data portals.
Sayogo et al. (2014) augment the Sunlight Foundation’s (2010) ‘Ten Principles for Opening Up
Government Information’ assessment framework, originally developed as a check-list for open data
providers to address supply-side issues such as licensing, data quality and metadata quality, with a
number of categories strictly relating to data use. Applying the framework to open data portals in 35
countries, Sayogo et al. (2014) find portals to be at different stages of development, with some
focussing only on issues relating to data provision and others also addressing issues of data use.
‘Shareability’ is found to be good overall, with 70% of portals using either social media or newsfeeds
for content distribution, while other user-orientation features such as support for non-native language
speakers and data visualisation are found only in the minority of cases.
Alexopoulos et al. (2014) suggest that less developed portals are those that align to the web 1.0 model
of basic information search and retrieval, but offer “limited capabilities for stimulating and
facilitating…value” (p239). The authors propose a range of more advanced, web 2.0 features that can
improve portal usability and stimulate positive use outcomes including mechanisms for user feedback,
collaborative working, and data quality ratings. A similar distinction between less and more mature
portals is made by Colpaert et al. (2013), who suggest that, while the former may provide only links to
open datasets, the latter also tend to offer a structured metadata catalogue as well as facilities for
data visualisation and user interaction. Colpaert et al. (ibid) also note that ‘five-star portals’, the most
advanced category of portal, address advanced supply-side issues relating to data provenance, trust
and versioning, which have a direct subsequent impact on ease of use and usefulness.
While dedicated studies into open data portal user-orientation are still relatively limited, the European
Data Portal’s (2018b) ‘Open Data Maturity in Europe’ report, which assesses the state of open data for
the 28 EU member and four EFTA states, incorporates a substantive portal measurement component
that assesses portals on four metric categories²:
• Usage monitoring: around three-quarters of portals are found to engage in systematic use
behaviour monitoring via a site analytics tool;
• Feature provision: portals are heterogeneous in the extent to which they offer basic or
advanced features; a majority provide basic search, filtering and dataset downloads, while a
minority offer advanced user-oriented features such as dataset previews or a newsfeed;
• Advanced data provision: real-time data is found to be provided by around three-quarters of
portals but accounts for a low proportion of data overall;
• Portal sustainability: the majority of portals have no strategic approach in place for portal
funding, beyond pure state provision.
² A data table providing the country level portal scores for each category is provided in Appendix A.
In alignment with the findings of studies discussed previously, the European Data Portal (ibid) report
stresses the heterogeneity of portals in their current guise and also supports the distinction between
those that are less advanced, offering only basic data provision functionality, and those that are more
advanced, addressing issues of data use as well as provision.
Recent work by Simperl and Walker (2017, 2019) is perhaps the most comprehensive effort to date to
provide an open data portal assessment framework that is explicitly user-oriented. Simperl and
Walker (ibid, p5) suggest that “it is necessary to move to the next stage in publishing, managing and
using data, by understanding the needs of average citizens and data professionals and by choosing
adequate tools to deliver the capabilities and user experience people are asking for”. Drawing on the
open data portal literature, Simperl and Walker’s (2017, 2019) framework proposes 10 ‘sustainability
principles’ (categories), comprising 47 metrics, towards which portals might aim if they are to
provide value for users. The framework does not ignore supply-side considerations such as data format
and metadata provision but frames these in the context of user-orientation: the extent to which they
support ease of use and usefulness.
Simperl and Walker’s (2017, 2019) framework represents a core theoretical and methodological
component for the current research and, as such, is covered extensively in the following chapters; a
full analysis of the authors’ framework application to 10 EU open data portals can also be found in
Appendix B. In summary, however, in line with the body of research in this area, Simperl and Walker
(2019) find considerable variation in the user-orientation of portals, with some scoring highly on
categories aligned to supply-side data provision and others on categories aligned to demand-side data
use.
1.2.2.2.1 Research into Open Data Portal User-Orientation: the ‘nDataset’ or ‘nWebpage’ Issue
Of the reported studies³ that exclusively or substantively measured open data portal user-orientation,
all three employed a manual approach to data collection, either via direct observation and feature
scoring or portal provider self-assessment questionnaire. A manual approach to data collection limits
the scale of the research to tens or possibly hundreds of webpages and datasets for any given portal
(Bauer and Scharl, 2000; Gibson, 2011; Trilling and Jonkman, 2018).
³ Sayogo et al. (2014); Simperl and Walker (2019); European Data Portal (2018).
While Sayogo et al. (2014) report a study of open data portals in $n = 35$ countries, as the approach to
collection of data relating to user-oriented metrics is manual, it is not feasible for the authors to
examine every portal dataset or page; worryingly, the actual number of datasets and pages examined
is not reported by the authors. A similar manual assessment approach is also taken by Simperl and
Walker (2019).
Petychakis et al. (2014), in their supply-side analysis of portals for all 28 European Union member
states, are limited to metadata analysis of a total of $n = 3{,}466$ datasets. To put this in context, at the
point of writing, the European Data Portal reports a total of $n = 893{,}161$ open datasets made
available by European Union member state portals. With regard to the work of Petychakis et al.
(2014), while a dataset sample size of $n = 3{,}466$ far exceeds requirements specified by, for example,
Cochran (1977)⁴, computational approaches to data collection and analysis offer the opportunity to
scale the analysis of open data portals to the dataset population level (i.e. $n = \text{all}$). Such an approach
may be particularly advantageous when, as has been discussed, there remains strong heterogeneity in
data provision and use standards both between and within portals.
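The sample size comparison made in this paragraph can be reproduced with a short calculation. The Python sketch below applies Cochran's formula with the values given in the accompanying footnote (95% confidence, a 50/50 proportion and a 5% margin of error) and then expresses the Petychakis et al. (2014) sample as a share of the dataset total reported by the European Data Portal. The function and variable names are illustrative only.

```python
import math


def cochran_sample_size(z: float, p: float, e: float) -> int:
    """Cochran's n0 = Z^2 * p * q / e^2, rounded up to a whole unit."""
    q = 1.0 - p
    n0 = (z ** 2) * p * q / (e ** 2)
    return math.ceil(n0)


# 95% confidence (Z = 1.96), unknown proportion (p = 0.5), 5% margin of error
required = cochran_sample_size(z=1.96, p=0.5, e=0.05)  # 385

# Figures quoted in the text: datasets analysed versus datasets available
analysed = 3466
available = 893161
sample_share = analysed / available  # fraction of the population examined
```

On these figures the sample comfortably exceeds the 385 records the formula calls for, yet covers well under 1% of the available datasets, which is the gap a population level approach is intended to close.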
Regarding portal provider self-assessment questionnaires used by, for example, the European Data
Portal (2018b), given the political requirement for modern administrations to demonstrate openness
and transparency under the PSI movement (Hansen et al., 2013; Reggi and Ricci, 2011; Ubaldi, 2013;
Zuiderwijk et al., 2014), such methods of data collection are particularly susceptible to self-reporting
bias (Podsakoff et al., 2003). While, as is the case for research by the European Data Portal (2018b),
appropriate measures can be taken to improve generalisability and validity such as expert-led cross-
checking (Creswell, 2009), this is still restricted by the labour and resource requirements demanded by
a manual approach to data collection and analysis.
1.3 Conceptualising Open Data Portal User-Orientation
To conclude this review of literature, with the aim of safeguarding the future of open data portals,
user-orientation has been identified as an important topic for both portal providers and researchers.
Website user-orientation is a well-established topic in the Marketing and e-commerce literature and,
while research into open data portal user-orientation concerns many similar theoretical and more
general issues, there is a requirement for further development of research in this area that addresses
contextually specific features relating to both data provision and data use.
⁴ Cochran’s (ibid) formula of $n_0 = \frac{Z^2 p q}{e^2}$, where $e$ = margin of error; $p$ = estimated population proportion; $q = 1 - p$; applied to
any given population where an unknown proportion of the population (50/50) has the attribute in question, a desired
confidence level of 95% and an acceptable margin of error of 5%:
$\frac{(1.96)^2 (0.5)(0.5)}{(0.05)^2} = 384.16$, rounded up to 385.
Simperl and Walker’s (2017, 2019) portal assessment framework advances an explicit and
comprehensive approach for the measurement of open data portal user-orientation; however,
heterogeneity of data provision and use features, within and between portals, makes generalisation of
findings from resource-restricted, manual analyses difficult.
With this context in mind, drawing on more general theories of user-orientation from the Marketing
literature and Simperl and Walker’s (2017, 2019) contextually specific research into open data portals,
Figure 1-1 presents an interdisciplinary theoretical framework to underpin the analysis of open data
portal user-orientation. The framework resolves conflict (Repko, 2006; Klein, 2016; Klein and Newell,
1996) and finds ‘common ground’ (Repko, 2008) at the theoretical level between the study of open
data portal user-orientation in the Computer and Information Sciences, and technology acceptance
and adoption in Marketing. Chapter 2 provides a methodological approach for computational
application of this framework, which is scalable and aims to reduce assessment resource costs,
improve consistency (Parmanto and Zeng, 2005) and so address the ‘nDataset’ and ‘nWebpage’ issue.
As indicated by Figure 1-1, the TAM (Davis, 1989; Venkatesh, 2000) is used as a conceptual basis for
this study, although it is not operationalised methodologically. Simperl and Walker’s (2017, 2019) 10
categories of open data portal user-orientation are roughly aligned to the TAM’s ‘ease of use’ and
‘usefulness’ components, demand-side (data use) oriented categories to the former and
supply-side (data provision) oriented categories to the latter. Figure 1-1 also illustrates a conceptual link
between the TAM, open data portal user-orientation and Ajzen and Fishbein's (1980) TRA. While the TRA is
not operationalised methodologically here, it helps to explain the logic behind the proposed
approach: the belief, on the part of a user, that an open data portal is ‘easy to use’ and ‘useful’, as
determined by the degree to which it is user-oriented, subsequently affects use and re-use attitudes and
behaviours.
Figure 1-1: Theoretical framework
[Figure elements: Technology Acceptance Model (Davis, 1989): Ease of use (belief), Usefulness (belief).
Open Data Portal User-Orientation (Simperl and Walker, 2017; 2019): Be discoverable, Publish metadata,
Promote standards, Linked data, Organise for use, Promote use, Co-locate documentation, Be measurable,
Co-locate tools, Be accessible. Theory of Reasoned Action (Ajzen and Fishbein, 1980): Attitude to act;
use behaviour.]

Chapter 2 Methodology
2.1 Research Questions
Based on the review of literature, two research questions are stated for this study:
RQ1: How user-oriented are open data portals with respect to:
RQ1a: Provision of features aligned to data use versus data provision?
RQ1b: Provision of specific features that promote portal use and re-use?
RQ2: What are the empirically observed limitations of a computational approach to measuring open
data portal user-orientation?
The first research question underpins the empirical contribution made by the study and requires, first,
that an overall assessment be made of the general supply versus demand-side orientation of open
data portals and, second, a more specific assessment of Simperl and Walker’s (2017, 2019) user-
orientation features. The second research question underpins the methodological contribution made
by the study, and requires the application and post-hoc critical assessment of a computational
response to research question one. The combination of these two research questions addresses, first,
the relative scarcity of literature relating explicitly to open data portal user-orientation and, second,
the current reliance on manual approaches to portal evaluation that compromise the generalisability
of findings in this area.
2.2 Research Design
2.2.1 Philosophical Grounding
The approach taken here is quantitative in nature. The methodology is grounded in post-positivist
philosophical assumptions insofar as it identifies particular variables to study, and collects and critically
analyses numerical information relating to user-orientation (Creswell, 2009). However, this grounding
comes with some strong caveats. The methodology is also rooted in a contemporary view of
empiricism, made possible by computational approaches to data collection and analysis, and closely
aligned to big data methods. The data analysed here fits some, although not all, of the big data
characteristics detailed previously by Chen et al. (2014):
• It is of high ‘variety’, comprising both structured and unstructured data;
• It is of high ‘value’ but low density, as the data and metadata quality of portals tends to be
heterogeneous even within individual portals (Verma and Gupta, 2015). Thus, in order to
make assertions about the user-orientation of a particular data portal, it is not enough to
analyse one or two datasets or webpages; rather, it is necessary to analyse many or all;
• While the size of the data analysed here is deliberately restricted in its ‘volume’
(approximately 2GB) and ‘velocity’, both the data and the approach employed could be scaled
relatively simply to incorporate these characteristics.
Critically, the approach taken here matches a further big data analysis characteristic detailed by
Kitchin (2013 in Kitchin 2014, p1): “[it is] exhaustive in scope, striving to capture entire populations or
systems ($n = \text{all}$)”. The exhaustiveness of the proposed approach means that, instead of taking
dataset or webpage samples from open data portals and scaling findings from sample to
population level by way of either statistical inference or qualitative generalisation (Creswell, 2009), here the
aim is to extract population level data and to analyse that complete set of data (Anderson, 2008;
Dyche, 2012; Prensky, 2009).
This approach locates the current research within what Hey et al. (2009) term the ‘fourth paradigm of
science’, which is exploratory in its epistemology and employs data-intensive techniques in the
statistical exploration and mining of population level datasets. Possible applications for the use of such
quantitative approaches in the evaluation of website quality, suggest Bauer and Scharl (2000, p31-32),
amount to three core areas:
• Snapshot analysis: “Analysis of a large number of websites at a given time allows comparison
of individual criteria (e.g. means, variances, or other statistical parameters)”;
• Longitudinal analysis: “A defined set of websites can be documented and analysed over a
longer period of time, i.e. a series of snapshot analyses”;
• Comparative analysis: “the analysis of…competitors’ efforts [to] provide decision makers with
reliable benchmark data about relative differences”.
While the current study is most closely aligned to the first of these applications, with further
development the approach employed here could, again, be adapted to any of these three use cases.
2.2.2 Data Collection and Analysis
2.2.2.1 Sampling
As detailed above, the study aims to advance an approach towards the measurement of open data
portal user-orientation that, eventually, negates the requirement for data sampling. However, while
analysis of dataset metadata is comprehensive ($n = \text{all}$), a limitation of 12 portals overall sets clear
restrictions on the generalisability of findings. Additionally, the extraction of webpage data for the
study was subject to deliberate restrictions in order to limit the size of data in line with time and
resource constraints. Further detail and rationale for both of these sampling approaches are provided
in the following sections.
2.2.2.1.1 Open Data Portal Sampling
The target population for this study was national level European open data portals. The sampling
frame was a list of 81 European open data portals provided by the European Data Portal. The sampling
frame was narrowed to a list of 20 national portals running on the Comprehensive Knowledge Archive
Network (CKAN) data management platform. Portals were limited to those using CKAN as this allowed
for data extraction from a maximum number of portals, minimising the need for data extraction and
processing script customisations within the three-month timeframe of the study.
Following a trial analysis of one portal, a convenience sampling approach was used to select a further
11 portals for the study. Given the timeframe for the study, of which one month was allocated for
data extraction and analysis (including script development), and the requirement for a degree of
customisation to the code for each portal, it was felt that this was an achievable number of portals.
The final sample therefore included the national open data portals for the following countries: Austria,
Croatia, Denmark, Germany, Ireland, Latvia, Netherlands, Romania, Slovenia, Switzerland, Ukraine and
the United Kingdom.
As noted by Bryman (2016, p187), a convenience sample is a non-probability sample “simply available
to the researcher by virtue of its accessibility [and] …it is therefore impossible to generalise the
findings, because we do not know of what population this sample is representative”. The findings from
this study are therefore only directly applicable to the 12 data portals from which data has been
extracted and analysed.
2.2.2.1.2 Webpage Sampling
A second sampling limitation imposed on the study relates to restrictions to the number of webpages
analysed for each portal. As is discussed below, the study draws on two key data sources for each
portal: metadata for each portal dataset, and HTML data scraped from portal webpages using a web
scraping tool. While the first of these data sources, dataset metadata, is complete for all 12 portals
(i.e. $n = \text{all}$ or population level data), webpage data was extracted only for webpages at a depth equal to the
subfolder depth at which each portal makes its datasets accessible. This limitation was imposed, first,
to limit the volume of data gathered for the study with respect to timeframe and storage resource
and, second, to ensure analysis of user-orientation was focussed on webpages explicitly provided for
dataset access.
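The depth restriction described above can be illustrated with a small filter. The Python sketch below assumes that subfolder depth is measured as the number of non-empty path segments in a URL, which is an interpretation rather than a detail stated in the study; the function names and example URLs are hypothetical.

```python
from typing import List
from urllib.parse import urlparse


def path_depth(url: str) -> int:
    """Number of non-empty path segments, e.g. /dataset/air-quality has depth 2."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def pages_at_dataset_depth(urls: List[str], dataset_url: str) -> List[str]:
    """Keep only pages whose depth equals that of a known dataset page."""
    target = path_depth(dataset_url)
    return [url for url in urls if path_depth(url) == target]


# Hypothetical crawl results for an invented portal
crawled = [
    "https://data.example.org/",
    "https://data.example.org/dataset",
    "https://data.example.org/dataset/bus-stops",
    "https://data.example.org/organization/city-council",
    "https://data.example.org/dataset/bus-stops/resource/abc",
]
selected = pages_at_dataset_depth(crawled, "https://data.example.org/dataset/air-quality")
```

A depth rule of this kind retains non-dataset pages that happen to sit at the same level, such as the organisation page in the example, so the selection is only as focused as the portal's URL structure allows.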

2.2.2.2 Analytics Pipeline
For this study, extract, load, transform (ELT) was employed as a variation on the traditional extract,
transform, load (ETL) analytics pipeline. ELT differs from ETL as data transformations are actioned after
the data is loaded into storage and “typically works well when your target system is powerful enough
to handle transformations” (Amazon Web Services, 2016, p9). As the total volume of data processed
for the study was only just over 2GB, all data extractions were loaded into the R software environment for
statistical computing, and transformations and analysis were conducted therein. As the data was limited in
size, neither a database nor virtual machines were used for the study; however, if the project were to
be scaled across a greater number of portals or adopt a real-time or longitudinal approach, storage
and processing requirements would need to be considered.
2.2.2.2.1 Metadata ELT
The R software environment for statistical computing was chosen as the programming environment for
metadata ELT, primarily due to the rOpenSci ‘ckanr’ package (Chamberlain et al., 2019), which provides a
dedicated metadata extraction interface for the CKAN API. Figure 2-1 provides an example of the ckanr
extraction script used to extract dataset package metadata via the CKAN API for the majority of
portals. The script uses a loop to extract all metadata for all dataset packages for the specified portal
in blocks of 1,000, which are appended to the list vector api_result.
api_result contains a nested list of metadata for all dataset packages for the specified portal. Dataset
packages also contain metadata for any resources (data files) that may be associated with that
package.
Figure 2-1: ckanr dataset metadata extraction script
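For readers without access to the figure, the block-wise extraction logic described above can be sketched as follows. This is an illustration in Python rather than the R script used in the study: the function names are hypothetical, and fetch_page stands in for whatever call retrieves one block from the CKAN package_search action, returning its result object with the total count and the list of results. A stub replaces the network call so that the loop can be run in isolation.

```python
from typing import Callable, Dict, List


def extract_all_packages(
    fetch_page: Callable[[int, int], Dict],
    block_size: int = 1000,
) -> List[dict]:
    """Collect every dataset package by requesting successive blocks."""
    api_result: List[dict] = []
    start = 0
    while True:
        page = fetch_page(start, block_size)  # one block of package metadata
        results = page["results"]
        api_result.extend(results)
        start += len(results)
        # Stop when the portal returns nothing more or the total is reached
        if not results or start >= page["count"]:
            break
    return api_result


def fake_fetch(start: int, rows: int) -> Dict:
    """Stub for illustration: an invented catalogue of 2,500 packages."""
    catalogue = [{"id": f"pkg-{i}"} for i in range(2500)]
    return {"count": len(catalogue), "results": catalogue[start:start + rows]}


api_result = extract_all_packages(fake_fetch)
```

Passing the fetch step in as a function keeps the paging logic independent of any one portal, which matters when some portals need a different extraction route from others.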

In some instances, the ckanr package was found not to interface correctly with portal APIs (Austria,
Denmark, United Kingdom). As illustrated in Figure 2-2, for these portals, custom scripts were written
to query the CKAN API directly, extracting dataset package metadata in nested JSON format and
appending to api_result.
As metadata is made available by the CKAN API in either nested list or JSON format, transformation
was required to render the data in a relational tabular format (an R ‘dataframe’) suitable for analysis.
A number of R packages were used to perform this transformation, including the tidyverse (Wickham,
2017)⁵ packages purrr and dplyr, and data.table (Dowle, 2019), as well as jsonlite (Ooms, 2018) and rlist
(Ren, 2016) for the portals where the ckanr package could not be used for data extraction. Dataset
package IDs were used as the primary key.
Metadata transformation involved two steps: first, invocation of a function, ‘m1’, to unlist
and tabulate dataset package metadata and, second, development and invocation of a function, ‘r1’,
to unlist and tabulate metadata pertaining to particular data resource files. A third function, ‘f’, was
invoked from within ‘r1’ to simplify the process of unpacking the resource metadata nested within each
dataset package. Figures 2-3 and 2-4 detail the code used for this process.
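The two-step transformation can be illustrated with a short Python sketch. It is not the ‘m1’, ‘r1’ or ‘f’ code from the figures: the function names `scalar_fields` and `tabulate`, the `package_id` column and the sample record are assumptions made for illustration. The sketch only shows the general idea of turning one nested package record into a package-level row plus one row per resource, linked by the dataset package ID.

```python
from typing import Dict, List, Tuple


def scalar_fields(record: Dict) -> Dict:
    """Keep only non-nested values so the record fits in one table row."""
    return {k: v for k, v in record.items() if not isinstance(v, (list, dict))}


def tabulate(api_result: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (package rows, resource rows); resources carry their package ID."""
    package_rows, resource_rows = [], []
    for package in api_result:
        package_rows.append(scalar_fields(package))
        for resource in package.get("resources", []):
            row = scalar_fields(resource)
            row["package_id"] = package["id"]  # key linking back to the package
            resource_rows.append(row)
    return package_rows, resource_rows


# Invented example record shaped like a nested dataset package.
sample = [{
    "id": "pkg-1",
    "title": "Road traffic counts",
    "num_tags": 2,
    "tags": [{"name": "transport"}, {"name": "roads"}],
    "resources": [
        {"id": "res-1", "format": "CSV"},
        {"id": "res-2", "format": "PDF"},
    ],
}]

packages, resources = tabulate(sample)
print(packages)
print(resources)
```

Keeping resources in a separate table, keyed by package ID, avoids repeating package-level metadata on every resource row.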
⁵ ggplot2 (Wickham, 2017) was used for all data visualisation.
Figure 2-2: Direct CKAN API call dataset metadata extraction script

2.2.2.2.2 Webpage Data ELT
As it was not feasible to develop a custom web crawler within the three-month project timeframe,
portal HTML webpage data was scraped using the industry-standard Screaming Frog SEO Spider, a
desktop program which crawls websites' links, images, CSS, script and apps (Screaming Frog, 2019).
As Screaming Frog returns data in a relational format for all URIs specified by the crawl configuration,
webpage data required only limited transformation beyond .csv export from Screaming Frog and
superficial treatment of blank values.
The following custom configurations were applied to Screaming Frog to extract the data required for
analysis:
• Crawl depth: for each portal, maximum crawl depth was set as equal to the depth of dataset
pages. For example, datasets for the open data portal for Ukraine reside at a sub-folder
depth of three: https://data.gov.ua/dataset/c98e830c-e39e-4da6-a13c-f9ba32a79bec;
• Query strings: for all portals, query strings generated by inbound or internal search traffic
were removed from the crawl to avoid analysis of duplicate base URIs;
• Redirects: for all portals, redirects were followed to ensure that pages served via 301 and 302
redirects, which may be used by some portals, were crawled.
Figure 2-3: ‘m1’ function declaration for dataset package metadata transformation
Figure 2-4: ‘f’ and ‘r1’ function declaration for dataset resource metadata transformation
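The effect of the query-string rule in the crawl configuration can be illustrated outside the crawler. The Python sketch below is not part of the Screaming Frog configuration; the function names and the example query strings are invented for illustration. It shows how removing query strings collapses search-generated variants onto a single base URI.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_base_uris(urls: List[str]) -> List[str]:
    """Return each base URI once, in first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        base = strip_query(url)
        if base not in seen:
            seen.add(base)
            unique.append(base)
    return unique


crawled = [
    "https://data.gov.ua/dataset?q=budget",         # invented query string
    "https://data.gov.ua/dataset?q=health&page=2",  # invented query string
    "https://data.gov.ua/dataset",
]
print(unique_base_uris(crawled))  # ['https://data.gov.ua/dataset']
```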
2.2.2.3 Metric Construction
Following the ELT processes detailed above, dataset metadata and webpage data for all portals were
suitable for analysis according to Simperl and Walker’s (2017, 2019) framework. As this framework was
originally developed with a manual approach to data analysis in mind, its composite metrics required
translation to mathematical form to render the framework suitable for computational application.
Tables 2-1 to 2-10 detail the translation to computationally functional, mathematical form for each
of the 10 categories and 47 metrics proposed by Simperl and Walker (2017, 2019). As indicated,
computational translation was not achieved for a number of metrics as they were found to be
ambiguous, to require development work beyond the scope of this project, or to be duplicates or
near-duplicates. Bearing these restrictions in mind, the final count of measurement categories and
composite metrics operationalised in the research is nine (out of 10) and 26 (out of 47) respectively.
For each metric the original naming and natural language description used by Simperl and Walker
(2019) is provided, as well as the natural language description of the mathematical form and
translation to notation. Also provided are sources of literature that support each translation and the
inclusion of each metric. The mathematical representation of the metrics proposed here draws heavily
on the work of Olsina et al. (2001; see also Bauer and Scharl, 2000, p35; Olsina et al., 1999).
The R scripts used to apply these metrics computationally can be found in the Accompanying
Materials.

Table 2-1: Organise for use metric table
Organise for use: “portals need to be organised for use of the datasets, rather than simply for publication” (Simperl and Walker, 2019, p7):
• The maximum possible score for each metric is '1' i.e. 100% of datasets;
• The maximum mean score for this category is therefore also '1'.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation ¹,²; Notation ³; Source

a. Descriptive record
Natural language description: Each dataset is accompanied by a comprehensive descriptive record (going beyond a collection of structured metadata)
Calculation: sum datasets with completed CKAN ‘notes’ metadata field / sum datasets
Notation: $\frac{\sum_{\mathit{notes}=1}^{N} \mathit{notes}}{\sum_{\mathit{datasets}=1}^{N} \mathit{datasets}}$
notes: Datasets with completed CKAN ‘notes’ metadata field
datasets: Total portal datasets
Source: Primary source: Simperl and Walker (2019). Secondary validation: Olsina (1999 in Bauer and Scharl, 2000); Huang and Benyoucef (2013); Susser and Ariga (2006); Jaiswal et al. (2010); Liu and Arnett (2000); Flavian et al. (2006)

b. Preview
Natural language description: An extract of the data can be previewed (for easier sense making)
Calculation: sum of datasets where CKAN ‘datastore_active’ metadata field == TRUE / sum datasets
Notation: $\frac{\sum_{\mathit{datastore\_active}=1}^{N} \mathit{datastore\_active}}{N}$
datastore_active: Datasets where CKAN ‘datastore_active’ metadata field == TRUE
Source: Secondary validation: Katerattanakul (2002); Sayogo et al. (2014)

c. Recommendations
Natural language description: The portal provides recommendations for related datasets
Calculation: As the primary means of linking between datasets for CKAN portals is via keywords, this metric is addressed by metric 1e

d. Ratings
Natural language description: The portal enables users to review/rate the datasets
Calculation: As there is no consistent mechanism for permitting user ratings via the CKAN data platform, this metric could not be translated to computational form

e. Keywords
Natural language description: Keywords from datasets are linked to other published datasets
Calculation: sum datasets where CKAN ‘num_tags’ metadata field >= 1 / sum datasets
Notation: $\frac{\sum_{\mathit{num\_tags}=1}^{N} \mathit{num\_tags}}{N}$
num_tags: Datasets where CKAN dataset ‘num_tags’ metadata field >= 1
Source: Primary source: Simperl and Walker (2019). Secondary validation: Verma and Gupta (2015); Alexopoulos et al. (2014)

¹ “The CKAN Datastore extension provides an ad hoc database for storage of structured data from CKAN resource” (CKAN, 2018a). It must be active for a dataset to be previewed on a portal.
² The CKAN num_tags metadata field provides a count of tags associated with each dataset.
³ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
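The proportion metrics in Table 2-1 share one pattern: count the datasets that satisfy a condition and divide by the total number of datasets. The Python sketch below illustrates that pattern only; it is not the R code from the Accompanying Materials. It assumes that each dataset has already been flattened to a record with ‘notes’, ‘datastore_active’ and ‘num_tags’ keys, and the four sample records are invented.

```python
from typing import Callable, Dict, List


def proportion(datasets: List[Dict], passes: Callable[[Dict], bool]) -> float:
    """Share of datasets for which the condition holds (0.0 for an empty portal)."""
    if not datasets:
        return 0.0
    return sum(1 for d in datasets if passes(d)) / len(datasets)


def has_notes(d: Dict) -> bool:
    """Metric 1a: 'notes' field is completed (not missing, not blank)."""
    return bool((d.get("notes") or "").strip())


def has_preview(d: Dict) -> bool:
    """Metric 1b: 'datastore_active' is TRUE (assumed flattened to the dataset row)."""
    return d.get("datastore_active") is True


def has_keywords(d: Dict) -> bool:
    """Metric 1e: 'num_tags' is at least 1."""
    return (d.get("num_tags") or 0) >= 1


# Invented sample portal of four datasets.
datasets = [
    {"notes": "Hourly traffic counts", "datastore_active": True, "num_tags": 3},
    {"notes": "", "datastore_active": False, "num_tags": 0},
    {"notes": "Air quality readings", "datastore_active": False, "num_tags": 2},
    {"notes": None, "datastore_active": True, "num_tags": 1},
]

scores = {
    "1a": proportion(datasets, has_notes),
    "1b": proportion(datasets, has_preview),
    "1e": proportion(datasets, has_keywords),
}
category_mean = sum(scores.values()) / len(scores)
print(scores, category_mean)
```

For this sample, metrics 1a and 1b score 0.5 and metric 1e scores 0.75, so each metric and the category mean stay within the maximum of '1' stated for the table.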

Table 2-2: Promote use metric table
Promote use: “promote use of open data portals, through the sharing of knowledge and co-opting methods” (Simperl and Walker, 2019, p9):
• The maximum possible score for metrics 2a, 2b, 2d is ‘1’ i.e. 100% of either datasets or webpages;
• The maximum possible score for metric 2c is also ‘1’ i.e. ‘newsfeed present’=1 or ‘newsfeed absent’=0;
• The maximum mean score for this category is therefore also ‘1’.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation ¹,²,³; Notation ⁴; Source

a. Social media links
Natural language description: The portal is connected with social media to create a social distribution channel for open data
Calculation: sum webpages with social share code installed / sum webpages
Notation: $\frac{\sum_{\mathit{social}=1}^{n} \mathit{social}}{\sum_{\mathit{pages}=1}^{n} \mathit{pages}}$
social: Webpages with social share code installed
pages: Webpages crawled
Source: Secondary validation: Alexopoulos et al. (2014); O’Reilly (2007); Sayogo et al. (2014)

b. Feedback and support
Natural language description: The portal provides users with online support for feedback, to request/suggest the publication of new datasets, and when problems arise during use
Calculation: sum datasets with at least one completed CKAN ‘contact’ details metadata field / sum datasets
Notation: $\frac{\sum_{\mathit{contacts}=1}^{N} \mathit{contacts}}{N}$
contacts: Datasets with at least one completed CKAN ‘contact’ metadata field
Source: … 2000); Huang and Benyoucef (2013); Susser and Ariga (2006); Jaiswal et al. (2010); Liu and Arnett (2000); …

c. Newsfeed
Natural language description: The portal provides a way for users to keep informed of updates to the data (e.g. news feed)
Calculation: portal provides newsfeed functionality (1/0)
Notation: $N_{\mathit{feedurl}} \geq 1 \Rightarrow \mathit{newsfeed} = 1$; $N_{\mathit{feedurl}} = 0 \Rightarrow \mathit{newsfeed} = 0$
feedurl: RSS or ATOM newsfeed URL
newsfeed: Newsfeed metric score
Source: Secondary validation: Sayogo et al. (2014)

d. Guidance
Natural language description: Datasets are accompanied by links or resources that provide user guidance and support
Calculation: sum datasets with completed CKAN ‘notes’ and CKAN ‘contact’ metadata fields / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{notes}_i + \mathit{contacts}_i)}{\sum_{\mathit{datasets}=1}^{N} \mathit{datasets}}$
notes: Datasets with completed CKAN ‘notes’ metadata field
contacts: Datasets with at least one completed CKAN ‘contact’ metadata field
Source: … 2000); Huang and Benyoucef (2013); Susser and Ariga (2006); Jaiswal et al. (2010); Liu and Arnett (2000); …

e. Examples
Natural language description: Examples of re-use (fictitious or real) are provided
Calculation: As there is no standard format for dataset use case examples, this metric could not be translated to computational form

¹ The web crawler was configured to identify social share code from Facebook (2019) and Twitter (2019).
² The web crawler was configured for each portal to run to the dataset level page depth (in most cases three sub-folders deep).
³ CKAN typically provides three contact fields: ‘contact name’, ‘contact phone’, ‘contact email’.
⁴ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
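Metrics 2b, 2c and 2d in Table 2-2 can be illustrated with a short Python sketch. It follows the natural language descriptions in the table: at least one contact field for 2b, a binary newsfeed score for 2c, and both ‘notes’ and a contact field for 2d. The key names `contact_name`, `contact_phone` and `contact_email`, the feed URL and the sample records are assumptions for illustration, not the study's R implementation.

```python
from typing import Dict, List

# Assumed key names for the three CKAN contact fields mentioned in footnote 3.
CONTACT_FIELDS = ("contact_name", "contact_phone", "contact_email")


def is_complete(value) -> bool:
    """A field counts as completed when it is present and not blank."""
    return value is not None and str(value).strip() != ""


def has_contact(d: Dict) -> bool:
    """Metric 2b: at least one contact field is completed."""
    return any(is_complete(d.get(field)) for field in CONTACT_FIELDS)


def has_guidance(d: Dict) -> bool:
    """Metric 2d: 'notes' and at least one contact field are both completed."""
    return is_complete(d.get("notes")) and has_contact(d)


def newsfeed_score(feed_urls: List[str]) -> int:
    """Metric 2c: 1 if at least one RSS or ATOM feed URL was found, else 0."""
    return 1 if len(feed_urls) >= 1 else 0


# Invented sample records.
datasets = [
    {"notes": "Bus timetables", "contact_email": "data@example.org"},
    {"notes": "Bus stops", "contact_email": ""},
    {"notes": "", "contact_name": "Open data team"},
]

score_2b = sum(has_contact(d) for d in datasets) / len(datasets)
score_2d = sum(has_guidance(d) for d in datasets) / len(datasets)
score_2c = newsfeed_score(["https://example.org/feeds/dataset.atom"])
print(score_2b, score_2c, score_2d)
```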

Table 2-3: Be discoverable metric table
Be discoverable: “use good quality metadata and more advanced search tools on portals to improve discoverability” (Simperl and Walker, 2019, p10):
• The maximum possible score for metric 3e is ‘1’ i.e. 100% of datasets;
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation; Notation ²; Source

a. Publisher portal
Natural language description: The publisher/owner of the data has an open data portal (or similar search mechanism)
Calculation: The metric is implicit in the analysis

b. Searchable datasets
Natural language description: The publisher/owner of that portal publishes an updated, searchable list of datasets
Calculation: This metric could not be translated to computational form

c. Searchable datasets with synonyms
Natural language description: … publishes an updated, searchable list of datasets with synonyms

d. Searchable datasets that are unavailable
Natural language description: … publishes a list of datasets which are known to exist but are not currently available (limiting …

e. Updated datasets ¹
Natural language description: The publisher/owner of the portal publishes datasets that are updated according to a stated update frequency
Calculation: sum datasets where ((current system date - CKAN ‘frequency’ metadata field) > (current system date - CKAN ‘metadata_modified’ metadata field)) / sum datasets
Notation:
1. $\mathit{sched} = \mathit{sysdate} - \mathit{freq}$
sched: Dataset last scheduled update date
sysdate: System date-time
freq: Dataset scheduled update frequency as per CKAN ‘update_frequency’ metadata field
2. $\mathit{updated} = \mathit{sysdate} - \mathit{modified}$
updated: Dataset last updated
modified: Dataset last modified as per CKAN ‘metadata_modified’ metadata field
3. $\frac{\sum_{(\mathit{sched} \geq \mathit{updated})=1}^{N} (\mathit{sched} \geq \mathit{updated})}{N}$
Source: Secondary validation: N/A

¹ This is an additional metric, added to address the ‘update’ component of Simperl and Walker’s (2019) ‘Be Discoverable’ category. The metric compares the furthest historical date at which a dataset should have been updated (based on the current system date MINUS a stated update frequency) with the number of days since the dataset was last updated (based on the current system date MINUS the last time the metadata was modified). The metric returns the proportion of datasets that are up-to-date.
² $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
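One possible reading of the ‘updated datasets’ metric (3e) is sketched below in Python. It treats a dataset as up-to-date when its last modification falls on or after the date by which an update was due, i.e. the system date minus the stated frequency. This reading, the representation of the update frequency as a number of days (`freq_days`) and the sample records are assumptions for illustration; the study's own R implementation may differ.

```python
from datetime import date, timedelta
from typing import Dict, List


def is_up_to_date(sysdate: date, freq_days: int, modified: date) -> bool:
    """True when the last modification is no older than one update interval."""
    sched = sysdate - timedelta(days=freq_days)  # furthest date an update was due
    return modified >= sched


def updated_score(datasets: List[Dict], sysdate: date) -> float:
    """Proportion of datasets that are up-to-date on the given system date."""
    if not datasets:
        return 0.0
    hits = sum(
        1 for d in datasets
        if is_up_to_date(sysdate, d["freq_days"], d["modified"])
    )
    return hits / len(datasets)


# Invented sample: update frequency in days and last 'metadata_modified' date.
today = date(2019, 9, 1)
datasets = [
    {"freq_days": 30, "modified": date(2019, 8, 15)},   # modified within the interval
    {"freq_days": 30, "modified": date(2019, 6, 1)},    # overdue
    {"freq_days": 365, "modified": date(2019, 1, 10)},  # modified within the interval
    {"freq_days": 7, "modified": date(2019, 8, 1)},     # overdue
]
print(updated_score(datasets, today))  # 0.5
```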

Table 2-4: Publish metadata metric table
Publish metadata: “publishing good quality metadata is fundamental to enhance re-use, findability and cataloguing, as well as to make associations and relationships between datasets” (Simperl and Walker, 2019, p10):
• The maximum possible score for metric 4e is '1' i.e. 100% of possible DCAT-AP mandatory and recommended metadata fields;
• The maximum mean score for this category is therefore also '1'.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation; Notation ²; Source

a. Metadata ignorance
Natural language description: Metadata is not documented;
Calculation: Addressed by metric 4e

b. Scattered or closed metadata
Natural language description: Metadata may be partially documented but a) not in a centralised and structured way or b) it is not available and accessible under an open licence;
Calculation: This metric could not be translated to computational form and is reconfigured under metric 4e

c. Open metadata for humans
Natural language description: Metadata is documented and becomes available as "Open Metadata" for re-use, but are not systematically published in a reusable format;

d. Open reusable metadata
Natural language description: Metadata is centrally managed, and published as "Open Metadata", in a machine-readable format and/or an API is provided for computers to access, query and re-use the available metadata;

e. Linked open metadata
Natural language description: DCAT-AP mandatory and recommended metadata fields are provided by the portal ¹
Calculation: sum DCAT-AP completed mandatory and recommended metadata fields / sum possible DCAT-AP mandatory and recommended metadata fields
Notation: $\frac{\sum_{\mathit{dcatcomp}=1}^{N} \mathit{dcatcomp}}{\sum_{\mathit{dcat}=1}^{N} \mathit{dcat}}$
dcatcomp: DCAT-AP completed mandatory and recommended metadata fields
dcat: Possible DCAT-AP mandatory and recommended metadata fields
Source: Secondary validation: European Commission (2015); Pullmann et al. (2019); Maali et al. (2010)

¹ The original metric was described by Simperl and Walker (2019, p11) as “semantic assets are documented using linked data principles and are managed by advanced metadata management systems”. As this was considered to be too ambiguous for computation, the revised metric measures the proportion of DCAT-AP v1.2.1 mandatory and recommended dataset metadata fields that are completed for portal datasets. Of the two mandatory and five recommended fields proposed by the European Commission (2015, p12-13), both mandatory fields (‘description’ and ‘title’) and three of the five recommended fields (‘contact point’, ‘keyword’, ‘publisher’) are included here; the remaining two recommended fields (‘distribution’ and ‘theme’) were not considered in the scope of this project.
² $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
N.B. the identification that a DCAT-AP mandatory or recommended metadata field is complete for a dataset does not indicate that it is provided according to the appropriate RDF Schema vocabulary specified by the DCAT-AP, only that the equivalent CKAN metadata field is found to be complete. A full DCAT-AP to CKAN metadata mapping table can be found in Appendix D.
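The field-completeness calculation for metric 4e can be illustrated with a short Python sketch. The DCAT-AP to CKAN mapping used here is a hypothetical stand-in, not the mapping from Appendix D, and the sample records are invented. The sketch counts completed fields across all datasets and divides by the number of possible fields (five per dataset).

```python
from typing import Dict, List

# Hypothetical mapping from the five DCAT-AP fields in scope to CKAN-style keys.
DCAT_TO_CKAN = {
    "title": "title",
    "description": "notes",
    "contact point": "contact_email",
    "keyword": "tags",
    "publisher": "organization",
}


def is_complete(value) -> bool:
    """A field counts as completed unless it is missing, blank or empty."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return True


def dcat_completeness(datasets: List[Dict]) -> float:
    """Completed in-scope fields divided by possible in-scope fields."""
    possible = len(datasets) * len(DCAT_TO_CKAN)
    if possible == 0:
        return 0.0
    completed = sum(
        1 for d in datasets for key in DCAT_TO_CKAN.values()
        if is_complete(d.get(key))
    )
    return completed / possible


# Invented sample: the first record completes 5 of 5 fields, the second 2 of 5.
datasets = [
    {"title": "Bus stops", "notes": "Locations of stops",
     "contact_email": "data@example.org", "tags": [{"name": "transport"}],
     "organization": {"name": "city-council"}},
    {"title": "Bus routes", "notes": "Route geometry",
     "contact_email": "", "tags": [], "organization": None},
]
print(dcat_completeness(datasets))
```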

Table 2-5: Promote standards metric table
Promote standards: “adopting standards is important to ensure interoperability” (Simperl and Walker, 2019, p13):
• The maximum possible score for each metric is ‘1’ i.e. 100% of datasets;
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation; Notation ¹; Source

a. Permanent, patterned, discoverable URI
Natural language description: A permanent, patterned and/or discoverable URI is used for each dataset
Calculation: sum datasets with "URI" metadata field complete / sum datasets
Notation: $\frac{\sum_{\mathit{uri}=1}^{N} \mathit{uri}}{N}$
uri: Datasets with completed CKAN ‘uri’ metadata field
Source: … 2000); Berners-Lee (2006)

b. Versioning
Natural language description: The portal uses versioning of datasets ²
Calculation: sum datasets with completed CKAN ‘version’ metadata field / sum datasets
Notation: $\frac{\sum_{\mathit{version}=1}^{N} \mathit{version}}{N}$
version: Datasets with completed CKAN ‘version’ metadata field
Source: Secondary validation: Berners-Lee (2006); Coelpart et al. (2013)

c. Dates available in standard format
Natural language description: Dates are available in a standard format
Calculation: Requires within-dataset analysis, which is beyond project scope

d. Metadata available in standard format
Natural language description: Metadata associated with each dataset is available in a standard format
Calculation: Addressed by category 4

e. Metadata catalogue retrievable using standard protocol
Natural language description: The metadata catalogue can be retrieved using a standard protocol
Calculation: The metric is implicit in the analysis

¹ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
² It is possible that portals include ad-hoc versioning information, located outside of the CKAN ‘version’ metadata field, which is not accounted for by this study.

Table 2-6: Co-locate documentation metric table
Co-locate documentation: “[co-locating documentation ensures that] users do not need to be domain experts in order to understand the data” (Simperl and Walker, 2019, p14):
• The maximum possible score for metric 6c is ‘1’ i.e. 100% of datasets;
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation; Notation ¹; Source

a. None
Natural language description: Supporting documentation does not exist
Calculation: This metric is addressed by metric 6c

b. Found separately
Natural language description: Supporting documentation exists but as a document which has to be found separately from the data

c. Co-located
Natural language description: Supporting documentation is found at the same time as the data
Calculation: sum datasets with completed CKAN "notes" and "provenance" metadata fields / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{notes}_i + \mathit{provenance}_i)}{N}$
notes: Datasets with completed CKAN ‘notes’ metadata field
provenance: Datasets with completed CKAN ‘provenance’ metadata field
datasets: Total portal datasets
Source: … 2000); Berners-Lee (2006); Coelpart et al. (2013)

d. Linked to dataset
Natural language description: Supporting documentation can be immediately accessed from within the dataset but it is not context sensitive

e. Linked to specific dataset points
Natural language description: Supporting documentation can be immediately accessed from within the dataset and it is context sensitive

¹ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).

Table 2-7: Linked data metric table
Linked data: “linking datasets [allows for] …cross-referencing and analysis of multiple datasets to point to previous versions of the same dataset, external datasets not hosted on the portal or recommendations based on content or user features” (Simperl and Walker, 2019, p15):
• The metrics here are interdependent; each is represented as a proportion of the total number of datasets;
• The maximum possible score for each metric is ‘1’ i.e. 100% of datasets; it does not make sense to average this score.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation ¹,²; Notation ³; Source

a. On the web
Natural language description: Make your stuff available on the Web (whatever format) under an open licence
Calculation: sum open licence datasets classified as ‘on the web’ / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{open}_i + \mathit{web}_i)}{N}$
open: Datasets with open licence specified in CKAN ‘licence’ metadata field
web: Datasets with format classified as ‘on the web’
Source: Secondary validation: Berners-Lee (2006)

b. Machine readable
Natural language description: Make it available as structured data (e.g. Excel instead of image scan of a table)
Calculation: sum open licence datasets classified as ‘machine readable’ / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{open}_i + \mathit{mach}_i)}{N}$
mach: Datasets with format classified as ‘machine readable’

c. Non-proprietary format
Natural language description: Make it available in a non-proprietary open format (e.g. CSV instead of Excel)
Calculation: sum open licence datasets classified as ‘non-proprietary’ / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{open}_i + \mathit{nonprop}_i)}{N}$
nonprop: Datasets with format classified as ‘non-proprietary’

d. RDF standards
Natural language description: Use URIs to denote things, so that people can point at your stuff
Calculation: sum open licence datasets classified as ‘RDF standard’ / sum datasets
Notation: $\frac{\sum_{i=1}^{N} (\mathit{open}_i + \mathit{rdf}_i)}{N}$
rdf: Datasets with format classified as ‘RDF standard’

e. Linked RDF
Natural language description: Link your data to other data to provide context
Calculation: This metric is addressed by metric 7d

¹ The look-up table used to classify data file formats is supplied as supporting material.
² Creative Commons or other licences explicitly stated as ‘open’ in the metadata were considered to constitute an open licence; datasets where the licence format was unknown, explicitly closed or missing were considered to be ‘closed’.
³ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
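The classification step behind Table 2-7 can be illustrated with a small Python sketch. The format look-up table and the set of open licence identifiers below are hypothetical stand-ins for the supporting material mentioned in footnotes 1 and 2, and each sample dataset is simplified to a single format. A dataset counts towards a metric only when it has an open licence and its format falls in the corresponding class.

```python
from typing import Dict, List

# Hypothetical look-up table: file format -> classes the format satisfies.
FORMAT_CLASSES = {
    "PDF": {"web"},
    "XLS": {"web", "mach"},
    "CSV": {"web", "mach", "nonprop"},
    "RDF": {"web", "mach", "nonprop", "rdf"},
}

# Assumed identifiers treated as open licences; anything else counts as closed.
OPEN_LICENCES = {"cc-by", "cc-zero"}


def linked_data_scores(datasets: List[Dict]) -> Dict[str, float]:
    """Proportion of all datasets that are openly licensed and in each class."""
    total = len(datasets)
    scores = {}
    for level in ("web", "mach", "nonprop", "rdf"):
        hits = sum(
            1 for d in datasets
            if d.get("licence") in OPEN_LICENCES
            and level in FORMAT_CLASSES.get(str(d.get("format", "")).upper(), set())
        )
        scores[level] = hits / total if total else 0.0
    return scores


# Invented sample portal of four datasets.
datasets = [
    {"licence": "cc-by", "format": "csv"},
    {"licence": "cc-by", "format": "pdf"},
    {"licence": None, "format": "rdf"},      # missing licence counts as closed
    {"licence": "cc-zero", "format": "xls"},
]
print(linked_data_scores(datasets))
```

Because each class in this look-up contains the one before it, the scores can only fall from ‘on the web’ to ‘RDF standard’, which is one way of reflecting the interdependence noted above the table.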

Table 2-8: Be measurable metric table
Be measurable: “open data portals should be measurable to assess how well they are meeting users’ needs” (Simperl and Walker, 2019, p16):
• The metrics here are interdependent; each is represented as a proportion of the total number of webpages;
• The maximum possible score for each metric is ‘1’ i.e. 100% of webpages; it does not make sense to average this score.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation ¹,²; Notation ³; Source

a. No analytics
Natural language description: Portal has no analytics
Calculation: sum webpages without web analytics tracking code installed / sum webpages
Notation: $\frac{\sum_{\mathit{noanalytics}=1}^{n} \mathit{noanalytics}}{\sum_{\mathit{pages}=1}^{n} \mathit{pages}}$
noanalytics: Webpages without analytics tracking code installed
pages: Webpages crawled
Source: Secondary validation: NA

b. Site analytics
Natural language description: Portal has site analytics
Calculation: sum webpages with web analytics tracking code installed / sum webpages
Notation: $\frac{\sum_{\mathit{analytics}=1}^{n} \mathit{analytics}}{\sum_{\mathit{pages}=1}^{n} \mathit{pages}}$
analytics: Webpages with analytics tracking code installed

c. Use analytics
Natural language description: Portal has use analytics
Calculation: This metric could not be translated to computational form and is discussed in the limitations section

d. Impact analytics
Natural language description: Portal has impact analytics
Calculation: This metric could not be translated to computational form and is discussed in the limitations section

¹ The web crawler was configured to identify web analytics tracking code from Google Analytics (2019a; analytics.js and gtag.js versions) and Matomo (2019a; formerly Piwik).
² The web crawler was configured for each portal to run to the dataset level page depth (in most cases three sub-folders deep).
³ $N$ denotes population level calculation (i.e. $N$ = all datasets); $n$ denotes sample or subset calculation (i.e. $n$ = sample or subset of webpages).
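The two calculable metrics in Table 2-8 are complements: the share of crawled pages with tracking code and the share without. The Python sketch below illustrates this on raw HTML strings. It is not the Screaming Frog configuration used in the study; the marker strings searched for and the sample pages are assumptions for illustration.

```python
from typing import Dict, List

# Assumed markers for the tracking libraries named in footnote 1.
TRACKING_SNIPPETS = (
    "google-analytics.com/analytics.js",
    "googletagmanager.com/gtag/js",
    "matomo.js",
    "piwik.js",
)


def has_analytics(html: str) -> bool:
    """True when any assumed tracking marker appears in the page source."""
    lowered = html.lower()
    return any(snippet in lowered for snippet in TRACKING_SNIPPETS)


def analytics_scores(pages: List[str]) -> Dict[str, float]:
    """Metrics 8a and 8b as proportions of the pages crawled."""
    if not pages:
        return {"analytics": 0.0, "noanalytics": 0.0}
    share = sum(has_analytics(p) for p in pages) / len(pages)
    return {"analytics": share, "noanalytics": 1 - share}


# Invented page sources standing in for crawled HTML.
pages = [
    '<script async src="https://www.google-analytics.com/analytics.js"></script>',
    '<script async src="https://www.googletagmanager.com/gtag/js?id=UA-0"></script>',
    '<script src="/js/matomo.js"></script>',
    "<html><body>No tracking here</body></html>",
]
print(analytics_scores(pages))
```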

Table 2-9: Co-locate tools metric table
Co-locate tools: “co-locate tools so that a wider range of users and re-users can be engaged with the datasets of an open data portal” (Simperl and Walker, 2019, p18):
• No metrics are calculated for this category.
Columns: Metric (Simperl and Walker, 2019); Natural language description (Simperl and Walker, 2019); Calculation; Notation; Source

a. No tools
Natural language description: The portal does not provide visualisation or collaboration tools for users to engage with the datasets
Calculation: This category could not be translated to computational form and is discussed in the limitations section

b. Visualisation tools
Natural language description: The portal provides visualisation tools to enable users to engage with the datasets

c. Visualisation and collaboration tools (moderated)
Natural language description: The portal provides visualisation and collaboration tools to enable users to participate in the governance of the portal (e.g. dataset rating) but the engagement with other users is limited or mediated by the administrator

d. Visualisation and collaboration tools (unmoderated)
Natural language description: The portal provides visualisation and collaboration tools to enable users to collaborate innovatively with other users
