Open Government Data on the Web - A Semantic Approach


(upload with permission from Armand Brahaj)

Initiatives to open up governmental data have been gaining steady interest. While this brings immense benefits for transparency, the data are frequently offered in heterogeneous formats and lack clear semantics that clarify what they describe. They are also displayed in ways that are not always understandable to the broad range of user communities that need to make informed decisions.

Published in: Technology, Education

Open Governmental Data on the Web: A Semantic Approach

Julia Hoxha
Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology, Germany

Armand Brahaj
FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Germany

Abstract: Initiatives to make governmental data open have been gaining steady interest. While this brings immense benefits for transparency, the data are frequently offered in heterogeneous formats and lack clear semantics that clarify what they describe. The data are also displayed in ways that are not always understandable to the broad range of user communities that need to make informed decisions.
We address these problems and propose an overall approach in which raw statistical data, independently gathered from the different government institutions, are formally and semantically represented, based on an ontology that we present in this paper. We further introduce the approach we deploy in publishing these data in alignment with Linked Data principles, and present the methods implemented to query single or combined datasets and visualize the results in understandable ways. The introduced approach enables data integration, leading to vast opportunities for information exchange, analysis on combined datasets, simple creation of mashups, and exploration of innovative ways to use these data creatively.

Keywords: linked open data, open governmental data, ontology, semantic web, statistical metadata

I. INTRODUCTION

Initiatives to make governmental data open have been gaining great interest recently, with more and more countries embracing the open data paradigm. Open Data is a new form of representation of traditional public information, which has been the basis of decision making in political, economic and sociological activities for centuries.
The term open was introduced in the late 2000s to generalize the accessibility of information that can be public and easily accessible by anyone. The concept of information is also generalized to the concept of data, referring to raw sets of valuable information that can be structured, analyzed and presented in different forms, leading to knowledge representation.
From a technical perspective, we refer to Open Data as any sets of data that can be reused without restrictions from any form of licensing or patents, and that are well structured and can be easily accessed and reused by institutions, scientists or the web community.
Driven by the benefits of this paradigm, we aim to increase the openness of data offered by different governmental institutions and, at the same time, provide means of representing these data in more understandable ways. There are several challenges underlying our work, starting from the way that data are originally provided by the sources. Data are mostly offered in heterogeneous formats, lack clear semantics that clarify what they describe, and are displayed in ways that are not clearly understandable to the broad range of user communities that need to make informed decisions. These kinds of representation impede the integration of data coming from different sources, consequently limiting the opportunities for information exchange, analysis on combined datasets, simple creation of mashups, and exploration of innovative ways to use these data creatively.
From a technical perspective, our work consists in applying Semantic Web technologies to enable data integration among the different organizations and to establish links that interconnect data on the Web. Since a crucial issue when semantically modeling data is the preservation of their meaning, we present an approach that deploys Linked Open Data [] principles and introduces an ontology as the foundation for the semantic representation of government statistical data. On the one hand, the ontology offers formal, explicit descriptions of knowledge, enabling in this way a platform for integration where data can still be interpreted correctly for valid research outcomes. On the other hand, Linked Data provides a technical basis for publishing, sharing and linking data on the web, based on standardized formats and interfaces.
In this paper, we present the ontology that enables a formal, semantic representation of government statistical data. We further introduce the approach we deploy in publishing them
as linked data, and the methods implemented to query single or combined datasets and visualize the results. The semantic approach and the ontology are described in detail in Section 2, whereas Section 3 gives an overview of Linked Open Data and our work in alignment with those principles. Section 4 presents the technical implementation of our approach, focusing on the integration and processing of data from different sources and on the querying and visualization techniques for representing multiple datasets in a graph. We give an overview of related work in the field of open data and statistical vocabularies in Section 5, concluding in Section 6 with a summary of our work.

II. SEMANTIC APPROACH

In our work, we are dealing with a large set of sources from which statistical data is gathered. These sources are mainly websites of different government institutions, which offer data online in unstructured or semi-structured formats such as text documents, Excel files or XML files. Very few sources can provide data structured in an Entity-Relationship model.
Our goal, after gathering the raw data independently from the different sources, is to convert them to a representation based on a unified structure and to provide semantics for each of the data entries. For this purpose, we have chosen to semantically represent the data based on an ontology that we have built, and then to serialize the data as RDF triples following Linked Data principles. In this section, we present the modeled ontology and introduce our approach to publishing Linked Open Data.

Fig. 1. Overall semantic approach for publishing open governmental data

In our approach, the knowledge base consists of the Ontology Model and a storage for RDF files. The raw statistical data are converted to RDF (Resource Description Framework) using a converter in alignment with Linked Data principles, which we present in Section 3.
Another important component in our approach is the Query Interface, which enables the user community, as well as the source institutions that offer these statistical data, to access the knowledge base and pose queries upon it. This component consists of an online graphical interface and a SPARQL endpoint, both of which we present in Section 4. The results of a query may be delivered to the users as structured Excel and RDF files, but they are additionally visualized in ways that make these data much more tangible and understandable. This approach provides open, structured and linked data to the different institutions, which are then able to use each other's datasets very easily.

A. ODA Ontology

The need to provide a unified representation for the different, heterogeneous data gathered from the various sources, as well as the necessity to provide formal semantics for these data, motivated us to choose a semantic approach using an ontology, referred to as the ODA ontology (ODA stands for Open Data Albania), which we make available online. While modeling the ontology, our goal was to follow linked data principles and reuse existing well-known ontologies as much as possible, while at the same time extending them with other important entities.
An ontology is defined as "a formal, explicit specification of a shared conceptualization" [1]. It provides a common understanding of the domain knowledge, increasing communication capabilities between people and heterogeneous applications. Using an ontology, we are able to formally describe the semantics of terms and items of statistical datasets and give explicit meaning to the provided information. Besides sharing a common understanding, another direct benefit of formal semantics is the support for machine-readability, which in turn enables automated reasoning, information integration and the application of intelligent approaches such as semantic search.
The objective of the modeled ontology is to provide a schema (also referred to as a vocabulary) for the semantic representation of statistical data which, on the one hand, is understandable and simple enough to be used by the different institutions or organizations that offer statistical datasets and, on the other hand, is consistent and complete enough to model all the aspects and entries in these datasets.
A dataset is created as a container for the data entries found in a particular statistical file. Each dataset contains information about the date it was created, its creator and publisher, and a title. The dataset contains data entries, which represent the central component modeled in our ontology. A data entry has a fixed value and, among other possible dimensions, is measured for a particular year and a specific country. In a dataset, the data entries serve to measure particular indicators that monitor, for example, the socio-economic development of a country, demographic changes, public health, developments in education, and many other aspects.
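As a minimal sketch of the dataset and data entry model described above, the following JavaScript lists one dataset and one data entry as subject-predicate-object triples. The property names are the ones that the ODA ontology and its reused vocabularies define later in this section; the `ex:` identifiers, the helper function names, and the pairing of the value 3.498 with a particular country and year are hypothetical placeholders chosen only for illustration.

```javascript
// Hypothetical helper: describes a dataset (the container) as triples.
// The "ex:" identifiers are placeholders, not real ODA URIs.
function datasetTriples(id, meta) {
  return [
    [id, 'rdf:type', 'oda:Dataset'],
    [id, 'dc:title', meta.title],
    [id, 'dc:publisher', meta.publisher],
    [id, 'dc:creator', meta.creator],
    [id, 'dc:date', meta.date],
  ];
}

// Hypothetical helper: describes one data entry with its value and dimensions.
function dataEntryTriples(id, entry) {
  return [
    [id, 'rdf:type', 'oda:DataEntry'],
    [id, 'oda:dataset', entry.dataset],     // the container dataset
    [id, 'oda:indicator', entry.indicator], // what is being measured
    [id, 'oda:country', entry.country],     // spatial dimension
    [id, 'oda:year', entry.year],           // time dimension
    [id, 'sdmx-measure:obsValue', entry.value],
  ];
}

const datasetId = 'ex:yearlyinflationrate';
const dataset = datasetTriples(datasetId, {
  title: 'Inflation Rate per year and country',
  publisher: 'International Monetary Fund (IMF)',
  creator: 'Open Data Albania (ODA)',
  date: '2011-03-22T10:56:40+0000',
});

// Country and year are assumed here; the source excerpt does not state them.
const entry = dataEntryTriples('_:A0', {
  dataset: datasetId,
  indicator: 'ex:YearlyInflationRate',
  country: 'ex:Albania',
  year: 'ex:2010',
  value: 3.498,
});

for (const [s, p, o] of [...dataset, ...entry]) {
  console.log(s, p, o);
}
```

Keeping the entry's dimensions as separate triples is what lets entries from different datasets be combined later by matching on the same country or year resource.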
These numerous indicators, which are also organized into groups among each other, are additionally assigned particular topics. Each indicator is used to estimate and monitor the progress made in a particular aspect. In addition, a more complete view is provided by looking at the progress of all indicators within a topic.
All these concepts are modeled in the ontology, which is illustrated in Figure 2. It is encoded using the DL sublanguage of the Web Ontology Language (OWL), which goes beyond other standards in its power to represent machine-interpretable content on the Web. Moreover, OWL DL itself offers high expressiveness.

Fig. 2. ODA Ontology

We have modeled the ontology using Protégé, a platform that offers support for ontology creation, visualization and manipulation. We explain below the main entities that compose the modeled ODA ontology.

Classes. The concepts and terms of the domain of knowledge are represented in an ontology as classes. Our ontology consists of the following main classes:

Class: oda:Dataset
Dataset: a statistical dataset; represents the container of a set of data
Superclass: skos:Concept

Class: oda:DataEntry
DataEntry: a data entry of a dataset; represents a single piece of data
Superclass: event:Event

Class: oda:Indicator
Indicator: a statistical indicator that is being measured by the entries in the dataset

Class: oda:Topic
Topic: a theme to which the indicator and dataset belong

Class: oda:Dimension
Dimension: a dimension of a statistical data entry

Class: oda:Year
Year: time dimension; represents the year to which a statistical data entry belongs
Superclass: oda:Dimension

Class: oda:Country
Country: spatial dimension; represents the country to which a statistical data entry belongs
Superclass: oda:Dimension

Properties. The relations among the different individuals of the ontology classes are defined using the properties construct, which distinguishes between two main types: Object properties and Datatype properties. Object properties link individuals to individuals, whereas Datatype properties link individuals to data values. These are a few of the main Object properties modeled in our ontology:

Property: oda:dataset
dataset: relates a statistical data entry to a dataset
Domain: oda:DataEntry
Range: oda:Dataset

Property: oda:indicator
indicator: relates a statistical data entry to an indicator
Domain: oda:DataEntry
Range: oda:Indicator

Property: oda:subindicator
subindicator: relates a parent indicator to other indicators classified as its subcomponents (serves for grouping indicators)
Domain: oda:Indicator
Range: oda:Indicator
Property: oda:dimension
dimension: relates a statistical data entry to a dimension
Domain: oda:DataEntry
Range: oda:Dimension

We have also defined a set of Datatype properties, e.g. the oda:indicatorname and oda:topicname properties, which assign a literal value (in this case a string) to the individuals of the classes Indicator and Topic, respectively.

Individuals. The classes of an ontology are instantiated by specifying the concrete objects, or individuals, that belong to a particular class. We have created a large number of topic instances such as Social Development, Economic Development, Social Inclusion, Public Health, Education, etc. We have related these topics to indicators, such that a topic may have multiple indicators assigned to it. In order to formalize the topics consistently, we have also consulted the way statistical data are organized into particular themes by Eurostat, the statistical office of the European Union.
The indicators are modeled after a thorough analysis of the statistical data offered by the different governmental institutions. They are organized in groups using the oda:subindicator relation, forming in this way a hierarchy of parent-child indicators.
We have currently created instances of more than 100 indicators in our ontology. To give an insight into what these indicators consist of, we mention below a few of them belonging to the topic of Economic Development:
  • GDP Indicators: GDP nominal per capita, GDP nominal growth rate, GDP per capita in PPS (for EU states)
  • Budget Indicators: Expenditures (Total Expenditures, Capital Expenditures, Current Expenditures, Domestic/Foreign Financing Expenditures), Revenue, Deficit (Deficit Growth Rate, Deficit Financing), etc.
  • Inflation Indicators: monthly Inflation, yearly Inflation rate.

Aiming to rigorously follow the Linked Data principles, explained in more detail in Section 3, one of our goals while modeling the ontology was to reuse existing vocabularies and ontologies. The following entities are used from the ontologies that we have imported:

Class: event:Event
Event: an arbitrary classification of a space/time region by a cognitive agent. An event may have actively participating agents, passive factors, products, and a location in space/time.
Ontology: The Event Ontology

Class: skos:Concept
Ontology: Simple Knowledge Organization System (SKOS)

Property: sdmx-measure:obsValue
obsValue: the value of a particular variable at a particular period
Ontology: Statistical Data and Metadata Exchange (SDMX) vocabulary

Property: dc:title, dc:publisher, dc:creator, dc:date
Ontology: Dublin Core Metadata Element Set

III. LINKING OPEN GOVERNMENTAL DATA

Making data freely available to everyone on the Web is further empowered when there are well-formed links established among the items of the different data sources. Making data open and linked is the goal of the Linked Open Data paradigm [2], which promotes ways of providing data that are freely accessible, verifiable, and connected to other data sources.
There are four basic principles expected to be fulfilled when publishing linked data, in order to make these data interconnected [3]:
  • Use Uniform Resource Identifiers (URIs) as names to identify things
  • Use HTTP URIs so that people can refer to and look up ("dereference") those names
  • When someone looks up a URI, provide useful information using standards such as RDF and SPARQL
  • Include links to other URIs, so that more things and more related information can be discovered on the Web

From a technical perspective [5], the objective is to use common standards and techniques to extend the Web by publishing data as RDF (a graph data format for representing information on the Web), creating well-formed RDF links between the data items, and performing search on the data via standardized languages such as the SPARQL query language for RDF.
Recognizing the many benefits of Linked Open Data, we have followed an approach that is in alignment with these principles, and we illustrate it with the example in Figure 3. The figure shows an excerpt of the RDF/XML representation of the information contained in a dataset on the yearly inflation rates of different countries in Europe.

    <rdf:Description rdf:about="">
      <rdf:type rdf:resource=""/>
      <dc:title>Inflation Rate per year and country</dc:title>
      <dc:publisher>International Monetary Fund (IMF)</dc:publisher>
      <dc:creator>Open Data Albania (ODA)</dc:creator>
      <dc:date>2011-03-22T10:56:40+0000</dc:date>
    </rdf:Description>
    <rdf:Description rdf:nodeID="A0">
      <oda:dataset rdf:resource="/yearlyinflationrate#dataset"/>
      <oda:indicator rdf:resource=""/>
      <oda:country rdf:resource=""/>
      <oda:year rdf:resource=""/>
      <sdmx-measure:obsValue rdf:datatype="">
        3.498
      </sdmx-measure:obsValue>
      <rdf:type rdf:resource=""/>
    </rdf:Description>

Fig. 3. Excerpt of a Dataset in RDF/XML Representation

Based on the aforementioned principles, well-formed URI references have been used in our approach to identify resources such as dataset, indicator, country, year, etc. We have invested in a technical infrastructure that makes the URIs dereferenceable, so that accessing the URI of an identified resource retrieves an RDF/XML description and various classifications of that resource. RDF links to resources from other data sources are established, enabling visitors and user agents to navigate the Web of data. Examples include direct links to DBpedia for resources such as country and year, or references via the rdfs:seeAlso property for the indicators when such a link exists; e.g. the indicator "GDP Nominal per Capita" would have a link to the respective resource in DBpedia.
Reusability of existing vocabularies is also a requirement, since it tends to facilitate data processing by different client applications. Accordingly, terms from well-known ontologies have been reused, but we have complemented these vocabularies with additional concepts for representing the statistical data. Particularly important in our approach are the concept and taxonomy of indicators and topics, which are additionally modeled in the ontology and defined in our own namespace. Common constructs for providing mappings to other ontologies are rdfs:subClassOf and rdfs:subPropertyOf. As already shown in the ontology presented above, we have modeled a set of terms as subclasses of concepts (and likewise for properties) belonging to other vocabularies.
Linked data on the web is primarily intended to be processed by machines, but there is an increasing interest from human visitors in accessing and navigating through the data. Therefore, we have aimed for an approach that is convenient for both parties. On the one hand, human readability of the data is facilitated by providing textual information (possibly in different languages) in the form of comments and labels, using the properties rdfs:comment and rdfs:label, respectively. On the other hand, our objective is to ease machine interpretability by stating important information explicitly, such as the specification of domain, range and cardinality in our ontology. We provide information in a way that avoids inconsistencies in data representation, but at the same time allows the schema to be flexibly reused by other parties for data modeling.

IV. TECHNICAL IMPLEMENTATION

We aim to give all citizens, technically inclined or not, the means to discover structured data and reuse them. To serve up these datasets, the ODA website accesses a catalog of records and examples of visual representations of datasets generated by visitors or editors of the website.
A visualization service could be delivered through the site and could include analytics, graphics, charting, and other ways of using the data. The enhanced visualizations are built on top of published APIs in collaboration with third-party open source applications (Fig. 4).

Fig. 4. Data layers and tools implemented

Our work consists of two stages. The first stage is the aggregation of the data and the processing of linked data structures. The second stage consists in providing a set of tools that can make use of these data. As an example of the technology that can be used, a set of visual representations is generated. The visual representations are published together with a comment providing some additional information on the dataset retrieved. The operations from the aggregation of data up to the publishing of examples go through the following main steps.

A. Aggregation of the information

Data is aggregated from periodical governmental agency reports. These reports are provided most of the time as PDF files and sometimes in spreadsheet format. All the information is stored in an offline repository in order to archive the originals.

B. Data Processing

The information gathered in the aggregation step is semi-automatically processed into a well-organized Linked Data structure. In order to cover all the ontologies in the system, a set of spreadsheet templates is generated. The information is imported into the spreadsheets and then automatically transferred into an RDF structure through a tool referred to as ODA Import to RDF.
The RDF datasets are stored in a CKAN repository, which is made public and can be accessed via the CKAN web interface and the CKAN API.
Besides the CKAN interface, the datasets are also stored in the ODA repository, which is directly connected to an API.
The Open Data API is a RESTful, service-oriented platform
that allows developers to easily access datasets and create independent services through these calls. REST uses the HTTP protocol and, as such, requests use the common URL format.
The API provides simple methods that developers can use to tap into the functionality and rich datasets, and to gather information, in JSON or XML format, related to different indicators and topics.
All requests to the ODA API should consist of a host name, a path that contains an object and an action (e.g. "/indicator/info" or "/indicator/list"), and a query string that contains the method parameters (e.g. "apiKey", "term", "indicatorId", etc.). A request URL is therefore composed of these three parts: host name, path, and query string.
Another way of accessing the information from the RDF datasets is through a SPARQL endpoint. The SPARQL endpoint is a protocol service as defined in the SPROT (SPARQL Protocol for RDF) specification. The SPARQL endpoint enables users to query a knowledge base via the SPARQL language. The common result from a SPARQL endpoint is an XML-encoded SPARQL result set. In order to work with third-party visualization tools, we generate the result sets in JSON format by default.

C. Graphical Visualization

One of the values of making data available is that it enables anyone to construct visualizations over the data. This is particularly useful for public sector information, since it can provide feedback on how effective particular policies have been.
It is important that the visual graphs are generated on the fly, as soon as result sets are retrieved from the SPARQL endpoint. For this reason, a set of open source visualization APIs was evaluated, and we chose to work with the Google Visualization API.
In order to generate graphs on the fly, the following elements were needed:
  • An RDF document: in order to present data through Google Visualization, an RDF document is needed as a source.
  • A SPARQL query endpoint: a SPARQL query endpoint is needed to process the queries and return the results.
  • Google Visualization: the Google Visualization API relies heavily on the Data Table, a table arranged in typed columns. An instance of this class is the result of a request to a data source. A Data Table can be rendered in many ways: JSON, HTML, CSV.

In order to feed the Visualization API, code similar to the following should be generated by the application.

    // Build a Data Table with one typed column per query variable.
    var data = new google.visualization.DataTable();
    data.addColumn('string', 'Country');
    data.addColumn('number', 'Year');
    data.addColumn('number', 'Value');
    // One row per country and year.
    data.addRows([
      ["Albania", 2010, 25],
      ["Albania", 2011, 25],
      ["Croatia", 2010, 37],
      ["Croatia", 2011, 37],
      ["Greece", 2010, 39],
      ["Greece", 2011, 42]
    ]);

The result provided by the SPARQL endpoint is modified by a PHP script to generate the Data Table required by the Visualization API (see Fig. 5).

Fig. 5. Motion Chart generated through SPARQL query

D. Community contributions

In order to provide a clear example of how the system will be used, a set of starting articles based on datasets and generated graphs has been created. An article consists of one or more graphs with a respective description.
The graphic representation of the data helps the audience to draw concrete and logical conclusions on the progress or regress of certain indicators that influence socio-economic activity in the domain that the dataset covers. These articles are examples that the community may follow to contribute more content, which can be generated by any visitor based on various self-generated graph representations.

V. RELATED WORK

Initiatives to make government data open have been gaining a lot of interest recently. While more countries are embracing the Open Government paradigm, there is an increasing awareness among the researchers working with those data of the use of semantic techniques to represent them. Two of the key players that promote the openness of government data combined with the deployment of Semantic Web technologies are the US government [7] and the UK government [6].
In the respective portals that present the outcomes of their work, numerous data catalogs and their metadata are published, with many of the datasets offered as RDF triples. One of these portals has made a number of its datasets available as Linked Data, altogether summing up to 6.5 billion triples. The goal of these projects is to improve access to the data generated by the government and to increase the ability of people to find the desired data, so that they can be further used in various creative ways. In order to access the data, they offer tools to search by keyword, category and agency.
Detailed statistics on the EU and candidate countries are provided by Eurostat [9], which continuously publishes statistical data in tabular formats, but not in RDF. Unlike other portals, its website provides a clear structure for organizing the datasets by themes and by their respective indicators, but this structure is HTML-based and not semantically described.
Similar to these projects, our approach provides ways to easily access government data. Unlike them, it concentrates on promoting a well-defined, unified schema for the semantic representation of data, in order to increase interoperability and integration among the different providers. A novel aspect of our work consists in providing a semantic representation of the different indicators that can be measured using the datasets, as well as of the topics to which they belong. This offers a complete picture of any specific area, making it easier to monitor progress, make estimations or take a comparative approach.
From the semantic web perspective, there exists a limited number of vocabularies for representing statistical data, such as SCOVO [4] and the RDF Data Cube Vocabulary (footnote 18). While SCOVO does not provide semantics about the items that the statistical data describe, the Data Cube vocabulary is more extended and allows the modeling of complex data structures and sets, as well as ways to link data with other sources. An approach that uses the Data Cube vocabulary to model statistical data and publish them as Linked Data is the Eurostat wrapper [8]. Another approach uses this vocabulary to represent statistical data from the social sciences, focusing on the capabilities of performing statistical calculations and analysis on linked data from heterogeneous and distributed datasets.
Our work encompasses more aspects, such as mining of the statistical data, converting and publishing them as linked data, data aggregation and integration, as well as search and visualization modules, yet the underlying semantic approach aims at providing semantics for these data based on an ontology, similar to the Data Cube vocabulary. While this vocabulary presents a core foundation for describing data, our ontology is designed to accommodate extended semantics of government-oriented statistical data. Furthermore, we tend to reuse components of various vocabularies, adhering in this way to linked data principles.

VI. CONCLUSION

Nowadays, Open Data projects receive a lot of attention in many countries and cities all over the world. It is a common understanding today that Open Data is simply data on the web, whereas Linked Data is a web of data.
In this paper we address the problem of data provided in heterogeneous formats, omitting semantics that clarify what the data describe. Our approach is based on an ontology providing a semantic description of all the data, which are further published in alignment with Linked Data principles.
The paper also offers an approach for the provision of visualization services derived from RDF result sets. This is achieved by querying the RDF data with SPARQL and passing the result to an open-source visualization API.
The ontology created for the ODA and the supporting tools built on top of it describe a method of publishing structured data, so that it can be interlinked and become more useful to the community.

REFERENCES

[1] Studer, R., Benjamins, R.V., Fensel, D.: Knowledge engineering: Principles and methods. Data and Knowledge Engineering, 25, 161-197 (1998).
[2] Bizer, C., Heath, T., and Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems (IJSWIS), Vol. 5(3), Pages 1-22. DOI: 10.4018/jswis.2009081901.
[3] Berners-Lee, T. (2006). Linked Data - Design Issues. Retrieved 2010-10-25.
[4] Hausenblas, M., Halb, W., Raimond, Y., Feigenbaum, L., and Ayers, D. (2009). SCOVO: Using Statistics on the Web of Data. Proceedings of the 6th European Semantic Web Conference on the Semantic Web: Research and Applications (Heraklion, Crete, Greece, 2009). Pages 708-722.
[5] Bizer, C., Cyganiak, R., and Heath, T. (2007). How to publish Linked Data on the Web. Retrieved 2010-10-25 from http://www4.wiwiss.fu-
[6] J. Sheridan and J. Tennison. Linking UK government data. Linked Data on the Web (LDOW2010) Workshop at WWW2010, April 2010.
[7] L. Ding, D. DiFranzo, A. Graves, J. R. Michaelis, X. Li, D. L. McGuinness, and J. Hendler. Data-gov wiki: Towards linking government data. In AAAI Spring Symposium on Linked Data Meets AI, 2010.
[8] Zapilko - Enriching and Analysing Statistics with Linked Open Data
[9] Eurostat Wrapper - the experimental Eurostat wrapper to RDF

Footnote 18: http://publishing-statistical-