NANYANG TECHNOLOGICAL UNIVERSITY
Wee Kim Wee School of Communication & Information
Linked Data Migrational Framework for Singapore Government Data Sets
Technical Report
Prepared under the guidance of
Dr. Khoo Soo Guan, Christopher (Assoc Prof)
Submitted by
SESAGIRI RAAMKUMAR ARAVIND (aravind002@e.ntu.edu.sg)
THANGAVELU MUTHU KUMAAR (muthu1100@gmail.com)
KALEESWARAN SUDARSAN (sudarsan.srcm@gmail.com)
Table of Contents

1.0 ABSTRACT
2.0 INTRODUCTION
3.0 METHODOLOGY
   3.1 Analysis
      3.1.1 Artifacts Analyzed
      3.1.2 Source Data Forms
      3.1.3 Existing API Services for Data Consumption
      3.1.4 Issues Identified with the Existing DGS System
   3.2 Design
   3.3 Evaluation
4.0 MIGRATIONAL FRAMEWORK
   4.1 STEP 1 - Specification
      4.1.1 SUB STEP 1 - Identification Activities at Program Level
      4.1.2 SUB STEP 2 - Analysis Activities at Project Level
      4.1.3 Identified Issues
      4.1.4 Recommended Tools
   4.2 STEP 2 - Object Modeling
      4.2.1 Identified Issues
      4.2.2 Recommended Tools
   4.3 STEP 3 - Ontology Modeling
      4.3.1 Issues
      4.3.2 Recommended Tools
   4.4 STEP 4 - URI Naming Strategy
      4.4.1 Selection of URI Administration Modes
      4.4.2 URI Taxonomy
      4.4.3 Identified Issues
      4.4.4 Recommended Tools
   4.5 STEP 5 - RDF Creation
      4.5.1 Static Files Conversion (S2R)
      4.5.2 Relational Database Conversion (D2R)
      4.5.3 Application Programming Interface Conversion (A2R)
      4.5.4 Identified Issues
      4.5.5 Recommended Tools
   4.6 STEP 6 - External Linking
      4.6.1 Identified Issues
      4.6.2 Recommended Tools
   4.7 STEP 7 - Datasets Publication
      4.7.1 Triple Store
      4.7.2 Metadata Publication
      4.7.3 SPARQL and Linked Data API
      4.7.4 Linked Data External Hosting
      4.7.5 Identified Issues
      4.7.6 Recommended Tools
   4.8 STEP 8 - Exploitation and Discovery
      4.8.1 Identified Issues
      4.8.2 Recommended Tools
5.0 Conclusion
6.0 References
7.0 Appendix
   7.1 Singapore Data Ecosystem – Extended Analysis
      7.1.1 Data transfer and storage in data.gov.sg (DGS) system
      7.1.2 Data upload and download specifications
      7.1.3 Destination data forms
   7.2 Linked Data Applications
   7.3 Other recommendations
      7.3.1 Kasabi for Linked Data Hosting
1.0 ABSTRACT
The subject area of this report is Linked Data and its application to the Government domain. Linked Data is an alternative
method of data representation that aims to interlink data from varied sources through relationships. Governments around
the world have started publishing their data in this format to assist citizens in making better use of public services. This
report provides an eight step migrational framework for converting Singapore Government data from legacy systems to
Linked Data format. The framework formulation is based on a study of the Singapore data ecosystem with help from
Infocomm Development Authority of Singapore. Each step in the migrational framework comprises objectives,
recommendations, best practices, entry & exit points and issues. This work builds on the existing Linked Data literature,
implementations in other countries and cookbooks provided by Linked Data researchers. iDA can use this report to gain
an understanding of the effort and work involved in the implementation of Linked Data system on top of their legacy
systems. The Appendix section of this document and two companion documents contain details about evaluation of the
recommended Linked Data tools with two Singapore data sets and a tools inventory applicable to the framework. The
framework is evaluated by building a Proof of Concept (POC) application.
2.0 INTRODUCTION
‘Linked Data’ is a standard establishing a common format for describing data sets. It allows publication in the global open data cloud1 to link with other data sets, thereby forming a part of the web of data. This study focuses on the
implementation aspects of such a data standard for Infocomm Development Authority (iDA)’s data.gov.sg (DGS) portal.
The portal publishes the governance data provided by different ministries and agencies in Singapore. The current
government data eco system has been studied and a customized Linked Data migration framework has been designed by
selectively using the conversion tools and techniques in different stages of migration. The migration steps are elaborated
by using population trends and urban authority’s sites data sets.
The cookbooks, guidelines and recommendations provided by the Linked Data research communities were helpful in
understanding the general steps and tools required in converting and publishing government data in Linked Data format.
Governments that are new entrants in adopting Linked Data need a tailored data migration framework specific to their
local data ecosystem. Existing case-studies recommend different tools for migration thereby creating an additional
overhead of tool evaluation before implementation. Therefore, it is necessary to recommend the best tool based on
evaluation in the local context.
The current study aims to build a Linked Data migrational framework that could be used by iDA and Singapore
Government agencies to publish their data sets in the form of Linked Data. The framework build process is based on the
metadata and specifications provided by iDA and government agencies. The explanation of each step in the framework is
given with the two data sets provided by the Urban Redevelopment Authority and the Department of Statistics (mentioned in Table 1). Each step in the framework comprises objectives, recommendations and issues identified by us based on
careful evaluation. The current study focuses on linking the internal data sets. Additionally, it aims to provide
recommendations on a few use-cases that leverage the utility of external Linked Data.
1 LOD cloud http://richard.cyganiak.de/2007/10/lod/
Data set | Agency | Category | Data type | Existing Applications
Resident Population by DGP Zone/Subzone and Age Group, Type of Dwelling, Ethnic Group2 | Department of Statistics | Population and Household Characteristics | Textual | Singstat Time Series (STS), charts on singstat themes
Sites Sold by URA - Details3 | Urban Redevelopment Authority (URA) | Housing and Urban Planning | Textual | OneMap view of site location
Table 1. Selected Datasets for Linked Data Conversion Study
The pilot datasets have been selected to demonstrate the capability of linked data to interlink domains (Urban planning
and population demographics) based on an application for business executives intending to purchase a new site listed
by Urban Redevelopment Authority (URA). The datasets from Urban Redevelopment Authority (URA) and Department
of Statistics (DOS) have been used to demonstrate the application. There is a possibility of missing some sources as the
analysis output has not been signed off by iDA.
3.0 METHODOLOGY
Figure 1: Research Methodology – Linked Data Migrational Framework Design for iDA
The activities in the project were classified under three phases – Analysis, Design and Evaluation.
3.1 ANALYSIS
This phase dealt with the collection of required information from the source parties (iDA, SLA and NIIT) through face-to-face meetings, followed by an investigation of the Singapore Government's current public data ecosystem.
The Singapore Government publishes its public data on the portal data.gov.sg (DGS) as a part of the eGov2015 initiative4. Infocomm Development Authority of Singapore (iDA)5 maintains the portal and collects data from different government agencies. The data portal aims to meet the Singapore public's data needs, to establish a co-creative environment and to connect efficiently with the public.
2 http://data.gov.sg/common/search.aspx?s=default&o=a&a=DOS&count=50&page=40
3 http://data.gov.sg/common/search.aspx?q=Sites+Sold+by+URA&a=URA&page=1
4 http://www.egov.gov.sg/egov-masterplans/egov-2015/vision-strategic-thrusts
5 http://www.ida.gov.sg/home/index.aspx
The key actors identified in the data.gov.sg system are the agencies providing data, data
consumers (public), Infocomm Development Authority to interface & publish the data and the application developers
exploiting the data with useful applications for the public. The governance data is currently provided by seven ministries
and 50 agencies or statutory boards.
As in most countries, Ministries in Singapore govern over agencies thereby forming a natural hierarchy. For example,
Ministry of Transport governs over Public Transport Council, Land Transport Authority, Civil Aviation Authority,
Maritime and Port Authority. It has been identified that these hierarchies have not been considered in the data.gov.sg system's user interface, which makes it harder for data consumers to associate data from the same domain that is exposed under different end points. The existence and flow of data in the current data ecosystem has been studied using iDA system design documents, preceded by discussions with Infocomm Development Authority (iDA), Singapore Land Authority (SLA) and staff of iDA's IT vendor NIIT Technologies working on the data.gov.sg (DGS) and OneMap6 projects.
3.1.1 Artifacts Analyzed
- Source Data forms
- Data transfer and storage in data.gov.sg (DGS) system
- Data upload and download specifications
- Destination data forms
- Existing APIs for data consumption
- Selected data sets for Linked Data conversion study
- Current data consumption points for the selected data
3.1.2 Source Data Forms
The data provided by the agencies exists either in textual or spatial form. The textual data from different agencies are
stored in the SG DATA database maintained by the Department of Statistics (DOS). The spatial data from agencies is held in the OneMap database maintained by the Singapore Land Authority (SLA).
3.1.3 Existing API Services for Data Consumption
DGS provides Application Programming Interfaces (API) to facilitate easy data consumption in customized applications
for the public and developer communities. OneMap API provided by Singapore Land Authority, Traffic-related APIs
from Land Transport Authority, Tourism-related APIs from the Singapore Tourism Board, Environment-related APIs
from the National Environment Agency and Library-related data feeds and web services from National Library Board are
the APIs provided by DGS.
3.1.4 Issues Identified with the Existing DGS System
1. Agencies do not provide raw data to Infocomm Development Authority. Aggregated report data from agencies is split into two dimensions and one factual field: X dimensions representing columns, Y dimensions representing rows and a 'data points' field representing cells. These fields are provided in an XML file and sent to Infocomm Development Authority on a periodic basis. There is no separate master data file, and the hierarchy in the master data dimensions is not explicitly set or provided. Therefore, a mechanism to identify the master data and the relationships between different levels in the master data dimensions is not available. (Refer Appendix, Section 1, Singapore Data Eco System, Data transfer and storage)
2. In the data.gov.sg (DGS) database, the data is dumped into just two columns and references are made with the field 'Metadata id'. There is an inherent lack of meaning about the stored data and its attributes (Refer Appendix, Section 1, Singapore Data Eco System, Data transfer and storage). In most cases, there are also issues with granularity (for example, location is expressed as streets and roads in the Urban Redevelopment Authority 'sites for sale' dataset, whereas it is expressed as planning areas in the Department of Statistics 'population trends' data set).
6 http://onemap.sg/
3. There is a conflict of Metadata standards between the textual and spatial data in the DGS and OneMap databases (for example, 'category' has a different set of values and classification in data.gov.sg7 and OneMap8).
4. There is no master data management system that standardizes the master data values across agencies.
Standardization is required to link common data in the data sets.
5. The current Governmental Data ecosystem of Singapore provides multiple end points to the users such as API,
web services and files (Figure 2) making interlinking a complex process in application development (Refer
Appendix, Section 1, Singapore Data Eco System, Destination Data forms).
6. APIs available for data consumption at the end-user level restrict the data view and the possibility of inter-linking by fixing the coverage of a single API call to a single domain such as transport or health. OneMap API is an example (Refer Figure 2 of the Methodology section).
7. The download/upload specifications of the XML files sent by individual agencies have structural redundancy, resulting in larger file sizes (Refer Appendix, Section 1, DGS Eco System, Data upload and download specifications).
7 http://data.gov.sg/Metadata/SGMatadata.aspx?id=0706010000000007035O&mid=26883&t=TEXTUAL
8 http://www.onemap.sg/themeexp/themeexplorer.aspx
Figure 2. Singapore Government Data Ecosystem
(A clearer and larger view of the figure is shared on a Google drawing board.)
3.2 DESIGN
The construction of the framework was done in the Design phase. The devised steps encompass every activity involved in
the migration with clearly identified entry and exit points in each step. An inventory list of the Linked Data tools that can
be used in each step of the framework has been prepared by collecting information from literature and other web sources.
3.3 EVALUATION
The devised framework along with the recommended tools was evaluated with the pilot data sets. Three prominent RDF conversion tools (Google Refine, RDF Views and RDF Sponger) were used for testing the migration of the data to Linked Data format. The issues that iDA could encounter during the actual implementation were identified by this process of evaluating the tools and steps.
The key objective is to provide a launch pad for iDA to get a first-hand view of the existing Linked Data tools in the market. Mitigation measures for the issues identified in each step have also been discussed. Evaluation criteria, usage and issues encountered are summarized with appropriate screenshots. This is intended to save time for iDA during the actual implementation.
The final steps after RDF creation offer less scope for evaluation in the current research. More issues may be encountered as the volume of data increases.
4.0 MIGRATIONAL FRAMEWORK
The migrational framework has been devised based on the SG Data eco system study and Linked Data literature review.
iDA can use the framework in an incremental/prototype mode so as to get an early view of the final output. The following
subsections explain each step from a Business Process perspective, i.e. Input, Process and Output, followed by Issues, Recommended Tools and Suggestions (as in Figure 3).
Figure 3. Migrational Framework
4.1 STEP 1 - SPECIFICATION
Specification is a de-facto step for a project at enterprise or government level. The nature of this step varies between the Program level (iDA's Linked Data Program) and the Project level (an agency's Linked Data implementation). The activities in this step were decided based on a study of the system and database specification documents of data.gov.sg (DGS), undertaken to understand the logical model connecting the tables used to generate the data sets and to interpret the required fields.
There are two sub-steps in the process: Identification (Program level) and Analysis (Project level).
4.1.1 SUB STEP 1 - Identification Activities at Program Level
4.1.1.1 High Level Architecture
Figure 4 provides various migration options for the different types of sources. There are two types of data sources – Structured and Unstructured. Structured data comes from Excel, CSV and text (with header) files and from RDBMS. Unstructured data is found in PDF, HTML and image files. We foresee the use of most methods from Figure 4, except CMS/RDFa, for the iDA Linked Data implementation.
Figure 4. High level Architecture of Linked Data (Hyland and Wood, 2011)
4.1.1.2 Data Set Migration Potential
The source data sets are to be classified for deciding the order of Linked Data implementation. The classification is based
on two measures: 1) utility level of the data set and 2) scope for interlinking with other data sets across domains. Based on these two measures, a data set can be classified into high, medium and low potential levels.
Data set | Data set URL | Data Type | Agency | Utility Level | Interlinking Possibility | Potential Level
Annual Vehicle Population by Type of Fuel Use | URL | Textual (PDF) | LTA | H | L | M
Administrative Data - Employment Statistic | URL | Textual (HTML) | MOM | H | M | H
After Death Facilities | URL | Spatial (KML) | NEA | L | H | H
Listing of approved Building Plans | URL | Textual (XLS) | BCA | L | H | H
Table 2. Classification of data sets based on Migration Potential
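As an illustration only (not part of the original framework), the short Python sketch below encodes one possible combination rule for this classification. The weighting, in which the interlinking level counts twice as much as the utility level, is an assumption chosen so that the rule reproduces the rows of Table 2; iDA would calibrate its own rule.

   # Illustrative sketch: one possible way to encode the migration-potential
   # classification of Section 4.1.1.2. The weighting below is an assumption.
   LEVELS = {"L": 1, "M": 2, "H": 3}

   def migration_potential(utility, interlinking):
       """Return 'H', 'M' or 'L' from the utility and interlinking levels."""
       score = LEVELS[utility] + 2 * LEVELS[interlinking]
       if score >= 7:
           return "H"
       if score >= 4:
           return "M"
       return "L"

   datasets = [
       ("Annual Vehicle Population by Type of Fuel Use", "H", "L"),
       ("Administrative Data - Employment Statistic", "H", "M"),
       ("After Death Facilities", "L", "H"),
       ("Listing of approved Building Plans", "L", "H"),
   ]
   for name, utility, interlinking in datasets:
       print(name, "->", migration_potential(utility, interlinking))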
4.1.1.3 Perspective
Perspective is a direction-based factor for Linked Data implementation. There are two options, vertical and horizontal. The first is agency-specific implementation, where specific agencies such as the National Environment Agency or the Department of Statistics choose to convert their domain data; this is the vertical perspective. The second option is application-specific implementation, where the data used by specific applications is converted; this is the horizontal perspective, as the data for an application can be retrieved from different agencies based on the usage context. The recommended option is agency-specific implementation, similar to the UK Linked Data implementation. For example, Linked Data implementation could be started with SLA data.
4.1.1.4 Dataset Licensing
The next factor is the assignment of a license to each data set. This often depends on the source of the data. The license type is based on the sensitivity and confidentiality of the data, and the access rights and means of data access are decided based on the license (Refer Table 3). The task here is to categorize the data set types and assign the appropriate license. A few of the available license types are:
1) The Open Database License (ODbL)9
2) Open Data Commons Attribution License10
3) The Creative Commons Licenses11
4) The Public Domain Dedication and License (PDDL)12
Data Set | Dataset URL | Data Type | Parent Agency | Assigned License | Access Rights | Data Access Modes
Annual Vehicle Population by Type of Fuel Use | URL | Textual (PDF) | LTA | PDDL | R | API, SPARQL, RDF Dump
Administrative Data - Employment Statistic | URL | Textual (HTML) | MOM | PDDL | R | API, SPARQL, RDF Dump
After Death Facilities | URL | Spatial (KML) | NEA | ? | R | API, SPARQL
Listing of approved Building Plans | URL | Textual (XLS) | BCA | CC | R | API, SPARQL, RDF Dump
DGS Tables | NA | Textual (RDBMS) | iDA | NA | NA | API
Table 3. Classification of data sets based on License
For example, the first two data sets, with open-type licenses, can be released on the web directly as RDF dumps, whereas the other data sets might require restricting direct RDF access and providing only a SPARQL endpoint (a means to query Linked Data) or an Application Programming Interface (API).
4.1.1.5 Implementation Method
Linked Data implementation method is a very important factor as it determines the overall effectiveness of the Linked
Data solution. We have identified three methods based on different usage scenarios and complexity level.
Method 1 – Linked Data only (just URI linking): the second quickest mode of implementation, it enables re-use of instances across agencies. It can be used for simple publishing, as there will not be many RDF triples describing each resource.
9 http://opendatacommons.org/licenses/odbl/
10 http://opendatacommons.org/licenses/by/
11 http://creativecommons.org/
12 http://opendatacommons.org/licenses/pddl/1-0/
An example is http://dbpedia.org/page/Jurong_Camp_I. The disadvantage is the additional work needed to obtain actual data about the URIs from the user and the system. iDA can use this method to test the URI management process of centralized (DGS) vs. decentralized (agency-controlled) administration.
*Recommended* Method 2 - Linked Data + RDF: the most comprehensive use of Linked Data, it requires iDA to do additional work to create RDF data for each resource. An example of this type of implementation is http://dbpedia.org/page/Singapore. It therefore needs extensive storage. iDA can use this method to realize the benefits of Linked Data and Semantic Web standards in full. A decision to use this mode can be taken after evaluating the first prototype/POC application.
Method 3 – Linked Data for files (URIs for files only): the quickest mode of implementation, it can help search and retrieval operations by offering rich metadata for files in the system. There is no interlinking between data sets, and the applicability of this method is at the file level only, not at the data level. iDA can use this method to improve the discovery of file-based data sets on the DGS website.
4.1.2 SUB STEP 2 - ANALYSIS ACTIVITIES AT PROJECT LEVEL
System specifications, design & integration documents (including database) of the selected data sets are to be collected for
studying Metadata, schema design and Entity Relationship (ER) models. We performed this exercise by going through the
documents provided by iDA. The design issues were ascertained and an overall understanding of the system was achieved.
Sub-step | Input | Process | Output
Identification (Program) | Budget; Program Duration; Objectives; Set of available data sets | Data set License Assignment; Implementation method selection; Architecture Design; Implementation Perspective; Classification of data sets by Migration Potential | Project Roadmap; High Level Architecture; Implementation Perspective (Vertical or Horizontal); Classified set of data sets by Migration Potential and License; Implementation method
Analysis (Project) | Specifications; Relational models; ER models | Analysis | Overview of data set; Design issues
Table 4. Overview of Specification Step
4.1.3 IDENTIFIED ISSUES
1) The improper licensing of data sets can lead to illegal usage and leakage of sensitive and critical information (eg: crime
and abortion rates in Singapore).
2) Future public data information system projects need to be planned with the possibility of a Linked Data downstream option. Improper planning leads to changes in architecture and affects project timelines.
4.1.4 RECOMMENDED TOOLS
There are no Linked Data specific tools recommended for this step due to its strategic nature. We used tools such as MS Project and MS Visio for planning and design.
4.2 STEP 2 - OBJECT MODELING
Object Modeling is aimed at identifying the objects (elements in data sets) & their relationships in the data sets. This
exercise is usually carried out with help from domain experts of the respective government agencies. Objects from the
data sets (eg: Location, Building Type, Race and Area) should be drawn on a whiteboard and lines should be drawn to
indicate the relationships between the objects (eg: Locations fall under Zone, population falls under demographic data
etc.). The entire exercise is to be done with coverage in mind and not based on the futuristic uses of the applications. It is
recommended to normalize the model in 3NF (Third Normal Form) before starting the exercise.
We drew the conceptual view of the data set objects on the whiteboard (as in Figure 5). The key learning from this experience was the relative ease of identifying the use of common classes between the data sets (ex: the use of the Location object) and of brainstorming the relationships between objects from different data sets.
Figure 5. Object Modeling done by the Project Team
Sub-step | Input | Process | Output
Object Modeling (for individual data sets and between data sets) | Overview of data sets; Relational model of data sets | Drawing objects and linking them on a whiteboard | Conceptual view of all objects in the system with their interconnections
Table 5. Overview of Object Modelling Step
4.2.1 IDENTIFIED ISSUES
1) There might be a tendency to apply high abstraction to some objects and concurrently apply high granularity to other objects. For example, middle levels in the zone hierarchy can be missed and road-level data can be mistaken for classes.
2) Hastened closure of this exercise can result in poor design. The output diagram should be signed off by domain experts
of the concerned government agencies.
4.2.2 RECOMMENDED TOOLS
Concept maps13 are an apt tool for object modelling. We recommend using a whiteboard for this purpose, even though the objects and their relationships can also be drawn electronically.
4.3 STEP 3 - ONTOLOGY MODELING
The interlinking of data within a data set and between data sets is achieved by modelling abstract entities close to the real world into classes and sub-classes, with their data properties and instances. The resulting structure is called an ontology or concept vocabulary. The efficiency of these ontologies lies in reusing data across different data sets rather than recreating it, and in constraining the use of data with a significant degree of structure. In the current data.gov.sg (DGS) system, the data from the two pilot data sets is used in separate contexts: singstat visualizations in charts and the Urban Redevelopment Authority (URA) site map. A common use-case scenario across domains is provided below:
13 http://cmap.ihmc.us/publications/researchpapers/theorycmaps/theoryunderlyingconceptmaps.htm
“Consider an industrial entrepreneur intending to buy a site from Urban Redevelopment Authority (URA) to start his new factory. He might be interested to look at other aspects of the site along with the data provided by URA, such as plot ratio and development type, to validate his purchase option instantaneously – say the population trends (age group, ethnicity, living standard, employability) and even the location trends (transportation available to the site location, nearest expressway, distance to the nearest port, air and sea, for shipment).”
The object models of the data sets developed in the previous step were used to study the higher-level relationships within the data. In the population trends data, the following hierarchies and relationships were identified. The population demographics have the examining factors age, race, location, gender and employment status. A location is a region, a planning area or a street location. The dwelling types are HDB, Condominiums, Private Flats or Landed Property. One of the object properties of planning area is 'has Site Location'. This property carries an inverse relationship with the site location property 'under Planning Area'. The URA sites-for-sale data is organized based on the sites (site locations as classes). The identified sub-class is the type of development code, which in turn has a sub-class, 'Type of development allowed'. (Refer Section 1.0 in Companion Document 1)
The common field that can be interlinked across these data sets is the 'Location' field. An ontology was developed to establish the relationship between the data sets based on the location data, and the existing vocabulary Geonames14 is reused. The presence of data at different granularity levels is a challenge for standardization: the Urban Redevelopment Authority (URA) sites-for-sale data contains location at the specific site-location level only; the Department of Statistics (DOS) population trends data set contains both the planning-area and specific site-location levels; and the Geonames vocabulary contains classes at the postal-code level. A standardized location vocabulary is therefore derived from the site hierarchy provided by the Urban Redevelopment Authority (URA), mapped to both data sets, extended to the postal-code level and reusing the Geonames vocabulary (URIs). The country has the regions Central, West, East, North and North East. It has 12 urban planning areas (eg: Clementi, Boon Lay, Jurong West). An urban planning area has site locations such as Pioneer Road North and Woodlands Avenue. A site location has a postal code (ex: 628462 for Pioneer Road North).
Networking ontologies and extending the relationships across domains improves machine reasoning and enables intuitive information retrieval from the user's perspective, as seen above.
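A minimal sketch of this location ontology, written with the Python rdflib library, is shown below. It is our illustration, not the project's actual OWL file; class and property names such as dgs:PlanningArea and dgs:hasSiteLocation are assumptions that mirror the hierarchy described above.

   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import OWL, RDF, RDFS, XSD

   DGS = Namespace("http://data.gov.sg/ontology/")
   LOC = Namespace("http://data.gov.sg/location/")
   g = Graph()
   g.bind("dgs", DGS)
   g.bind("owl", OWL)

   # Location hierarchy: Region > PlanningArea > SiteLocation (with postal code)
   for cls in ("Region", "PlanningArea", "SiteLocation"):
       g.add((DGS[cls], RDF.type, OWL.Class))

   # Inverse object properties linking planning areas and site locations
   g.add((DGS.hasSiteLocation, RDF.type, OWL.ObjectProperty))
   g.add((DGS.hasSiteLocation, RDFS.domain, DGS.PlanningArea))
   g.add((DGS.hasSiteLocation, RDFS.range, DGS.SiteLocation))
   g.add((DGS.underPlanningArea, RDF.type, OWL.ObjectProperty))
   g.add((DGS.underPlanningArea, OWL.inverseOf, DGS.hasSiteLocation))

   # Postal code as a data property of a site location
   g.add((DGS.postalCode, RDF.type, OWL.DatatypeProperty))
   g.add((DGS.postalCode, RDFS.domain, DGS.SiteLocation))
   g.add((DGS.postalCode, RDFS.range, XSD.string))

   # Example instances from the text
   g.add((LOC.jurong_west, RDF.type, DGS.PlanningArea))
   g.add((LOC.pioneer_road_north, RDF.type, DGS.SiteLocation))
   g.add((LOC.jurong_west, DGS.hasSiteLocation, LOC.pioneer_road_north))
   g.add((LOC.pioneer_road_north, DGS.postalCode, Literal("628462")))

   print(g.serialize(format="turtle"))

In practice the same structure would be authored in the NeOn Toolkit or Protégé, as recommended below; the sketch only makes the drill-down/roll-up relationship explicit.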
Sub-step | Input | Process | Output
Ontology Modeling (for individual data sets and between data sets) | Individual data sets (URA and DOS data sets); Object models developed in the previous step | Creating explicit relationships with classes, sub-classes, object and data properties on the domain-specific information (Refer Appendix, Section 5, Figures 10, 11 & 12) | Ontology linking urban planning and population domain data sets (Refer Appendix, Section 5, Figure 13)
Table 6. Overview of Ontology Modelling Step
4.3.1 ISSUES
1) Resolving conflicting vocabulary and Metadata standards in data.gov.sg (DGS) and OneMap as explained in the
issues with the existing system could be a challenge.
2) Data sets exist at different levels of granularity. For example, location is expressed in different forms in different instances – urban planning area, specific site location or postal code. Building a common location vocabulary that networks these different granularities allows applications to drill down or roll up the location information chain uniformly, with a single version of truth.
14 http://www.geonames.org/ontology/documentation.html
4.3.2 RECOMMENDED TOOLS
The NeOn Toolkit is recommended for the design, storage, version maintenance and deployment of large ontologies (untested). It has an option to import standard reusable vocabularies automatically and is best suited at enterprise level for its ease of use in interfacing vocabularies and for visualizing ontology graphs, with defined links and hierarchies, in a single shot as a Flash video or image. Protégé, an open source tool, can be used for prototyping.
4.4 STEP 4 - URI NAMING STRATEGY
Uniform Resource Identifiers (URIs) are the building blocks of RDF. URIs provide names for the resources in the system. The aspects/factors related to this step are provided below.
4.4.1 SELECTION OF URI ADMINISTRATION MODES
We have identified three URI administration modes. They are provided below:-
1) *Recommended* Maintained centrally in the DGS platform (resultant URIs will start with http://data.gov.sg/)
2) Maintained by individual agencies (resultant URIs will start with http://ura.gov.sg or http://sla.gov.sg).
3) Maintained externally by third-party platforms such as Kasabi (resultant URIs will start with http://data.kasabi.org) – no longer valid as the Kasabi service has been shut down.
4.4.2 URI TAXONOMY
A key decision area is the taxonomy design of the URIs, i.e. the structure of the URIs. Most government Linked Data systems use the TBox and ABox approach to differentiate classes and their instances: TBox refers to the ontology URIs (http://data.gov.sg/ontology/) and ABox refers to the resource (instance) URIs (http://data.gov.sg/resource/).
We are of the view that the URI minting process would be a challenge for iDA if it goes for a piecemeal approach, i.e. a scattered implementation of Linked Data. URI construction should be done as a holistic process with good foresight on the future prospects of the system. iDA can arrange a workshop with all government agencies to educate them and to look for ideas pertaining to the representations that the agencies would want. The output of the workshop would be the taxonomy of the URI structure. We provide some recommended URI patterns for implementation in Table 7.
Table 7. URI Patterns for Singapore Data

TBox (ontology URIs) | ABox (instance URIs)
http://data.gov.sg/ontology/Ministry/ | http://data.gov.sg/ministry/MOH
http://data.gov.sg/ontology/Agency/ | http://data.gov.sg/agency/SLA
http://data.gov.sg/ontology/SiteLocation | http://data.gov.sg/location/pioneer_road_north
http://data.gov.sg/ontology/Race | http://data.gov.sg/race/chinese

Dataset URIs
Dataset ID | URAstaticfile001
Dataset | http://data.gov.sg/dataset/URAstaticfile001/
Class | http://data.gov.sg/terms/class/URAstaticfile001/sitesforsale
Property | http://data.gov.sg/terms/property/URAstaticfile001/time
Row 1 | http://data.gov.sg/dataset/URAstaticfile001/1
Row 1 - A generic column | http://data.gov.sg/dataset/URAstaticfile001/1/columnName

Sub-step | Input | Process | Output
URI Naming | List of resources; List of Classes and Properties | Visualization of the URI minting process from a Linked Data tools and usage perspective | URI Administration method; Base URI; Taxonomy of URI structure
Table 8. Overview of URI Naming Step
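To make the minting rule concrete, the sketch below (our illustration, assuming the centrally administered http://data.gov.sg/ base URI and the patterns in Table 7) shows how a human-readable label could be normalised into a URI slug. The normalisation rules are assumptions, not an iDA standard.

   import re
   import unicodedata

   BASE = "http://data.gov.sg"   # assumed central (DGS-administered) base URI

   def slug(label):
       """Normalise a label into a URI-safe token,
       e.g. 'Pioneer Road North' -> 'pioneer_road_north'."""
       label = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
       label = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
       return label.lower()

   def instance_uri(collection, label):
       """ABox (instance) URI, e.g. /location/pioneer_road_north."""
       return "%s/%s/%s" % (BASE, collection, slug(label))

   def ontology_uri(class_name):
       """TBox (ontology) URI, e.g. /ontology/SiteLocation."""
       return "%s/ontology/%s" % (BASE, class_name)

   print(instance_uri("location", "Pioneer Road North"))
   print(instance_uri("race", "Chinese"))
   print(ontology_uri("SiteLocation"))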
4.4.3 IDENTIFIED ISSUES
1) The usage of different tools can hamper the URI naming process if the selected RDF conversion tools do not support custom coding of URIs.
2) Dead links can affect applications that use the URIs. This can be monitored by using Vapour15, a URI validation tool recommended by the Pedantic Web Group16.
4.4.4 RECOMMENDED TOOLS
The URI naming strategies of Hendler (2012)17, Vila-Suero (2012)18 and Davidson (2009)19 provide good guidelines. Pubby20 is a Linked Data tool that can be used for external URI dereferencing.
4.5 STEP 5 - RDF CREATION
RDF triples are the building blocks of Linked Data systems. This step covers the conversion of data from the source system formats to the Linked Data format. We have split the conversion process into three categories based on the data source type: static files conversion (S2R), Relational Database Management System (RDBMS) data conversion (D2R) and API data conversion (A2R). Examples of the different source formats are given in Table 9 and Figure 6 below.
Type | Example of Singapore data sets | Source format
S2R | URA Sites for Sale, Singstat's Population Household Characteristics | XLS, CSV, TXT files and other static files
D2R | DGS tables | RDBMS
A2R | OneMap API, myTransport API, NLB web services | Application Programming Interface (API) and Web Services (SOAP, REST)
Table 9. Conversion Types with Examples
15 http://validator.linkeddata.org/vapour
16 http://pedantic-web.org
17 http://logd.tw.rpi.edu/instance-hub-uri-design
18 http://www.w3.org/2011/gld/wiki/223_Best_Practices_URI_Construction
19 http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
20 http://www4.wiwiss.fu-berlin.de/pubby/
Figure 6. RDF Conversion Types and Endpoints for Singapore Public Data
4.5.1 STATIC FILES CONVERSION (S2R)
Government agencies predominantly publish their data (about 80% of total government data sets) in static file format (ex: Excel spreadsheets), with internal representation in the form of tables or crosstabs. These files are to be loaded into Google Refine21 for transforming the fields in the data set into RDF triples. Google Refine performs three core functions and one optional function: a) correction of errors in data sets, b) creation of new fields, c) mapping fields to ontologies and d) reconciliation with public vocabularies and data sets (optional). We used the URASiteSales Excel file for evaluating Google Refine. Details of the transformations implemented, screenshots and output files are provided in Section 2.0 of Companion Document 1.
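The S2R idea can also be pictured with a few lines of Python and rdflib. The sketch below is our own simplified stand-in, not the Google Refine workflow; the file name and column headings are assumptions, and the row-URI pattern follows Table 7.

   import csv
   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import RDF

   DGS = Namespace("http://data.gov.sg/ontology/")
   DS = Namespace("http://data.gov.sg/dataset/URAstaticfile001/")
   PROP = Namespace("http://data.gov.sg/terms/property/URAstaticfile001/")

   g = Graph()
   g.bind("dgs", DGS)

   # Hypothetical input file with columns such as site_location, development_type, site_area
   with open("ura_sites_for_sale.csv", newline="") as handle:
       for i, row in enumerate(csv.DictReader(handle), start=1):
           subject = DS[str(i)]                  # one URI per row, as in Table 7
           g.add((subject, RDF.type, DGS.SiteLocation))
           for column, value in row.items():
               if value:                         # simple cleansing: skip empty cells
                   g.add((subject, PROP[column], Literal(value.strip())))

   g.serialize(destination="ura_sites_for_sale.ttl", format="turtle")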
4.5.2 RELATIONAL DATABASE CONVERSION (D2R)
We recommend the STDTrip methodology22 devised by the Brazilian open data team as the first step in the D2R conversion process. STDTrip gives six distinct steps that aid in mapping the RDBMS model to ontologies/vocabularies. The crux of this process is to use the ER diagram of the DGS data model to derive the corresponding classes and their properties. This process has some of its footing in the Ontology Modelling step.
Virtuoso Universal Server's RDF Views23 is recommended for the D2R conversion. The mapping files generated by the STDTrip process are to be converted into RDF Views syntax. The RDF conversion can be done on a daily basis (after the refresh of the DGS base tables) or on a need basis. We created a table directly in Universal Server for evaluating RDF Views and its mapping features. Execution steps and screenshots of the RDF Views tool are provided in Section 3.0 of Companion Document 1.
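For comparison only, a heavily simplified D2R sketch is shown below. It is not RDF Views or R2RML syntax; it assumes a local SQLite copy of a DGS-style table and a hand-written column-to-property mapping of the kind the STDTrip process would produce. The table and column names are assumptions.

   import sqlite3
   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import RDF

   DGS = Namespace("http://data.gov.sg/ontology/")
   RES = Namespace("http://data.gov.sg/dataset/dgs_sites/")

   # Hand-written mapping from table columns to ontology properties
   # (in practice this would come out of the STDTrip mapping files).
   COLUMN_TO_PROPERTY = {
       "site_location": DGS.siteLocationName,
       "planning_area": DGS.planningAreaName,
       "tender_price": DGS.tenderPrice,
   }

   g = Graph()
   conn = sqlite3.connect("dgs_copy.db")            # assumed local copy of DGS data
   cursor = conn.execute(
       "SELECT id, site_location, planning_area, tender_price FROM sites_sold"
   )
   for row_id, *values in cursor:
       subject = RES[str(row_id)]
       g.add((subject, RDF.type, DGS.SiteLocation))
       for column, value in zip(COLUMN_TO_PROPERTY, values):
           if value is not None:
               g.add((subject, COLUMN_TO_PROPERTY[column], Literal(value)))
   conn.close()

   g.serialize(destination="dgs_sites.ttl", format="turtle")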
4.5.3 APPLICATION PROGRAMMING INTERFACE CONVERSION (A2R)
RDF Sponger24 (another feature of Virtuoso Universal Server) was evaluated for the A2R conversion. The A2R conversion is real-time in nature. API services provided by SLA (OneMap API), LTA (myTransport API) and NLB (web services) can be authorized in Universal Server with authentication keys. This is followed by the creation of external cartridges (mapping files) for extracting and mapping the contents of the API output to the corresponding classes and properties in the vocabularies used for the data sets. Most of the APIs provide data in the form of JSON25 (OneMap), XML (MyTransport) or plain text. Separate cartridges are to be created for each of these content types. A sample OneMap API was used to test this functionality. Details of the OneMap API used and screenshots from RDF Sponger have been provided in Section 4.0 of Companion Document 1.
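The A2R path can be pictured with the simplified sketch below. It is our illustration, not a Virtuoso Sponger cartridge; the endpoint URL, the JSON field names and the target properties are hypothetical placeholders, and a real cartridge would be driven by the authorised agency APIs.

   import json
   import urllib.request
   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import RDF

   DGS = Namespace("http://data.gov.sg/ontology/")
   FAC = Namespace("http://data.gov.sg/facility/")

   # Hypothetical JSON endpoint standing in for an agency API such as OneMap.
   API_URL = "https://api.example.gov.sg/facilities?type=childcare"

   with urllib.request.urlopen(API_URL) as response:
       payload = json.load(response)

   g = Graph()
   for record in payload.get("results", []):        # assumed response envelope
       subject = FAC[record["id"]]                  # assumed field names
       g.add((subject, RDF.type, DGS.Facility))
       g.add((subject, DGS.facilityName, Literal(record["name"])))
       g.add((subject, DGS.hasZone, Literal(record["zone"])))

   print(g.serialize(format="turtle"))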
Sub-step | Input | Process | Output
STDTrip Process | Relational Model; ER Model; Identified Vocabularies | STDTrip process for generating mapping files | Mapping and Schema files
S2R | Spreadsheets (xls, csv); Other text formats | Cleansing; Transformation | RDF Triples
D2R | SQL of tables; Mapping files | Conversion from table format to RDF | RDF Triples
A2R | API URLs; Mapping files | Identification of API content type; Conversion to RDF | RDF Triples
Table 10. Overview of RDF Creation Step
21 http://code.google.com/p/google-refine
22 STDTrip Methodology http://www.inf.puc-rio.br/~casanova/Publications/Papers/2010-Papers/2010-SBBD-StdTrip.pdf
23 http://virtuoso.openlinksw.com/whitepapers/relational%20rdf%20views%20mapping.html
24 http://virtuoso.openlinksw.com/presentations/Virtuoso_Sponger_1/Virtuoso_Sponger_1.html
25 http://www.json.org/
4.5.4 IDENTIFIED ISSUES
1) Issues in this step are mainly related to data quality, versioning, source system unavailability, URI management and non-adherence to mapping rules. The existing Linked Data literature on handling updates to existing data is not extensive, and iDA would be best advised to devise a strategy that tests the prototype with multiple real-time updates (eg: traffic-related data from LTA).
2) The real-time nature of A2R makes it completely dependent on the source APIs. The absence of notification about API outages can cause the system to return null or invalid results (if caching is enabled).
3) Google Refine does not dereference the URIs that it creates for each row in the source file. Dereferencing is to be done externally; Pubby26 provides this functionality as a solution.
4) Changes to SQL queries, updates to DB tables and changes to API output are to be communicated to the appropriate personnel before implementation. Mapping files are supposed to be updated before the aforementioned changes go into production.
4.5.5 RECOMMENDED TOOLS
The complete list of tools for the RDF creation step is available in Section 2.2 of Companion Document 2. Selection of the tool is largely dependent on its inter-operability features. We recommend Virtuoso Universal Server for its completeness for both D2R and A2R, and Google Refine for the S2R process because of its support for the complex custom scripting needed while transforming data from its source format to the required format (eg: the multidimensional nature of the statistics-related data provided by the Department of Statistics).
4.6 STEP 6 - EXTERNAL LINKING
Linked Data is known for its prescription of four key principles (http://www.w3.org/DesignIssues/LinkedData.html). Even though the adoption of the fourth principle is optional for governments and enterprises, the web-of-data vision can be brought forth only by linking data sets across the globe (the web of data depicted by the LOD cloud27). The current DGS system does not make use of external data or provide private firms' data. Linked Data applications would benefit greatly from the existence of links between government data sets and popular data hubs such as DBpedia (Wikipedia) and Freebase. We have identified some scenarios for linking Singapore data sets with external data sets from an application development perspective:
1) Wikipedia for use of textual content pertaining to places, events and entities in Singapore
2) Geonames28 for use of spatial data
3) Flickr wrapper29 for use of openly licensed images pertaining to places and events
4) World Bank30 and CIA Factbook31 data sets to facilitate economic analysis of Singapore and other nations on common dimensions
5) UN data sets such as the FAO32 data sets for comparing imports and exports of commodities in Singapore
6) The Supreme Court database (SCDB) with DBpedia and news agencies such as the Los Angeles Times and the New York Times for analyzing the decisions made33.
We recommend that data set owners look at the open government data catalogue34 and the Datahub35 for potential data sets, to identify useful data sets that could be linked to SG data sets.
26 http://www4.wiwiss.fu-berlin.de/pubby/
27 http://richard.cyganiak.de/2007/10/lod/
28 http://linkedgeodata.org/
29 http://www4.wiwiss.fu-berlin.de/flickrwrappr/
30 http://worldbank.270a.info/
31 http://www4.wiwiss.fu-berlin.de/factbook/
32 http://data.fao.org/
33 http://www.youtube.com/watch?v=hawT84z0U30
34 http://opengovernmentdata.org/data/catalogues/
The identified data sets can be compared with the government data sets using tools such as SILK36 and LIMES37, which automate the linking process between data sets. The linking is not contextual; instead it is made between semantically related resources (using the OWL property owl:sameAs). Most of the links will be made to the data sets of DBpedia and Geonames, as these services are regarded as good hubs of general factual and spatial data respectively. Examples are provided below.
<http://data.gov.sg/location/bugis> <owl:sameAs> <http://dbpedia.org/resource/Bugis>
<http://data.gov.sg/race/chinese> <owl:sameAs> <http://dbpedia.org/resource/Chinese_people>
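The report recommends SILK or LIMES for this matching. Purely as an illustration of the underlying idea, the sketch below uses rdflib and exact, case-insensitive label matching to emit owl:sameAs links between a local graph and a set of candidate external resources; the similarity measure and the stand-in data are assumptions, far cruder than what SILK or LIMES provide.

   from rdflib import Graph, Literal, Namespace, URIRef
   from rdflib.namespace import OWL, RDFS

   LOC = Namespace("http://data.gov.sg/location/")

   # Local resources with labels (stand-in data).
   local = Graph()
   local.add((LOC.bugis, RDFS.label, Literal("Bugis")))
   local.add((LOC.pasir_ris, RDFS.label, Literal("Pasir Ris")))

   # Candidate external resources, e.g. harvested from a DBpedia dump (stand-in data).
   external = {
       "bugis": URIRef("http://dbpedia.org/resource/Bugis"),
       "pasir ris": URIRef("http://dbpedia.org/resource/Pasir_Ris"),
   }

   links = Graph()
   links.bind("owl", OWL)
   for subject, _, label in local.triples((None, RDFS.label, None)):
       match = external.get(str(label).lower())
       if match is not None:                        # crude similarity: exact label match
           links.add((subject, OWL.sameAs, match))

   print(links.serialize(format="turtle"))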
Sub-step | Input | Process | Output
External Linking | Demographic and Spatial data sets (provided by DOS and SLA); Identified external data sets (DBpedia, Geonames, CIA Factbook) | Linking suggestions provided based on similarity algorithms | Outbound links to external data sets
Table 11. Overview of External Linking Step
4.6.1 IDENTIFIED ISSUES
1) Outbound links made to data sets outside of iDA's purview can be risky if the external source's data credibility is brought into question (ex: Wikipedia is built on user content and can be openly edited).
2) Dead links can arise when resource URIs change or during system downtime. An alternative solution is to import the RDF dump from the external data sets for local storage and use.
4.6.2 RECOMMENDED TOOLS
SILK41 and LIMES42 are industry-standard tools that can automate the process of finding links based on similarity measures. We recommend SILK based on its wide usage.
4.7 STEP 7 - DATASETS PUBLICATION
Datasets Publication happens at the end of the RDF creation step. The elements related to this step are described below.
4.7.1 TRIPLE STORE
The generated RDF triples are stored in a Linked Data specific data structure called a triple store or RDF store. Storage of triples can be centralized on the DGS server or decentralized on the servers of individual agencies. Centralized storage is recommended to avoid different versions of truth. Figure 6 depicts a common triple store for storing the triples from the three conversion processes.
4.7.2 METADATA PUBLICATION
Metadata complements the actual data by providing semantics about the fields and the data set. Therefore, metadata publication is needed along with the actual data publication. The literature recommends the VoID38 vocabulary for this purpose, since it covers general metadata, access-privilege metadata and external-link metadata. The triples applicable to the pilot data sets are given in Section 5.0 of Companion Document 1. Metadata publication is a key differentiator of Linked Data in comparison to traditional information systems, as the triples are created with the purpose of reducing the learning time for the developer, thereby accelerating the application development process.
35 http://thedatahub.org/
36 http://www4.wiwiss.fu-berlin.de/bizer/silk/
37 http://limes.sourceforge.net/
38 http://vocab.deri.ie/void
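A minimal VoID description of the URA pilot data set, expressed with rdflib, might look like the sketch below; the dataset URI, endpoint address and triple count are placeholders, not the project's actual published metadata.

   from rdflib import Graph, Literal, Namespace, URIRef
   from rdflib.namespace import DCTERMS, RDF, XSD

   VOID = Namespace("http://rdfs.org/ns/void#")
   dataset = URIRef("http://data.gov.sg/dataset/URAstaticfile001")

   g = Graph()
   g.bind("void", VOID)
   g.bind("dcterms", DCTERMS)

   g.add((dataset, RDF.type, VOID.Dataset))
   g.add((dataset, DCTERMS.title, Literal("Sites Sold by URA - Details")))
   g.add((dataset, DCTERMS.publisher, Literal("Urban Redevelopment Authority")))
   g.add((dataset, VOID.sparqlEndpoint, URIRef("http://data.gov.sg/sparql")))        # placeholder
   g.add((dataset, VOID.dataDump, URIRef("http://data.gov.sg/dumps/URAstaticfile001.ttl")))
   g.add((dataset, VOID.triples, Literal(12345, datatype=XSD.integer)))              # placeholder count

   print(g.serialize(format="turtle"))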
4.7.3 SPARQL AND LINKED DATA API
SPARQL39 is to Linked Data what SQL is to relational-database-based information systems. RDF data can be queried only through SPARQL. The biggest advantage of SPARQL is the CONSTRUCT query, which can generate new triples based on inference. For example, a set of triples can be generated between Regions (Central, North) and 'Cumulative Site Tender Prices' based on the common existence of the 'Location' resource/field in two data sets.
SPARQL may not find acceptance right away from developers experienced in using SQL and REST APIs. Therefore, an API wrapper called the Linked Data API (LDA)40 can be stationed on top of the SPARQL endpoint as a suitable means of abstraction. The Linked Data REST API translates the API call into a SPARQL query, retrieves the resultant data from SPARQL and then translates the data into an appropriate output format such as JSON, XML, HTML or TEXT. Data extraction from Data.gov.uk is based on this concept; the URL http://gov.tso.co.uk/transport/api/transport/doc/bus-stop-point.xml retrieves a list of bus stops in the UK in XML format. A sequence-based illustration of the Linked Data API in the DGS context is given in Figure 7. The use of LDA brings obvious advantages, as it simplifies data access and provides convenient output formats. It also means that the rules for conversion from API call to SPARQL have to be created; we foresee considerable time and effort being invested in this activity, as even a small change in API parameters would need a different SPARQL query.
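The inference described above can be sketched as a CONSTRUCT query. The example below (our illustration) runs it over an in-memory rdflib graph; property names such as dgs:underPlanningArea, dgs:inRegion and dgs:tenderPrice are assumptions carried over from the earlier modelling discussion, and the query simply relates each region to the tender prices of sites within it via the shared location resources.

   from rdflib import Graph, Literal, Namespace

   DGS = Namespace("http://data.gov.sg/ontology/")
   LOC = Namespace("http://data.gov.sg/location/")

   g = Graph()
   g.bind("dgs", DGS)
   # Stand-in triples: a site with a tender price, and the site's planning area and region.
   g.add((LOC.pioneer_road_north, DGS.tenderPrice, Literal(150000000)))
   g.add((LOC.pioneer_road_north, DGS.underPlanningArea, LOC.jurong_west))
   g.add((LOC.jurong_west, DGS.inRegion, LOC.west))

   # CONSTRUCT new region-level triples from the shared location resources.
   query = """
   PREFIX dgs: <http://data.gov.sg/ontology/>
   CONSTRUCT { ?region dgs:hasSiteTenderPrice ?price }
   WHERE {
       ?site dgs:tenderPrice ?price ;
             dgs:underPlanningArea ?area .
       ?area dgs:inRegion ?region .
   }
   """
   for triple in g.query(query):
       print(triple)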
The use of named graphs41 is critical for iDA, as updates to existing RDF triples are made in the triple store. An update query can have unexpected results if the count of triples to be updated is unknown; therefore, the use of named graphs is suggested. API-based updates and deletions can be the preferred mode, avoiding the complex nature of SPARQL update queries.
Figure 7. Linked Data API workflow for iDA. (The figure shows a Linked Data API call such as http://data.gov.sg/lda/childcare/north being translated, via an LDA-SPARQL mapping file, into a SPARQL query against the triple store:

   SELECT ?cc WHERE { ?cc dgs:haszone dgs:north . ?cc dgs:facilitytype dgs:childcare . } LIMIT 100

The resulting RDF triples, http://data.gov.sg/facility/cc/name1 through http://data.gov.sg/facility/cc/name100, are then converted from RDF to JSON and returned to the caller as JSON output entries.)
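The translation step in Figure 7 can be imagined as a small routing layer like the sketch below (our illustration only, not Puelia or Elda); the URL pattern, the mapping and the query template are assumptions for demonstration.

   from string import Template
   from urllib.parse import urlparse

   # Assumed mapping from LDA path segments to a SPARQL template,
   # mirroring the childcare/north example in Figure 7.
   QUERY_TEMPLATE = Template("""
   PREFIX dgs: <http://data.gov.sg/ontology/>
   SELECT ?facility WHERE {
       ?facility dgs:haszone dgs:$zone .
       ?facility dgs:facilitytype dgs:$facility_type .
   } LIMIT 100
   """)

   def lda_to_sparql(url):
       """Translate e.g. http://data.gov.sg/lda/childcare/north into a SPARQL query."""
       segments = [s for s in urlparse(url).path.split("/") if s]
       if len(segments) != 3 or segments[0] != "lda":
           raise ValueError("unsupported LDA call: %s" % url)
       _, facility_type, zone = segments
       return QUERY_TEMPLATE.substitute(facility_type=facility_type, zone=zone)

   print(lda_to_sparql("http://data.gov.sg/lda/childcare/north"))
   # The resulting bindings would then be serialised to JSON/XML for the caller.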
4.7.4 LINKED DATA EXTERNAL HOSTING
Linked Data hosting is another option that can be considered by iDA if it aims to use third-party platforms for data storage. Hosting in the cloud brings cost benefits, abstracts away the complexity of maintaining servers and installing software, and provides the capability to instantly create simple REST42-based APIs for data insertion, update, deletion and retrieval activities. We evaluated the Kasabi43 platform (powered by Talis44) and found the interface to be intuitive (screenshots are provided in the Appendix, Section 7.3.1). The LOGD TWC portal (http://logd.tw.rpi.edu/) is maintained by Rensselaer Polytechnic Institute and provides a Linked Data conversion service for US government data sets. iDA can consider this option based on an evaluation of the hosting vendor. UK Linked Data also makes use of the Talis platform. Important Note: the Kasabi service has been shut down as of July 2012.
39 http://www.w3.org/TR/rdf-sparql-query/
40 http://code.google.com/p/linked-data-api/
41 http://www.w3.org/2004/03/trix/
42 http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
43 http://kasabi.com/
44 http://www.talis.com/
Sub-step | Input | Process | Output
Actual data publishing | S2R RDF triples; D2R RDF triples; A2R RDF triples | Data insertion through command line or in batch | Data storage in triple store
Metadata publishing | Vocabulary classes; Key statistics from data set | Modeling in VoID | VoID triples
SPARQL | SPARQL queries | Data retrieval or graph construction | Resultant RDF triples
Linked Data API | REST based API calls | Conversion of API call to corresponding SPARQL query; Conversion of resultant RDF triples to output format (ex: JSON) | JSON/XML/HTML/TXT data
Table 12. Overview of Dataset Publication step
4.7.5 IDENTIFIED ISSUES
1) SPARQL does not currently support sub-queries and many other database operations that application developers make use of. The only alternative is to build complex SPARQL queries that perform the same operation. This could lead to a long-winded process of updating queries for even small change requests. A possible solution is a SPARQL query creation/update template (in Excel) that could help developers accelerate the query creation process.
2) The Linked Data API, with its simplicity, actually eliminates the biggest advantage that Linked Data provides: inferencing through the SPARQL CONSTRUCT query. Therefore, we recommend that a public SPARQL endpoint be maintained to facilitate innovative querying.
3) A security lapse or an unknown exploit in the Linked Data hosting environment can have serious implications if the Linked Data is not open for public use. This risk always remains with external hosting. Evaluation of the vendor's backup measures is important before agreeing on the contract.
4.7.6 RECOMMENDED TOOLS
Virtuoso Universal Server is recommended for data publication and the SPARQL endpoint. Puelia45, a PHP-based Linked Data API library, is recommended for the API layer. If iDA plans to use Java for the Linked Data API, Elda46 can be considered. The Talis platform is recommended for Linked Data hosting.
Apache Jena Framework47 (initially developed by HP Labs) provides Java-based APIs for dealing with most of the elements in the Semantic Web stack (RDF, ontologies and SPARQL). We used its SPARQL API for querying publicly available SPARQL endpoints. The framework is a secondary recommendation. The Java files and usage steps are available in Section 6 of Companion Document 1.
45 http://code.google.com/p/puelia-php/
46 http://elda.googlecode.com/hg/deliver-elda/src/main/docs/index.html
47 http://incubator.apache.org/jena/
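A minimal sketch of querying a publicly available SPARQL endpoint (DBpedia in this case) through Jena's SPARQL API is given below; it is an illustrative example rather than the exact code from Companion Document 1.

```java
import org.apache.jena.query.*;

public class PublicEndpointSketch {
    public static void main(String[] args) {
        String endpoint = "http://dbpedia.org/sparql"; // a publicly available SPARQL endpoint
        String query =
              "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
            + "SELECT ?label WHERE { <http://dbpedia.org/resource/Singapore> rdfs:label ?label } LIMIT 3";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("label")); // language-tagged labels for the resource
            }
        }
    }
}
```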
4.8 STEP 8 - EXPLOITATION AND DISCOVERY
iDA's implementation of Linked Data over its existing data platform would be considered successful based on factors such as the progressive availability of new applications created using multi-agency data sets, the number of API calls and the number of new ideas implemented as applications. A government service provider would want the general public and enterprises to exploit its service offering, and the same applies to its Linked Data offering. iDA has started the ideas4apps initiative to look for new ideas that could be implemented as web apps or mobile apps on top of government data. Such initiatives will have more impact once the data sets are released in Linked Data format. The two types of discovery are as follows:
1) Internal discovery within Singapore for local citizens
2) External discovery for attracting usage of Singapore government data in international economic and political research
Sub-step | Input | Process | Output
Exploitation & Discovery | Linked Data; existing Linked Data applications | Gamification; crowdsourcing; converting ideas to apps (related idea: refer Appendix, Section 2, Figure 7); registering in public catalogues | New applications (e.g. Eagle Eye; refer Appendix, Section 3, Figure 9); external references from the LOD cloud
Table 13. Overview of the Exploitation & Discovery step
Figure 8 is representative of the final end product of the POC. The middle box, containing the site details for Pasir Ris, is from the parent URA data set. The box on the right contains demographic details of Pasir Ris retrieved from the DOS data sets, and the box on the left contains area facilities available in Pasir Ris, provided by SLA through the OneMap API. The display of these two boxes, with data coming from different agencies, is made possible by using a common URI for Pasir Ris and by re-using this URI across three different data sources.
Figure 8. Post Linked Data Migration – a snapshot of an application (after interlinking the urban planning and population domains)
Another application prototype designed for Singapore data sets is listed in the Appendix, Section 7.2.
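The interlinking described above can be sketched as follows: because both agencies reuse one URI for Pasir Ris, their triples merge into a single graph. The URIs, vocabulary terms and population figure below are invented for illustration and do not reflect the actual data sets.

```java
import org.apache.jena.rdf.model.*;

public class SharedUriSketch {
    public static void main(String[] args) {
        // Hypothetical shared URI for the planning area "Pasir Ris".
        String pasirRis = "http://data.gov.sg/id/planning-area/pasir-ris";
        String ns = "http://data.gov.sg/ontology/"; // placeholder vocabulary namespace

        // URA triples: a site sold is located in Pasir Ris.
        Model ura = ModelFactory.createDefaultModel();
        ura.createResource("http://data.gov.sg/id/site/example-parcel") // invented site URI
           .addProperty(ura.createProperty(ns, "locatedIn"), ura.createResource(pasirRis));

        // DOS triples: resident population of the same area, described with the same URI.
        Model dos = ModelFactory.createDefaultModel();
        dos.createResource(pasirRis)
           .addLiteral(dos.createProperty(ns, "residentPopulation"), 140000); // illustrative figure only

        // Because both agencies reuse one URI, the union graph links site details and demographics.
        Model merged = ura.union(dos);
        merged.write(System.out, "TURTLE");
    }
}
```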
4.8.1 IDENTIFIED ISSUES
There are no technical issues specific to this step, as it is mainly concerned with publicising the Linked Data system supported by iDA. Technical flaws, data quality issues and application-level misrepresentations are possible issues that could cascade from the earlier steps. Faceted browsing of Linked Data can cause serious performance issues, as the volume of data retrieved for this purpose is large. Public availability of this service can therefore be restricted, even though faceted browsing is one of the best ways of slicing and dicing data during knowledge discovery.
4.8.2 RECOMMENDED TOOLS
The Linked Data browsers Sig.ma48, Tabulator49 and OpenLink Data Explorer50 can be used over the Singapore data sets for faceted browsing.
48 http://sig.ma/
49 http://www.w3.org/2005/ajar/tab
50 http://ode.openlinksw.com/
5.0 CONCLUSION
This report's main contribution is its applicability to Singapore Government data. iDA can use this report as a reference to plan its strategy for a Linked Data implementation, which seems imminent in the near future. The report shows the method of interlinking two data sets sourced from two different government agencies, along with the steps involved in using the recommended tools (Google Refine, RDF Views and RDF Sponger) with these data sets. The identified issues and recommendations in each step will be highly useful during iDA's actual implementation. The work is incomplete in the sense that a POC application was not built using the framework; therefore, we foresee the possibility of unidentified data maintenance issues cropping up during implementation. The experience of learning about new technologies and new methods of data representation, working through the case studies of other countries and contemplating the selection of the best tools was enriching for us.
6.0 REFERENCES
Bauer, F., & Kaltenbock, M. (Eds.). (2012). Linked Open Data: The Essentials – A Quick Start Guide for Decision Makers. Edition mono/monochrom. Retrieved from http://amazon.de/o/ASIN/3902796057/
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34.
Berners-Lee, T. (2006). Linked Data. Available: http://www.w3.org/DesignIssues/LinkedData.html. Last accessed 11th Jan 2012.
Bizer, C., Heath, T., Idehen, K., & Berners-Lee, T. (2008). Linked Data: Evolving the Web into a Global Data Space. In J. Hendler & F. Van Harmelen (Eds.), Proceedings of the 17th International Conference on World Wide Web (WWW '08) (Vol. 1, p. 1265). ACM Press.
Blumauer, S. (2012). Open Data for Enterprises. Available: http://www.slideshare.net/ABLVienna/open-data-for-enterprises-12126124. Last accessed 20th April 2012.
Chee Hean, T. (2011). Keynote Address by Mr Teo Chee Hean, Deputy Prime Minister, Coordinating Minister for National Security and Minister for Home Affairs at the e-Gov Global Exchange 2011. Available: http://www.ida.gov.sg/News%20and%20Events/20110620114104.aspx?getPagetype=21. Last accessed 11th Jan 2012.
Davidson, P. (2009). Designing URI Sets for the UK Public Sector. Available: http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector. Last accessed 20th April 2012.
DiFranzo, D., Graves, A., Erickson, J. S., Ding, L., Michaelis, J., Lebo, T., Patton, E., et al. (2011). The Web is My Back-end: Creating Mashups with Linked Open Government Data. In D. Wood (Ed.), Linking Government Data (pp. 205-219). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-1767-5_10
Dodds, L., & Davis, I. (2011). Linked Data Patterns: A pattern catalogue for modelling, publishing, and consuming Linked Data. Retrieved from http://patterns.dataincubator.org/book/
Haase, P., Rudolph, S., Wang, Y., Brockmans, S., Palma, R., Euzenat, J., & d'Aquin, M. (2006, November). Networked Ontology Model. Technical Report, NeOn project deliverable D1.1.1.
Halb, W., Raimond, Y., & Hausenblas, M. (2008). Building Linked Data for Both Humans and Machines. WWW 2008 Workshop: Linked Data on the Web.
Heath, T., & Goodwin, J. (2011). Linking Geographical Data for Government and Consumer Applications. In D. Wood (Ed.), Linking Government Data (pp. 73-92). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-1767-5_4
Hyland, B., & Wood, D. (2011). The Joy of Data – A Cookbook for Publishing Linked Government Data on the Web. In D. Wood (Ed.), Linking Government Data (chapter 1, pp. 3-26). Springer New York, New York, NY.
Hyland, B. (2012). Linked Data Cookbook. Available: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook. Last accessed 20th April 2012.
Hendler, J. (2012). URI Design Principles: Creating Unique URIs for Government Linked Data. Available: http://logd.tw.rpi.edu/instance-hub-uri-design. Last accessed 14th April 2012.
Miller, A. (2011). Releasing Relational Data to the Semantic Web. Available: http://www.slideshare.net/alexmiller/releasing-relational-data-to-the-semantic-web-7634727. Last accessed 20th April 2012.
Passant, A., Laublet, P., Breslin, J. G., & Decker, S. (2010). Enhancing Enterprise 2.0 Ecosystems Using Semantic Web and Linked Data Technologies: The SemSLATES Approach. In D. Wood (Ed.), Linking Enterprise Data (pp. 79-102). Springer US. Retrieved from http://dx.doi.org/10.1007/978-1-4419-7665-9_5
Pollock, J. T. (2009). Semantic Web for Dummies. Hoboken, NJ: John Wiley.
Quarteroni, S. (2011). Question Answering, Semantic Search and Data Service Querying. Proceedings of the KRAQ11 Workshop (pp. 10-17). Chiang Mai: Asian Federation of Natural Language Processing. Retrieved from http://www.aclweb.org/anthology/W11-3102
Salas, P., Viterbo, J., Breitman, K., & Casanova, M. A. (2011). StdTrip: Promoting the Reuse of Standard Vocabularies in Open Government Data. In D. Wood (Ed.), Linking Government Data (pp. 113-133). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-1767-5_6
Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web Revisited. IEEE Intelligent Systems, 21(3), 96-101.
Sheridan, J., & Tennison, J. (2010). Linking UK Government Data. WWW2010 Workshop: Linked Data on the Web (LDOW2010). ACM.
Sinclair, P. (2010). Building Linked Data Applications. Available: http://www.slideshare.net/metade/building-linked-data-applications. Last accessed 20th April 2012.
Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., & Gómez-Pérez, A. (2011). Methodological Guidelines for Publishing Government Linked Data. In D. Wood (Ed.), Linking Government Data (chapter 2, pp. 27-49). Springer New York, New York, NY.
Vila-Suero, D. (2012). Best Practices URI Construction. Available: http://www.w3.org/2011/gld/wiki/223_Best_Practices_URI_Construction. Last accessed 20th April 2012.
Tennison, J. (2010). Guest Post: A Developers' Guide to the Linked Data APIs. Available: http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison. Last accessed 20th April 2012.
Chaudhari, P. (2011, September). System Design Specification for Data.gov.sg, Infocomm Development Authority. Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11.
Chaudhari, P. (2011, September). System Design Specification for Data.gov.sg, Infocomm Development Authority – Admin Portal. Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11.
Chaudhari, P. (2011, September). System Design Specification for Data.gov.sg, Infocomm Development Authority – Database Specifications. Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11.
Chaudhari, P. (2011, September). System Design Specification for Data.gov.sg, Infocomm Development Authority – System Integration. Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11.
iDA. (2011). SG-Data Singapore Government Data Management Specifications – Data File Specifications, Version 2.2, July-11.
7.0 APPENDIX
7.1 SINGAPORE DATA ECO SYSTEM – EXTENDED ANALYSIS
7.1.1 Data transfer and storage in the data.gov.sg (DGS) system
The data and metadata are imported into the Data.Gov.Sg (DGS) data store from the SG DATA database through an XML interface every day. The spatial data from different agencies are stored in the OneMap database. The Data.Gov.Sg (DGS) and OneMap databases share the same physical location or data server. The data.gov.sg portal can access this spatial data through a web service. In the database architecture, data is not stored in distinct tables as provided by the agencies. Instead, the data is dumped at the destination into tables representing X dimensions and Y dimensions (DGS_SG_DATA_X_DIM_TD and DGS_SG_DATA_Y_DIM_TD), and the metadata id is the primary key joining the core tables; it is used to retrieve the data sets in the required formats (Chaudhari, 2011).
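A minimal sketch of how a data set could be reassembled from these dimension tables over JDBC is shown below. Only the two table names come from the design documents; the connection string, column names (METADATA_ID, DIM_VALUE, DATA_POINT) and metadata id value are hypothetical.

```java
import java.sql.*;

public class DgsJoinSketch {
    public static void main(String[] args) throws SQLException {
        // Connection string and column names are hypothetical placeholders.
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//dgs-db:1521/DGS", "user", "pwd");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT x.DIM_VALUE AS X_DIM, y.DIM_VALUE AS Y_DIM, y.DATA_POINT "
               + "FROM DGS_SG_DATA_X_DIM_TD x "
               + "JOIN DGS_SG_DATA_Y_DIM_TD y ON x.METADATA_ID = y.METADATA_ID "
               + "WHERE x.METADATA_ID = ?")) {
            ps.setString(1, "EXAMPLE_METADATA_ID"); // placeholder metadata id of one data set
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s | %s | %s%n",
                            rs.getString("X_DIM"), rs.getString("Y_DIM"), rs.getString("DATA_POINT"));
                }
            }
        }
    }
}
```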
7.1.2 Data upload and download specifications
The data in the Data.Gov.Sg (DGS) tables are extracted and transformed according to specific download specifications to output the data in a structured format. The input files from agencies also follow the same standards for upload, with defined delimiters (^^ as the starting point, | as the separator for multiple dimensions of the same level and || as the separator for multiple dimensions of different levels) (iDA, 2011).
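A minimal sketch of splitting a record that uses these delimiters is given below; the sample line is invented for illustration and does not follow the exact field layout of the SG-Data specification.

```java
import java.util.Arrays;

public class DelimiterSketch {
    public static void main(String[] args) {
        // Invented sample record: "^^" marks the start, "||" separates dimension levels,
        // "|" separates values within the same level.
        String record = "^^Planning Area|Subzone||Age Group|Type of Dwelling";

        String body = record.startsWith("^^") ? record.substring(2) : record;
        String[] levels = body.split("\\|\\|");      // dimensions of different levels
        for (String level : levels) {
            String[] values = level.split("\\|");    // dimensions of the same level
            System.out.println(Arrays.toString(values));
        }
    }
}
```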
7.1.3 Destination data forms
The Data.Gov.Sg (DGS) website publishes data for external consumption in the form of TXT, XML, CSV and XLS files. The geospatial data from DGS are published as PGDB, SHP and KML, which are structured representations of geographical coordinates. DGS also directs the user to specific URLs (websites of the data providers or agencies, HTML pages, PDFs and Singstat publications) that are mostly unstructured. Examples of Singstat publications with textual data visualizations are Singstat Time Series, the Data Visualizer and charts. Spatial data visualizations, such as the maps in Mytransport.sg, the Integrated Land Information Service or the URA site map, are other examples of the availability of government data on agency websites.
7.2 Linked Data Application Development Life Cycle (Idea to App)
[Figure content: a decision flow that starts with question conception or usage-context identification and identification of the agencies holding the required data (e.g. LTA, DOS, SLA); checks whether the required data is published and whether its licence is of an open type (if not, check with the respective agencies about opening up the data, or end the process); branches on whether development is based on the LOD API or on SPARQL; for SPARQL, checks whether the data sets are linked by meaningful relationships, and for the LOD API, whether the data in the APIs is of an interlinked nature or can be integrated at application level (with a further check on whether the application is built for internal usage); where links are missing, works with the respective agencies to create links between data sets by modifying ontologies; and finally builds the application using the Jena framework or other LOD consumer clients.]
Figure 9. Idea to Application Flowchart
7.2 LINKED DATA APPLICATIONS
Faceted browsing is another means of exploiting Linked Data. An implementation of faceted search over Wikipedia is available at http://dbpedia.neofonie.de. For example, a search on the term 'Bukit Timah' retrieves all types of results related to the term – Place, Building, Station, School and University, to name a few. The output of the search is provided in Figure 10. Faceted search differs from the normal search offered by engines such as Google and Bing, which are keyword based, in that it is based on semantics. Faceted browsing over Singapore public data could be very useful for citizens, as they might want to look up a particular entity, such as a public school, and find out its history, nearby MRT stations, achievements and other facets.
Figure 10. Faceted Search with Linked Data
We have come up with a research idea for an intelligent question-and-answer engine, called Eagle Eye, that works on top of Singapore Government Linked Data. The engine can answer natural-language questions such as "Which recent year had a growth rate close to 50% for the majority of Singapore-based SMEs?" Our UI design for Eagle Eye is shown in Figure 11.
Figure 11. An intelligent search engine idea for iDA post Linked Data migration
TrueKnowledge.com, a similar Q&A engine that works on top of Wikipedia facts, is the inspiration for this idea. The proposed working of Eagle Eye can be explained in four steps. The first step is identifying the resources (Linked Data URIs) in the question. This type of semantic information extraction service is provided by DBpedia Spotlight through REST APIs. The REST call http://spotlight.dbpedia.org/rest/annotate?text=Which recent year had a growth rate close to 50% for majority of Singapore based SME&confidence=0.1&support=10 returns the question text with the spotted resources highlighted as hyperlinks: "Which recent year had a growth rate close to 50% for majority of Singapore based SME".
The highlighted resources are those that have been spotted in DBpedia (Wikipedia's data store). Extending this service to iDA's resource index would spot URIs in the domain http://data.gov.sg. The second step is to identify the relationships between the resources. For example, the relationships that can be inferred from this example are: SME is an instance of the Organization class; the Organization class comes under the country Singapore; Growth rate is a property of the Sales class; Year is a class; and Majority is a subset of the Group class. These relationships can be deduced by running SPARQL queries with these resources as conditions in the WHERE clause. The third step involves NLP techniques such as syntactic analysis for splitting the sentence into MAIN and SUB trees; this technique is implemented using the Stanford Parser. The next NLP technique is focus extraction, for deducing the Access Pattern (AP) used to traverse the parse tree. The fourth step is the actual lookup of RDF triples that match the conditions. The implementation of this engine relies on URIs being set up for all government resources. Eagle Eye is an intelligent application that makes full use of Linked Data capability, and iDA's overall service offering would benefit greatly from applications of this type.
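A minimal sketch of the first step (resource spotting) is given below. It calls the DBpedia Spotlight annotate endpoint quoted above with the same confidence and support values and simply prints the raw JSON response; the endpoint may no longer be live at that address.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SpotlightSketch {
    public static void main(String[] args) throws Exception {
        String question = "Which recent year had a growth rate close to 50% for majority of Singapore based SME";
        String url = "http://spotlight.dbpedia.org/rest/annotate?confidence=0.1&support=10&text="
                + URLEncoder.encode(question, StandardCharsets.UTF_8.name());

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json"); // request JSON instead of the default HTML

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // spotted DBpedia resource URIs appear in the response
            }
        }
    }
}
```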
7.3 OTHER RECOMMENDATIONS
7.3.1 Kasabi for Linked Data Hosting
The Kasabi platform is cloud based and allows upload of external data sets. We did not upload any government data set, for security reasons; Geonames51, an existing data set that allows public subscription, was used as an alternative. A screenshot of the Geonames data set homepage is available in Figure 12. There is a SPARQL endpoint for creating queries, along with search APIs and data extraction APIs. iDA can use Kasabi to upload its annual data sets that have an open license; suggested examples are data sets containing location and attribute-level data about public utility services in Singapore. Some of the government data sets that can be accessed via Kasabi are listed in this link52.
Figure 12. GeoNames home page in Kasabi Platform for Linked Data hosting
51 http://kasabi.com/dataset/geonames
52 http://kasabi.com/browse/datasets/results/og_category%3A148

More Related Content

Viewers also liked

Viewers also liked (9)

Bp business and information strategy alignment
Bp   business and information strategy alignmentBp   business and information strategy alignment
Bp business and information strategy alignment
 
Knowledge Management and Risk Management Connection explained with Unilever
Knowledge Management and Risk Management Connection explained with UnileverKnowledge Management and Risk Management Connection explained with Unilever
Knowledge Management and Risk Management Connection explained with Unilever
 
Buckman lab (knowledge sharing)
Buckman lab (knowledge sharing)Buckman lab (knowledge sharing)
Buckman lab (knowledge sharing)
 
Ongc India : Growth Strategy
Ongc India : Growth Strategy Ongc India : Growth Strategy
Ongc India : Growth Strategy
 
Buckmann labs KM case study
Buckmann labs KM case studyBuckmann labs KM case study
Buckmann labs KM case study
 
Html, CSS & Web Designing
Html, CSS & Web DesigningHtml, CSS & Web Designing
Html, CSS & Web Designing
 
Web Development on Web Project Report
Web Development on Web Project ReportWeb Development on Web Project Report
Web Development on Web Project Report
 
Web Design Project Report
Web Design Project ReportWeb Design Project Report
Web Design Project Report
 
Web Development using HTML & CSS
Web Development using HTML & CSSWeb Development using HTML & CSS
Web Development using HTML & CSS
 

Similar to Semantic web design for www.data.gov.sg - Technical Report

Sybase SQL AnyWhere12
Sybase SQL AnyWhere12Sybase SQL AnyWhere12
Sybase SQL AnyWhere12
Sunny U Okoro
 
project_comp_1181_AZY
project_comp_1181_AZYproject_comp_1181_AZY
project_comp_1181_AZY
Aung Zay Ya
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
Kurt Portelli
 
Oracle forms and resports
Oracle forms and resportsOracle forms and resports
Oracle forms and resports
pawansharma1986
 
Rafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_DissertationRafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_Dissertation
Rafał Małanij
 
Online supply inventory system
Online supply inventory systemOnline supply inventory system
Online supply inventory system
rokista
 
Deployment Guide for Business Productivity Online Standard Suite: Whitepaper
Deployment Guide for Business Productivity Online Standard Suite: WhitepaperDeployment Guide for Business Productivity Online Standard Suite: Whitepaper
Deployment Guide for Business Productivity Online Standard Suite: Whitepaper
Microsoft Private Cloud
 

Similar to Semantic web design for www.data.gov.sg - Technical Report (20)

Sybase SQL AnyWhere12
Sybase SQL AnyWhere12Sybase SQL AnyWhere12
Sybase SQL AnyWhere12
 
Cognos Express
Cognos ExpressCognos Express
Cognos Express
 
Oracle 11g release 2
Oracle 11g release 2Oracle 11g release 2
Oracle 11g release 2
 
Networking Notes For DIT Part 1
Networking Notes For DIT Part 1Networking Notes For DIT Part 1
Networking Notes For DIT Part 1
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
2016 - From Internet to Robots - A Roadmap
2016 - From Internet to Robots - A Roadmap2016 - From Internet to Robots - A Roadmap
2016 - From Internet to Robots - A Roadmap
 
project_comp_1181_AZY
project_comp_1181_AZYproject_comp_1181_AZY
project_comp_1181_AZY
 
Big data technologies : A survey
Big data technologies : A survey Big data technologies : A survey
Big data technologies : A survey
 
Software Project Management: Software Architecture
Software Project Management: Software ArchitectureSoftware Project Management: Software Architecture
Software Project Management: Software Architecture
 
KurtPortelliMastersDissertation
KurtPortelliMastersDissertationKurtPortelliMastersDissertation
KurtPortelliMastersDissertation
 
R data
R dataR data
R data
 
Database Migration
Database MigrationDatabase Migration
Database Migration
 
Oracle forms and resports
Oracle forms and resportsOracle forms and resports
Oracle forms and resports
 
Project final report
Project final reportProject final report
Project final report
 
Rafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_DissertationRafal_Malanij_MSc_Dissertation
Rafal_Malanij_MSc_Dissertation
 
Fbiht71598909167
Fbiht71598909167Fbiht71598909167
Fbiht71598909167
 
Online supply inventory system
Online supply inventory systemOnline supply inventory system
Online supply inventory system
 
KHAN_FAHAD_FL14
KHAN_FAHAD_FL14KHAN_FAHAD_FL14
KHAN_FAHAD_FL14
 
Deployment Guide for Business Productivity Online Standard Suite: Whitepaper
Deployment Guide for Business Productivity Online Standard Suite: WhitepaperDeployment Guide for Business Productivity Online Standard Suite: Whitepaper
Deployment Guide for Business Productivity Online Standard Suite: Whitepaper
 
Specification of the Linked Media Layer
Specification of the Linked Media LayerSpecification of the Linked Media Layer
Specification of the Linked Media Layer
 

More from Muthu Kumaar Thangavelu

Unilever's Lipton Risk Management with Business Intelligence
Unilever's Lipton Risk Management with Business IntelligenceUnilever's Lipton Risk Management with Business Intelligence
Unilever's Lipton Risk Management with Business Intelligence
Muthu Kumaar Thangavelu
 
Load balancing implementation in wireless networks
Load balancing implementation in wireless networksLoad balancing implementation in wireless networks
Load balancing implementation in wireless networks
Muthu Kumaar Thangavelu
 
Boeing rocketdyne radical innovation case study
Boeing rocketdyne radical innovation case studyBoeing rocketdyne radical innovation case study
Boeing rocketdyne radical innovation case study
Muthu Kumaar Thangavelu
 
Caravan insurance data mining statistical analysis
Caravan insurance data mining statistical analysisCaravan insurance data mining statistical analysis
Caravan insurance data mining statistical analysis
Muthu Kumaar Thangavelu
 

More from Muthu Kumaar Thangavelu (13)

Semantic web design for www.data.gov.sg - Presentation
Semantic web design for www.data.gov.sg - PresentationSemantic web design for www.data.gov.sg - Presentation
Semantic web design for www.data.gov.sg - Presentation
 
Unilever's Lipton Risk Management with Business Intelligence
Unilever's Lipton Risk Management with Business IntelligenceUnilever's Lipton Risk Management with Business Intelligence
Unilever's Lipton Risk Management with Business Intelligence
 
Ul lipton-presentation v4
Ul lipton-presentation v4Ul lipton-presentation v4
Ul lipton-presentation v4
 
Information to Intelligence (BI Context)
Information to Intelligence (BI Context)Information to Intelligence (BI Context)
Information to Intelligence (BI Context)
 
Load balancing implementation in wireless networks
Load balancing implementation in wireless networksLoad balancing implementation in wireless networks
Load balancing implementation in wireless networks
 
Human Capital Management
Human Capital ManagementHuman Capital Management
Human Capital Management
 
Boeing rocketdyne radical innovation case study
Boeing rocketdyne radical innovation case studyBoeing rocketdyne radical innovation case study
Boeing rocketdyne radical innovation case study
 
Habits that Knowledge workers need to cultivate
Habits that Knowledge workers need to cultivateHabits that Knowledge workers need to cultivate
Habits that Knowledge workers need to cultivate
 
Knowledge process productivity indexing schema
Knowledge process productivity indexing schemaKnowledge process productivity indexing schema
Knowledge process productivity indexing schema
 
Innovation management in fashion industry
Innovation management in fashion industryInnovation management in fashion industry
Innovation management in fashion industry
 
Linked data migrational framework
Linked data migrational frameworkLinked data migrational framework
Linked data migrational framework
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Caravan insurance data mining statistical analysis
Caravan insurance data mining statistical analysisCaravan insurance data mining statistical analysis
Caravan insurance data mining statistical analysis
 

Recently uploaded

Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Chandigarh Call girls 9053900678 Call girls in Chandigarh
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
ellan12
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
@Chandigarh #call #Girls 9053900678 @Call #Girls in @Punjab 9053900678
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Sheetaleventcompany
 

Recently uploaded (20)

Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort ServiceBusty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
Busty Desi⚡Call Girls in Vasundhara Ghaziabad >༒8448380779 Escort Service
 
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Prashant Vihar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...Russian Call Girls Pune  (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
Russian Call Girls Pune (Adult Only) 8005736733 Escort Service 24x7 Cash Pay...
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
6.High Profile Call Girls In Punjab +919053900678 Punjab Call GirlHigh Profil...
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls DubaiDubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
Dubai=Desi Dubai Call Girls O525547819 Outdoor Call Girls Dubai
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 

Semantic web design for www.data.gov.sg - Technical Report

  • 1. Page 1 NANYANG TECHNOLOGICAL UNIVERSITY Wee Kim Wee School ofCommunication & Information Linked Data Migrational Framework for Singapore Government Data Sets Technical Report Prepared under the guidance of Dr. Khoo Soo Guan, Christopher (Assoc Prof) Submitted by SESAGIRI RAAMKUMAR ARAVIND (aravind002@e.ntu.edu.sg) THANGAVELU MUTHU KUMAAR (muthu1100@gmail.com) KALEESWARAN SUDARSAN (sudarsan.srcm@gmail.com)
  • 2. Page 2 Table of Contents 1.0 ABSTRACT.............................................................................................................................................................4 2.0 INTRODUCTION.....................................................................................................................................................4 3.0 METHODOLOGY....................................................................................................................................................5 3.1 Analysis.........................................................................................................................................................5 3.1.1 Artifacts Analyzed................................................................................................................................6 3.1.2 Source Data Forms ..............................................................................................................................6 3.1.3 ExistingAPI SERVICESfor Data Consumption.........................................................................................6 3.1.4 Issues identifiedwith the Existing DGS System ......................................................................................6 3.2 Design...........................................................................................................................................................8 3.3 Evaluation.....................................................................................................................................................8 4.0 MIGRATIONAL FRAMEWORK .................................................................................................................................9 4.1 STEP 1 - Specification ............................................................................................................................................9 4.1.1 SUB STEP 1 - IdentificationActivities at Program Level ..................................................................................9 4.1.2 SUB STEP 2 - Analysis Activities at Project Level ..........................................................................................12 4.1.3 Identified Issues........................................................................................................................................12 4.1.4 Recommended Tools.................................................................................................................................12 4.2 STEP 2 - Object Modeling.....................................................................................................................................12 4.2.1 Identified Issues........................................................................................................................................13 4.2.2 Recommended Tools.................................................................................................................................13 4.3 STEP 3 - Ontology Modeling.................................................................................................................................13 4.3.1 Issues.......................................................................................................................................................14 4.3.2 Recommended Tools.................................................................................................................................15 4.4 STEP 4 - URI Naming 
Strategy...............................................................................................................................15 4.4.1 Selection of URI Administration Modes ......................................................................................................15 4.4.2 URI Taxonomy ..........................................................................................................................................15 4.4.3 Identified Issues........................................................................................................................................16
  • 3. Page 3 4.4.4 Recommended Tools.................................................................................................................................16 4.5 STEP 5 - RDF Creation..........................................................................................................................................16 4.5.1 Static Files Conversion (S2R)......................................................................................................................17 4.5.2 Relational Database Conversion (D2R)........................................................................................................17 4.5.3 Application Programming Interface Conversion (A2R) .................................................................................17 4.5.4 Identified Issues........................................................................................................................................18 4.5.5 Recommended Tools.................................................................................................................................18 4.6 STEP 6 - External Linking......................................................................................................................................18 4.6.1 Identified Issues........................................................................................................................................19 4.6.2 Recommended Tools.................................................................................................................................19 4.7 STEP 7 - Datasets Publication...............................................................................................................................19 4.7.1 Triple Store...............................................................................................................................................19 4.7.2 Metadata Publication................................................................................................................................19 4.7.3 SPARQL and Linked Data API......................................................................................................................20 4.7.4 Linked Data External Hosting .....................................................................................................................20 4.7.5 Identified Issues........................................................................................................................................21 4.7.6 Recommended Tools.................................................................................................................................21 4.8 STEP 8 - Exploitation and Discovery......................................................................................................................21 4.8.1 Identified Issues........................................................................................................................................22 4.8.2 Recommended Tools.................................................................................................................................23 5.0 Conclusion..........................................................................................................................................................23 6.0 References..........................................................................................................................................................23 7.0 
Appendix............................................................................................................................................................25 7.1 Singapore Data Eco system – Extended Analysis ............................................................................................25 7.1.1 Data transfer and storage in data.gov.sg (DGS) system ........................................................................25 7.1.2 Data upload and download specifications ...........................................................................................25 7.1.3 Destination data forms ......................................................................................................................25
  • 4. Page 4 7.2 Linked Data Applications..............................................................................................................................26 7.3 Other recommendations..............................................................................................................................28 7.3.1 Kasabi for Linked Data Hosting...........................................................................................................28 1.0 ABSTRACT The subject area of this report is Linked Data and its application to the Government domain. Linked Data is an alternative method of data representation that aims to interlink data from varied sources through relationships. Governments around the world have started publishing their data in this format to assist citizens in making better use of public services. This report provides an eight step migrational framework for converting Singapore Government data from legacy systems to Linked Data format. The framework formulation is based on a study of the Singapore data ecosystem with help from Infocomm Development Authority of Singapore. Each step in the migrational framework comprises of objectives, recommendations, best practices, entry & exit points and issues. This work builds on the existing Linked Data literature, implementations in other countries and cookbooks provided by Linked Data researchers. iDA can use this report to gain an understanding of the effort and work involved in the implementation of Linked Data system on top of their legacy systems. The Appendix section of this document and two companion documents contain details about evaluation of the recommended Linked Data tools with two Singapore data sets and a tools inventory applicable to the framework. The framework is be evaluated by building a Proof of Concept (POC) application. 2.0 INTRODUCTION ‘Linked Data’ is a standard establishing a common format for describing data sets. It allows publication in the global open data cloud1 to link with other data sets, thereby forming a part of the web of data. This study focuses on the implementation aspects of such a data standard for Infocomm Development Authority (iDA)’s data.gov.sg (DGS) portal. The portal publishes the governance data provided by different ministries and agencies in Singapore. The current government data eco system has been studied and a customized Linked Data migration framework has been designed by selectively using the conversion tools and techniques in different stages of migration. The migration steps are elaborated by using population trends and urban authority’s sites data sets. The cookbooks, guidelines and recommendations provided by the Linked Data research communities were helpful in understanding the general steps and tools required in converting and publishing government data in Linked Data format. Governments that are new entrants in adopting Linked Data need a tailored data migration framework specific to their local data ecosystem. Existing case-studies recommend different tools for migration thereby creating an additional overhead of tool evaluation before implementation. Therefore, it is necessary to recommend the best tool based on evaluation in the local context. The current study aims to build a Linked Data migrational framework that could be used by iDA and Singapore Government agencies to publish their data sets in the form of Linked Data. The framework build process is based on the metadata and specifications provided by iDA and government agencies. 
The explanation of each step in the framework is given with the two data sets provided by Urban Redevelopment Authority and Department Of Statistics (mentioned in Table 1). Each step in the framework comprise of objectives, recommendations and issues identified by us based on careful evaluation. The current study focuses on linking the internal data sets. Additionally, it aims to provide recommendations on a few use-cases that leverage the utility of external Linked Data. 1 LOD cloud http://richard.cyganiak.de/2007/10/lod/
  • 5. Page 5 Data set Agency Category Data type Existing Applications Resident Population by DGP Zone/ Subzone and Age Group, Type of Dwelling, Ethnic Group2 Department of Statistics Population and Household Characteristics Textual Singstat Time Series (STS), charts on singstat themes Sites Sold by URA - Details3 Urban Redevelopment Authority (URA) Housing and Urban Planning Textual OneMap view of site location Table 1. Selected Datasets for Linked Data Conversion Study The pilot datasets have been selected to demonstrate the capability of linked data to interlink domains (Urban planning and population demographics) based on an application for business executives intending to purchase a new site listed by Urban Redevelopment Authority (URA). The datasets from Urban Redevelopment Authority (URA) and Department of Statistics (DOS) have been used to demonstrate the application. There is a possibility of missing some sources as the analysis output has not been signed off by iDA. 3.0 METHODOLOGY Figure 1: Research Methodology – Linked Data Migrational Framework Design for iDA The activities in the project were classified under three phases – Analysis, Design and Evaluation. 3.1 ANALYSIS The phase dealt with the collection of required information from the source parties iDA, SLA and NIIT through face to face meetings followed by investigation of the current public data ecosystem of Singapore Government. Singapore Government publishes its public data in a portal data.gov.sg (DGS) as a part of the eGov2015 initiative4 . Infocomm Development Authority of Singapore (iDA) 5 maintains the portal. iDA collects data from different government agencies. The data portal aims to meet Singapore public’s data needs and to establish a co-creative environment & 2 http://data.gov.sg/common/search.aspx?s=default&o=a&a=DOS&count=50&page=40 3 http://data.gov.sg/common/search.aspx?q=Sites+Sold+by+URA&a=URA&page=1 4 http://www.egov.gov.sg/egov-masterplans/egov-2015/vision-strategic-thrusts 5 http://www.ida.gov.sg/home/index.aspx
  • 6. Page 6 efficiently connect to public. The key actors identified in the data.gov.sg system are agencies providing data, data consumers (public), Infocomm Development Authority to interface & publish the data and the application developers exploiting the data with useful applications for the public. The governance data is currently provided by seven ministries and 50 agencies or statutory boards. As in most countries, Ministries in Singapore govern over agencies thereby forming a natural hierarchy. For example, Ministry of Transport governs over Public Transport Council, Land Transport Authority, Civil Aviation Authority, Maritime and Port Authority. It has been identified that hierarchies have not been considered in the data.gov.sg system’s user interface and hence becoming a downside for the data consumers in associating data from the same domain under different end points. The existence and flow of data in the current data eco system has been studied with iDA system design documents preceeded by discussions with Infocomm Development Authority (iDA), Singapore Land Authority(SLA) and iDA IT vendors NIIT technologies staff working on the data.gov.sg (DGS) and OneMap6 projects. 3.1.1 Artifacts Analyzed - Source Data forms - Data transfer and storage in data.gov.sg (DGS) system - Data upload and download specifications - Destination data forms - Existing APIs for data consumption - Selected data sets for Linked Data conversion study - Current data consumption points for the selected data 3.1.2 Source Data Forms The data provided by the agencies exists either in textual or spatial form. The textual data from different agencies are stored in SG DATA database maintained by Department of Statistics (DOS). The spatial data from agencies are maintained in OneMap database maintained by Singapore Land Authority (SLA). 3.1.3 Existing API SERVICES for Data Consumption DGS provides Application Programming Interfaces (API) to facilitate easy data consumption in customized applications for the public and developer communities. OneMap API provided by Singapore Land Authority, Traffic-related APIs from Land Transport Authority, Tourism-related APIs from the Singapore Tourism Board, Environment-related APIs from the National Environment Agency and Library-related data feeds and web services from National Library Board are the APIs provided by DGS. 3.1.4 Issues identifiedwiththe ExistingDGSSystem 1. Agencies do not provide raw data to Infocomm Development Authority. Aggregated report data from Agencies is split into two dimensions and one factual field: - X dimensions representing columns, Y dimensions representing rows and ‘data points’ field representing cells. These fields are provided in an XML file and sent to Infocomm Development Authority on a periodic basis. There is no separate master data file. The hierarchy in master data dimensions is not explicitly set or provided Therefore, a mechanism to identify the master data and the relationship between different levels in the master data dimensions needs is not available. (Refer Appendix, Section 1, Singapore Data Eco System, Data transfer and storage) 2. In data.gov.sg (DGS) database, the data is dumped into just two columns and the references are made with the field ‘Metadata id’. There is an inherent lack in meaning about the stored data and its attributes (Refer Appendix, Section 1, Singapore Data Eco System, Data transfer and storage). 
In most cases, there are issues with granularity (For example, location is expressed in streets and roads in Urban Redevelopment Authority ‘sites for sale’ dataset, whereas it is expressed as planning areas in the Department of Statistics ‘population trends’ data set). 6 http://onemap.sg/
  • 7. Page 7 3. There is a conflict of Metadata standards in Textual and Spatial data, DGS and OneMap database (for example, ‘category’ has different set of values and classification in data.gov.sg7 and OneMap8 ). 4. There is no master data management system that standardizes the master data values across agencies. Standardization is required to link common data in the data sets. 5. The current Governmental Data ecosystem of Singapore provides multiple end points to the users such as API, web services and files (Figure 2) making interlinking a complex process in application development (Refer Appendix, Section 1, Singapore Data Eco System, Destination Data forms). 6. APIs available for data consumption at end-user level, restrict the data view and inter-linking possibility by fixing the coverage area of a single API call to a single domain such as transport or health. OneMap API is an example (Refer Figure 2 of the Methodology section). 7. The download/upload specifications of the XML files sent by individual agencies have redundancy in structure therefore ending up with bigger file sizes (Refer Part VI – Appendix, 1, DGS Eco System, Data upload and download specifications). 7 http://data.gov.sg/Metadata/SGMatadata.aspx?id=0706010000000007035O&mid=26883&t=TEXTUAL 8 http://www.onemap.sg/themeexp/themeexplorer.aspx
  • 8. Page 8 Figure 2. Singapore Government Data Ecosystem (A clear and larger view of the figure is shared in Google drawing board19 ) 3.2 DESIGN The construction of the framework was done in the Design phase. The devised steps encompass every activity involved in the migration with clearly identified entry and exit points in each step. An inventory list of the Linked Data tools that can be used in each step of the framework has been prepared by collecting information from literature and other web sources. 3.3 EVALUATION The devised framework along with the recommended tools was evaluated with the pilot data sets. Three prominent RDF Conversion tools (Google Refine, RDF Views and RDF Sponger) have been used for testing the migration of the data to Linked Data format. The issues that iDA could encounter during the actual implementation was identified by this process of evaluation of the tools and steps. The Key objective is to provide a launch pad for iDA to get a first-hand view of the existing Linked Data tools in the market. The mitigation measures for the issues identified in each step have also been discussed. Evaluation criteria, usage
  • 9. Page 9 and issues encountered are summarized with appropriate screenshots. This is intended to save time for iDA during the actual implementation. Final steps after RDF creation provide less scope for evaluation in current research. There may be more issues encountered with the increase in volume of data. 4.0 MIGRATIONAL FRAMEWORK The migrational framework has been devised based on the SG Data eco system study and Linked Data literature review. iDA can use the framework in an incremental/prototype mode so as to get an early view of the final output. The following subsections represent each step explained using a Business Process perspective i.e. Input, Process and Output followed by Issues,Recommended Tools and Suggestions (as in Figure 3). Figure 3. Migrational Framework 4.1 STEP 1 - SPECIFICATION Specification is a de-facto step for a project at enterprise or government level. The nature of this step varies at the Program (iDA’s Linked Data Program) and Project (Agency Linked Data implementation). The activities in the step were decided based on the study of the system and database specification documents of the data.gov.sg (DGS) to understand the logical model connecting the tables that is used to generate data sets and interpret the required There are two sub-steps in the process: - Identification (Program level) and Analysis (Project level). 4.1.1 SUB STEP 1 - IdentificationActivitiesatProgram Level 4.1.1.1 High Level Architecture Figure 4 provides various migration options for the different type of sources. There are two types of data sources – Structured and Unstructured. Structured data is from excel, csv, text (with header) files and RDBMS. Unstructured data is
  • 10. Page 10 found in PDF, HTML and image files. We see usage of most methods from Figure 4 except CMS/RDFa, for iDA Linked Data implementation. Figure 4. High level Architecture of Linked Data (Hyland and Wood, 2011) 4.1.1.2 Data SetMigration Potential The source data sets are to be classified for deciding the order of Linked Data implementation. The classification is based on two measures 1) Utility level of data set and 2) Scope for interlinking with other data sets across domains. Based on these two measures,a data set can be classified into high, medium and low potential levels. Data set Data set URL Data Type Agency Utility Level Interlinking Possibility Potential Level Annual Vehicle Population by Type of Fuel Use URL Textual (PDF) LTA H L M Administrative Data - Employment Statistic URL Textual (HTML) MOM H M H After Death Facilities URL Spatial (KML) NEA L H H Listing of approved Building Plans URL Textual (XLS) BCA L H H Table 2. Classification of data sets based on Migration Potential 4.1.1.3 Perspective Perspective is a direction based factor for Linked Data implementation. There are two options (Vertical and Horizontal) - Agency specific implementation where specific agencies such as National Environmental Agency or Department of Statistics choose to convert their domain data. This is a vertical perspective. The second option is Application specific implementation where the data used by specific applications are converted. This is horizontal perspective as the data for
  • 11. Page 11 an application can be retrieved from different agencies based on the usage context. The recommended option is Agency specific implementation, similar to the UK Linked Data implementation. For example, Linked Data implementation could be started with SLA data. 4.1.1.4 Dataset Licensing The next factor is the assignment of licenses to data sets. This often depends on the source of the data. The license type is based on the sensitivity and confidentiality of the data. The access rights and means of data access are decided based on the license (refer Table 3): the data set types are to be categorized and the appropriate license assigned. A few of the available licenses are: 1) The Open Database License (ODbL)9 2) Open Data Commons Attribution License10 3) The Creative Commons Licenses11 4) Public Domain Dedication and License (PDDL)12 Data Set Dataset URL Data Type Parent Agency Assigned License Access Rights Data Access Modes Annual Vehicle Population by Type of Fuel Use URL Textual (PDF) LTA PDDL R API, SPARQL, RDF Dump Administrative Data - Employment Statistic URL Textual (HTML) MOM PDDL R API, SPARQL, RDF Dump After Death Facilities URL Spatial (KML) NEA ? R API, SPARQL Listing of approved Building Plans URL Textual (XLS) BCA CC R API, SPARQL, RDF Dump DGS Tables NA Textual (RDBMS) iDA NA NA API Table 3. Classification of data sets based on License For example, the first two data sets, with open-type licenses, can release data on the web directly as RDF dumps, whereas the other data sets might require restricting direct RDF access and offering only a SPARQL endpoint (a means to access data in Linked Data) or an Application Programming Interface (API). 4.1.1.5 Implementation Method The Linked Data implementation method is a very important factor as it determines the overall effectiveness of the Linked Data solution. We have identified three methods based on different usage scenarios and complexity levels. Method 1 – Linked Data only (just URI linking), the second quickest mode of implementation, enables re-use of instances across agencies. It can be used for simple publishing as there will not be many RDF triples to explain the resource. An 9 http://opendatacommons.org/licenses/odbl/ 10 http://opendatacommons.org/licenses/by/ 11 http://creativecommons.org/ 12 http://opendatacommons.org/licenses/pddl/1-0/
  • 12. Page 12 example is http://dbpedia.org/page/Jurong_Camp_I. Disadvantage is the additional work needed to be done for getting actual data about the URIs from the user and system. iDA can use this method to test the URI management process of Centralized (DGS) vs. Decentralized (Agency controlled). *Recommended* Method 2 - Linked Data + RDF,the most comprehensive use of Linked Data requires iDA to do additional work to create RDF data for each resource. An example of this type of implementation is http://dbpedia.org/page/Singapore. Therefore, it needs extensive storage. iDA can use this method to completely realize the benefits of Linked Data and Semantic Web standards. A decision to use this mode can be taken after evaluating the first prototype/POC application Method 3 – Linked Data for files (URIs for files only), the quickest mode of implementation can help in Search and Retrieval operation by offering rich metadata for files in the system. There is a lack of interlinking between data sets and the applicability of this method is just for files and it is not at the data level. iDA can use this method to improve the discovery of file based data sets in DGS website. 4.1.2 SUB STEP 2 - ANALYSIS ACTIVITIES AT PROJECT LEVEL System specifications, design & integration documents (including database) of the selected data sets are to be collected for studying Metadata, schema design and Entity Relationship (ER) models. We performed this exercise by going through the documents provided by iDA. The design issues were ascertained and overall understanding of the system was achieved. Sub-step Input Process Output Identification (Program)  Budget  Program Duration  Objectives  Set of available data sets  Data set License Assignment  Implementation method selection  Architecture Design  Implementation Perspective  Classification of data sets by Migration Potential  Project Roadmap  High Level Architecture  Implementation Perspective (Verticalor Horizontal)  Classified set of data sets by Migration Potential and License  Implementation method Analysis (Project)  Specifications  Relational models  ER models Analysis Overview of data set Design issues Table 4. Overview of Specification Step 4.1.3 IDENTIFIED ISSUES 1) The improper licensing of data sets can lead to illegal usage and leakage of sensitive and critical information (eg: crime and abortion rates in Singapore). 2) Future public data Information System projects need to be planned with the possibility of having a Linked Data downstream option. Improper planning leads to changes in architecture and affect project timelines. 4.1.4 RECOMMENDED TOOLS There are no Linked Data specific tools recommended for this step due to its strategic nature. We used tools such as MS Project, MS Visio for the planning and designing purpose. 4.2 STEP 2 - OBJECT MODELING Object Modeling is aimed at identifying the objects (elements in data sets) & their relationships in the data sets. This exercise is usually carried out with help from domain experts of the respective government agencies. Objects from the data sets (eg: Location, Building Type, Race and Area) should be drawn on a whiteboard and lines should be drawn to indicate the relationships between the objects (eg: Locations fall under Zone, population falls under demographic data
  • 13. Page 13 etc). The entire exercise is to be done with coverage in mind and not based on the anticipated future uses of applications. It is recommended to normalize the model to 3NF (Third Normal Form) before starting the exercise. We drew the conceptual view of the data set objects on the whiteboard (as in Figure 5). The key learning from this experience was the relative ease of identifying the use of common classes between the data sets (ex: the use of the Location object) and of brainstorming the relationships between objects from different data sets. Figure 5. Object Modeling done by the Project Team Sub-step Input Process Output Object Modeling (for individual data sets and between data sets)  Overview of data sets  Relational model of data sets Drawing objects and linking them on a whiteboard Conceptual view of all objects in the system with their interconnections Table 5. Overview of Object Modelling Step 4.2.1 IDENTIFIED ISSUES 1) There might be a tendency to apply high abstraction to some objects and concurrently apply high granularity to other objects. For example, middle levels in the zone hierarchy can be missed and road level data can be mistaken for classes. 2) Hastened closure of this exercise can result in poor design. The output diagram should be signed off by domain experts of the concerned government agencies. 4.2.2 RECOMMENDED TOOLS A concept map13 is an apt tool for object modelling. We recommend the use of a whiteboard for this purpose, even though electronic drawing of the objects and their relationships can also be done. 4.3 STEP 3 - ONTOLOGY MODELING The interlinking of data within a data set and between data sets is achieved by modelling abstract entities close to the real world into classes and sub-classes, with their data properties and instances. The resulting structure is called an ontology or concept vocabulary. The efficiency of these ontologies lies in reusing data across different data sets rather than recreating it, and in giving the use of data a significant degree of structure. In the current data.gov.sg (DGS) system, the data from the two pilot data sets are used in separate contexts, as Singstat chart visualizations and the Urban Redevelopment Authority (URA) site map. A common use case scenario across domains is provided below:- 13 http://cmap.ihmc.us/publications/researchpapers/theorycmaps/theoryunderlyingconceptmaps.htm
  • 14. Page 14 “Consider an industrial entrepreneur intending to buy a site from the Urban Redevelopment Authority (URA) to start his new factory. He might be interested to look at other aspects of the site, along with the data provided by URA such as plot ratio and development type, to validate his purchase considerations instantaneously – say the population trends (age group, ethnicity, living standard, employability) and even the location trends (transportation available to the site location, nearest expressway, distance to the nearest port (air and sea) for shipment).” The object models of the data sets developed in the previous step were used to study the higher level relationships within the data. In the population trends data, the following hierarchies and relationships were identified. The population demographics are examined by factors such as age, race, location, gender and employment status. The location is a region, planning area or a specific street location. The dwelling types are HDB, Condominiums, Private Flats, or Landed Property. One of the object properties of planning area is ‘has Site Location’. This property carries an inverse relationship with the property of site location, ‘under Planning Area’. The URA sites for sale data is organized based on the sites (site locations as classes). The sub-class identified is the type of development code, which in turn has a sub-class, ‘Type of development allowed’. (Refer Section 1.0 in Companion Document 1) The common field that looked inter-connectable in these data sets is the ‘Location’ field. An ontology was developed to establish the relationship between the data sets based on the Location data. The existing vocabulary Geonames14 is reused. Data held at different granularity levels is a challenge for standardization. The Urban Redevelopment Authority (URA) sites for sale data contains Location at the specific site location level only. The Department of Statistics (DOS) population trends data set contains Location at both planning area and specific site location levels. The Geonames vocabulary contains classes at the postal code level. A standardized Location vocabulary was derived from the site hierarchy provided by the Urban Redevelopment Authority (URA), mapped to both data sets, extended to postal code level and reusing the Geonames vocabulary (URIs). The country has the regions Central, West, East, North and North East. It has 12 urban planning areas (eg: Clementi, Boon Lay, Jurong West). An urban planning area has site locations such as Pioneer Road North and Woodlands Avenue. A site location has a postal code (ex: 628462 for Pioneer Road North). Networking ontologies and extending the relationships across domains improves machine reasoning and intuitive information retrieval from the user’s perspective, as seen above. Sub-step Input Process Output Ontology Modeling (for individual data sets and between data sets)  Individual data sets (URA and DOS data sets)  Object models developed in the previous step Creating explicit relationships with classes, sub classes, object and data properties on the domain specific information (Refer Appendix, Section 5, Figures 10, 11 & 12). Ontology linking urban planning and population domain data sets (Refer Appendix, Section 5, Figure 13) Table 6. Overview of Ontology Modelling Step 4.3.1 ISSUES 1) Resolving the conflicting vocabulary and metadata standards in data.gov.sg (DGS) and OneMap, as explained in the issues with the existing system, could be a challenge. 2) Different levels of granularity exist in the data sets.
For example, Location is expressed in different forms in different instances – urban planning area, specific site location or postal code. Building a common location vocabulary that networks these different granularities allows applications to drill down or roll up the location information chain uniformly, with a single version of truth (a minimal sketch of such a vocabulary is given below). 14 http://www.geonames.org/ontology/documentation.html
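To make the location hierarchy above concrete, the following is a minimal, illustrative sketch (not the project's actual ontology) of how the Region → Planning Area → Site Location hierarchy and the ‘has Site Location’/‘under Planning Area’ inverse properties could be expressed with the Python rdflib library. The http://data.gov.sg/ontology/ namespace and the class/property names are assumptions made for illustration; the real vocabulary would be agreed with the agencies and aligned with Geonames.

# Illustrative sketch only: a minimal location vocabulary for the pilot data sets.
# The http://data.gov.sg/ontology/ URIs are assumptions, not an official namespace.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

DGS = Namespace("http://data.gov.sg/ontology/")        # assumed ontology namespace
GN = Namespace("http://www.geonames.org/ontology#")    # GeoNames ontology (reused)

g = Graph()
g.bind("dgs", DGS)
g.bind("gn", GN)

# Class hierarchy: Region -> PlanningArea -> SiteLocation
for cls in (DGS.Region, DGS.PlanningArea, DGS.SiteLocation):
    g.add((cls, RDF.type, OWL.Class))
g.add((DGS.SiteLocation, RDFS.subClassOf, GN.Feature))   # reuse GeoNames' Feature class

# Object properties with the inverse relationship described in the text
g.add((DGS.hasSiteLocation, RDF.type, OWL.ObjectProperty))
g.add((DGS.hasSiteLocation, RDFS.domain, DGS.PlanningArea))
g.add((DGS.hasSiteLocation, RDFS.range, DGS.SiteLocation))
g.add((DGS.underPlanningArea, RDF.type, OWL.ObjectProperty))
g.add((DGS.underPlanningArea, OWL.inverseOf, DGS.hasSiteLocation))

# Data property for the postal code attached to a site location
g.add((DGS.postalCode, RDF.type, OWL.DatatypeProperty))
g.add((DGS.postalCode, RDFS.domain, DGS.SiteLocation))

print(g.serialize(format="turtle"))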
  • 15. Page 15 4.3.2 RECOMMENDED TOOLS The NeOn Toolkit is recommended for the design, storage, version maintenance and deployment of large ontologies (untested by us). It has an option to import standard reusable vocabularies automatically and is best suited at enterprise level for its ease of use in interfacing vocabularies and visualizing the ontology graphs, with defined links and hierarchies, in a single shot as a Flash video or image. Protégé, an open source tool, can be used for prototyping. 4.4 STEP 4 - URI NAMING STRATEGY Uniform Resource Identifiers (URIs) are the building blocks of RDF. URIs provide names for the resources in the system. The aspects/factors related to this step are provided below. 4.4.1 SELECTION OF URI ADMINISTRATION MODES We have identified three URI administration modes. They are provided below:- 1) *Recommended* Maintained centrally in the DGS platform (resultant URIs will start with http://data.gov.sg/) 2) Maintained by individual agencies (resultant URIs will start with http://ura.gov.sg or http://sla.gov.sg). 3) Maintained externally by third party platforms such as Kasabi (resultant URIs will start with http://data.kasabi.org) – no longer valid as the Kasabi service has been shut down. 4.4.2 URI TAXONOMY A key decision area is the taxonomy design of the URIs, i.e. the structure of the URIs. Most government Linked Data systems use the TBOX and ABOX approach for differentiating classes and their instances. TBOX refers to the ontology URIs (http://data.gov.sg/ontology/) and ABOX refers to the resource URIs (http://data.gov.sg/resource/). We are of the view that the URI minting process would be a challenge for iDA if it goes for a piecemeal approach, i.e. a scattered implementation of Linked Data. URI construction should be done as a holistic process with good foresight on the future prospects of the system. iDA can arrange a workshop with all government agencies to educate them and look for ideas pertaining to the representations that the agencies would want. The output of the workshop would be the taxonomy of the URI structure. We provide some recommended URI patterns for implementation in Table 7 (an illustrative minting sketch based on these patterns follows Table 8 below).
TBOX (ontology URIs) – ABOX (instance URIs):
http://data.gov.sg/ontology/Ministry/ – http://data.gov.sg/ministry/MOH
http://data.gov.sg/ontology/Agency/ – http://data.gov.sg/agency/SLA
http://data.gov.sg/ontology/SiteLocation – http://data.gov.sg/location/pioneer_road_north
http://data.gov.sg/ontology/Race – http://data.gov.sg/race/chinese
Dataset URIs:
Dataset ID – URAstaticfile001
Dataset – http://data.gov.sg/dataset/URAstaticfile001/
Class – http://data.gov.sg/terms/class/URAstaticfile001/sitesforsale
Property – http://data.gov.sg/terms/property/URAstaticfile001/time
Row 1 – http://data.gov.sg/dataset/URAstaticfile001/1
Row 1 - A generic column – http://data.gov.sg/dataset/URAstaticfile001/1/columnName
Table 7. URI Patterns for Singapore Data
Sub-step Input Process Output URI Naming  List of resources  List of Classes and Properties Visualization of URI minting process from Linked Data tools and usage perspective  URI Administration method  Base URI  Taxonomy of URI structure Table 8. Overview of URI Naming Step
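As an illustration of the taxonomy discussion above, the short sketch below shows one way instance (ABOX) and ontology (TBOX) URIs following the patterns of Table 7 could be minted consistently in code. The base URI, slug rules and helper names are assumptions for illustration only, not an agreed iDA standard.

# Illustrative URI-minting helpers following the patterns in Table 7.
# The base URI and slug conventions are assumptions, not an agreed iDA standard.
from urllib.parse import quote

BASE = "http://data.gov.sg"   # assumed base URI (centralised DGS administration)

def slug(label):
    """Lower-case, underscore-separated, percent-encoded path segment."""
    return quote(label.strip().lower().replace(" ", "_"))

def ontology_uri(class_name):
    # TBOX: classes live under /ontology/, e.g. http://data.gov.sg/ontology/SiteLocation
    return "{}/ontology/{}".format(BASE, class_name)

def instance_uri(collection, label):
    # ABOX: instances live under a collection path, e.g. http://data.gov.sg/location/pioneer_road_north
    return "{}/{}/{}".format(BASE, slug(collection), slug(label))

def dataset_row_uri(dataset_id, row, column=None):
    # Data set rows and columns, e.g. http://data.gov.sg/dataset/URAstaticfile001/1/columnName
    uri = "{}/dataset/{}/{}".format(BASE, dataset_id, row)
    return "{}/{}".format(uri, slug(column)) if column else uri

print(ontology_uri("SiteLocation"))                      # http://data.gov.sg/ontology/SiteLocation
print(instance_uri("location", "Pioneer Road North"))    # http://data.gov.sg/location/pioneer_road_north
print(dataset_row_uri("URAstaticfile001", 1, "time"))    # http://data.gov.sg/dataset/URAstaticfile001/1/time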
  • 16. Page 16 4.4.3 IDENTIFIED ISSUES 1) Usage of different tools can hamper the URI naming process if the selected RDF conversion tools do not support custom coding of URIs. 2) Dead links can affect applications that use the URIs. This can be monitored by using Vapour15, a URI validation tool recommended by the Pedantic Web Group16 (a simple illustrative check is also sketched below, after Table 9). 4.4.4 RECOMMENDED TOOLS The URI naming strategies of (Hendler, 2012)17, (Vila-Suero, 2012)18 and (Davidson, 2009)19 provide good guidelines. Pubby20 is a Linked Data tool that can be used for external URI dereferencing. 4.5 STEP 5 - RDF CREATION RDF triples are the building blocks of Linked Data systems. This step concerns the conversion of data from the source system format to Linked Data format. We have split the conversion process into three categories based on the data source type: Static files conversion (S2R), Relational Database Management System (RDBMS) data conversion (D2R) and API data conversion (A2R). Examples of the different source formats are given in Table 9 below and Figure 6. Type Example of Singapore data sets Source format S2R URA Sites for Sale, Singstat’s Population Household Characteristics XLS, CSV, TXT files and other static files D2R DGS tables RDBMS A2R OneMap API, myTransport API, NLB web services Application Programming Interface (API) and Web Services (SOAP, REST) Table 9. Conversion Types with Examples 15 http://validator.linkeddata.org/vapour 16 http://pedantic-web.org 17 http://logd.tw.rpi.edu/instance-hub-uri-design 18 http://www.w3.org/2011/gld/wiki/223_Best_Practices_URI_Construction 19 http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector 20 http://www4.wiwiss.fu-berlin.de/pubby/
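Vapour is the recommended validator for this; purely to illustrate the kind of routine dead-link monitoring described in issue 2 above, a very simple periodic dereference check could look like the sketch below. The first URI is hypothetical and the content-negotiation details are assumptions; this is a lightweight complement to Vapour, not a substitute.

# Illustrative dead-link check for minted URIs (a lightweight complement to Vapour).
# The first URI below is hypothetical; the check logic is a simplification.
import urllib.request

CANDIDATE_URIS = [
    "http://data.gov.sg/location/pioneer_road_north",    # hypothetical, not yet published
    "http://dbpedia.org/resource/Singapore",              # public Linked Data resource
]

def dereferences(uri, timeout=10):
    """Return True if the URI answers an RDF content-negotiation request with HTTP 2xx/3xx."""
    req = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml, text/turtle"}, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False

for uri in CANDIDATE_URIS:
    print(uri, "OK" if dereferences(uri) else "DEAD/UNREACHABLE")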
  • 17. Page 17 Figure 6. RDF Conversion Types and Endpoints for Singapore Public Data 4.5.1 STATIC FILES CONVERSION (S2R) Government agencies predominantly publish their data (about 80% of total government data sets) in static file format (ex: Excel spreadsheets) with internal representation in the form of tables or crosstabs. These files are to be loaded into Google Refine21 for transforming fields in the data set to RDF triples. Google Refine performs three core functions and one optional function a.) Correction of errors in data sets b) Creation of new fields c) Mapping fields to Ontologies d) Reconciliation with public vocabularies and data sets (optional). We used the URASiteSales excel file for evaluating Google Refine. Details of the transformations implemented, screenshots and output files are provided in Section 2.0 Companion Document 1 4.5.2 RELATIONAL DATABASE CONVERSION (D2R) We recommend the usage of STDTrip methodology22 devised by the Brazil Open Data team as the first step in the D2R conversion process. STDTrip gives six distinct steps that aid in mapping RDBMS model to Ontologies/Vocabularies. The crux of this process is to use ER diagram of DGS data model to get the corresponding classes and their properties. This process has some of its footing in Ontology Modelling step. Virtuoso Universal Server’s RDF Views23 is recommended for D2R conversion. The mapping files generated from the STDTrip process is to be converted in RDF Views syntax. The RDF conversion can be done on a daily basis (after the refresh of DGS base tables) or on need basis. We created a table directly in Universal Server for evaluating RDF Views and its mapping features. Execution steps, screenshots of the RDF Views tool are provided in Section 3.0 Companion Document 1 4.5.3 APPLICATION PROGRAMMING INTERFACE CONVERSION (A2R) RDF Sponger24 (another feature of Virtuoso Universal Server) was evaluated for A2R conversion. The A2R conversion is real-time in nature. API services provided by SLA (OneMap API), LTA (myTransport API) and NLB (web services) can be authorized in the Universal Server with authentication keys. This is followed by creation of external cartridges (mapping files) for extracting and mapping the contents of API output to corresponding classes and properties in the vocabularies used for the data sets. Most of the APIs provide data in the form of JSON 25 (OneMap) or XML (MyTransport) or plain text format. Separate cartridges are to be created for each of these content types. A sample OneMap API was used to test this functionally. Details of the OneMap API used and screenshot from RDF Sponger have been provided in Section 4.0 Companion Document 1. Sub-step Input Process Output STDTrip Process  Relational Model  ER Model  Identified Vocabularies Seven step STDTrip process for generating mapping files Mapping and Schema files S2R  Spreadsheets(xls, csv)  Other text formats  Cleansing  Transformation RDF Triples D2R  SQL of tables Mapping files Conversion from table format to RDF RDF Triples A2R  API URLs  Mapping files  Identification of API content type  Conversion to RDF RDF Triples Table 10. Overview of RDF Creation Step 21 http://code.google.com/p/google-refine 22 STDTrip Methodology http://www.inf.puc-rio.br/~casanova/Publications/Papers/2010-Papers/2010-SBBD-StdTrip.pdf 23 http://virtuoso.openlinksw.com/whitepapers/relational%20rdf%20views%20mapping.html 24 http://virtuoso.openlinksw.com/presentations/Virtuoso_Sponger_1/Virtuoso_Sponger_1.html 25 http://www.json.org/
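Google Refine (with its RDF extension) is the recommended S2R tool; to show conceptually what the S2R mapping summarized in Table 10 produces, here is a minimal hand-rolled sketch that turns rows of a small, hypothetical URA sites-for-sale CSV into RDF triples using the Python rdflib library. The column names, class/property names and URIs are assumptions for illustration, not the actual DGS schema.

# Illustrative S2R conversion: CSV rows -> RDF triples (what an RDF mapping in
# Google Refine produces, done by hand for clarity). Columns and URIs are assumed.
import csv, io
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

DGS = Namespace("http://data.gov.sg/ontology/")              # assumed TBOX namespace
ROW_BASE = "http://data.gov.sg/dataset/URAstaticfile001/"    # assumed ABOX base for rows

sample_csv = io.StringIO(
    "site_location,development_type,plot_ratio\n"
    "Pioneer Road North,Industrial,2.5\n"
)

g = Graph()
g.bind("dgs", DGS)

for i, row in enumerate(csv.DictReader(sample_csv), start=1):
    subject = URIRef("{}{}".format(ROW_BASE, i))
    g.add((subject, RDF.type, DGS.SiteForSale))
    g.add((subject, DGS.siteLocation, Literal(row["site_location"])))
    g.add((subject, DGS.developmentType, Literal(row["development_type"])))
    g.add((subject, DGS.plotRatio, Literal(row["plot_ratio"], datatype=XSD.decimal)))

print(g.serialize(format="turtle"))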
  • 18. Page 18 4.5.4 IDENTIFIED ISSUES 1) Issues in this step are mainly related to data quality, versioning, source system unavailability, URI management and non-adherence to mapping rules. The existing Linked Data literature on handling updates to existing data is not extensive, and iDA would be best advised to devise a strategy that tests the prototype with multiple real-time updates (eg: traffic related data of LTA). 2) The real-time nature of A2R makes it completely dependent on source APIs. Absence of intimation about API outages can cause the system to return null or invalid results (if caching is enabled). 3) Google Refine doesn’t dereference the URI that it creates for each row in the source file. Dereferencing is to be done externally; Pubby26 provides this functionality as a solution. 4) Changes to SQL queries, updates to DB tables and changes to API output are to be indicated to the appropriate personnel before implementation. Mapping files are supposed to be updated before the aforementioned changes go into production. 4.5.5 RECOMMENDED TOOLS The complete list of tools for the RDF creation step is available in Section 2.2 of Companion Document 2. Selection of the tool is largely dependent on its inter-operability features. We recommend Virtuoso Universal Server for its completeness for both D2R & A2R, and Google Refine for the S2R process because of its support for the complex custom scripting needed while transforming data from its source format to the required format (eg: the multidimensional nature of statistics related data provided by the Department of Statistics). 4.6 STEP 6 - EXTERNAL LINKING Linked Data is known for its prescription of four key principles (http://www.w3.org/DesignIssues/LinkedData.html). Even though the adoption of the fourth principle is optional for governments and enterprises, the web of data vision can be brought forth only by linking data sets across the globe (the web of data depicted by the LOD cloud27). The current DGS system does not make use of external data or provide private firms’ data. Linked Data applications would benefit greatly from the existence of links between government data sets and popular data hubs such as DBpedia (Wikipedia) and Freebase. We have identified some scenarios of linking Singapore data sets with external data sets from an application development perspective:- 1) Wikipedia for use of textual content pertaining to places, events and entities in Singapore 2) Geonames28 for use of spatial data 3) Flickr wrapper29 for use of openly licensed images pertaining to places and events 4) World Bank30 and CIA Factbook31 data sets to facilitate economic analysis of Singapore and other nations on common dimensions 5) UN data sets such as the FAO32 data sets for comparing import and export of commodities in Singapore 6) The Supreme Court database (SCDB) with DBpedia and news agencies such as the Los Angeles Times and the New York Times for analyzing the decisions made33. We recommend that data set owners look at the open government data catalogue34 and the Datahub35 to identify useful data sets that could be linked to SG data sets. The identified data sets can be compared with the 26 http://www4.wiwiss.fu-berlin.de/pubby/ 27 http://richard.cyganiak.de/2007/10/lod/ 28 http://linkedgeodata.org/ 29 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ 30 http://worldbank.270a.info/ 31 http://www4.wiwiss.fu-berlin.de/factbook/ 32 http://data.fao.org/ 33 http://www.youtube.com/watch?v=hawT84z0U30 34 http://opengovernmentdata.org/data/catalogues/
  • 19. Page 19 government data sets using tools such as SILK36 and LIMES37 that automate the linking process between data sets. The linking is not contextual; instead, it is made between semantically equivalent resources (using the OWL property owl:sameAs). Most of the links will be made to the data sets of DBpedia and Geonames, as these services are regarded as good hubs of general factual and spatial data respectively. An example is provided below. <http://data.gov.sg/location/bugis> <owl:sameAs> <http://www.dbpedia.org/resource/Bugis> <http://data.gov.sg/race/chinese> <owl:sameAs> <http://www.dbpedia.org/resource/Chinese_Singaporeans> Sub-step Input Process Output External Linking  Demographic and Spatial data sets (provided by DOS and SLA)  Identified external data sets (DBpedia, Geonames, CIA Factbook) Linking suggestions provided based on Similarity algorithms Outbound links to external data sets Table 11. Overview of External Linking Step 4.6.1 IDENTIFIED ISSUES 1) The outbound links made to data sets outside of iDA’s purview can be risky if the external source’s data credibility is brought into question (ex: Wikipedia is built on user content and can be openly edited). 2) Dead links can arise during the change of resource URIs or system downtime. An alternative solution is to import the RDF dump from the external data sets for local storage and use. 4.6.2 RECOMMENDED TOOLS SILK36 and LIMES37 are industry standard tools that can automate the process of finding links based on similarity measures. We recommend SILK based on its wide usage. 4.7 STEP 7 - DATASETS PUBLICATION Datasets publication happens at the end of the RDF creation step. The elements related to this step are described below. 4.7.1 TRIPLE STORE The generated RDF triples are stored in a Linked Data specific data structure called a Triple Store or RDF Store. Storage of triples can be centralized at the DGS server or decentralized at the servers of individual agencies. Centralized storage is recommended to avoid different versions of the truth. Figure 6 depicts a common triple store for storing the triples from the three conversion processes. 4.7.2 METADATA PUBLICATION Metadata complements the actual data by providing semantics about the fields and the data set. Therefore, metadata publication is needed along with actual data publication. The literature recommends the use of the VoID38 vocabulary for this purpose, since it covers general metadata, access privileges metadata and external links metadata. The triples applicable for the pilot data set are given in Section 5.0 of Companion Document 1 (an illustrative VoID description is also sketched below). Metadata publication is a key differentiator of Linked Data in comparison to traditional information systems, as the triples are created with the purpose of reducing the learning time for the developer, thereby accelerating the application development process. 35 http://thedatahub.org/ 36 http://www4.wiwiss.fu-berlin.de/bizer/silk/ 37 http://limes.sourceforge.net/ 38 http://vocab.deri.ie/void
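To illustrate the VoID-based metadata publication described in 4.7.2, the sketch below declares a hypothetical VoID description of the URA pilot data set with rdflib. The endpoint, dump URL and triple count are placeholders, not real figures from the project.

# Illustrative VoID description of a pilot data set. All figures and URLs are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

VOID = Namespace("http://rdfs.org/ns/void#")
DS = URIRef("http://data.gov.sg/dataset/URAstaticfile001")   # assumed data set URI

g = Graph()
g.bind("void", VOID)
g.bind("dcterms", DCTERMS)

g.add((DS, RDF.type, VOID.Dataset))
g.add((DS, DCTERMS.title, Literal("URA Sites for Sale (pilot)")))
g.add((DS, DCTERMS.publisher, Literal("Urban Redevelopment Authority")))
g.add((DS, VOID.sparqlEndpoint, URIRef("http://data.gov.sg/sparql")))          # placeholder
g.add((DS, VOID.dataDump, URIRef("http://data.gov.sg/dumps/ura-sites.ttl")))   # placeholder
g.add((DS, VOID.triples, Literal(12345, datatype=XSD.integer)))                # placeholder count
g.add((DS, VOID.vocabulary, URIRef("http://www.geonames.org/ontology#")))

print(g.serialize(format="turtle"))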
  • 20. Page 20 4.7.3 SPARQL AND LINKED DATA API SPARQL39 is to Linked Data what SQL is to relational database based information systems. RDF data can be queried only through SPARQL. The biggest advantage of SPARQL is the CONSTRUCT query, which can generate new triples based on the characteristic of inference. For example, a set of triples can be generated between Regions (Central, North) and ‘Cumulative Site Tender Prices’ based on the common existence of the ‘Location’ resource/field in two data sets. SPARQL may not find acceptance right away from developers having experience in using SQL and REST APIs. Therefore, an API wrapper called the Linked Data API (LDA)40 can be stationed on top of the SPARQL endpoint as a suitable means of abstraction. The Linked Data REST API translates the API call to a SPARQL query, retrieves the resultant data from SPARQL and then translates the data to an appropriate output format such as JSON, XML, HTML or TEXT. Data extraction from Data.gov.uk is based on this concept. The URL http://gov.tso.co.uk/transport/api/transport/doc/bus-stop-point.xml retrieves a list of bus stops in the UK in XML format. A sequence based illustration of the Linked Data API in the DGS context is given in Figure 7 (a small code sketch of this translation step is also given after this section). The use of LDA brings obvious advantages as it simplifies data access and provides convenient output formats. This also means that the rules for conversion from API call to SPARQL are to be created. We foresee considerable time and effort to be invested in this activity, as even a small change in API parameters would need a different SPARQL query. The use of named graphs41 is critical for iDA as updates to existing RDF triples are made in the triple store. A DML query can have unexpected results if the count of triples that are to be updated is unknown. Therefore, the use of named graphs is suggested. API based updates and deletions can be the preferred mode, avoiding the complex nature of SPARQL queries. Figure 7. Linked Data API workflow for iDA (an LDA call such as http://data.gov.sg/lda/childcare/north is mapped, via an LDA-SPARQL mapping file, to the SPARQL query SELECT ?cc WHERE { ?cc dgs:haszone dgs:north. ?cc dgs:facilitytype dgs:childcare. } LIMIT 100 against the triple store; the resulting RDF triples, e.g. http://data.gov.sg/facility/cc/name1 … http://data.gov.sg/facility/cc/name100, are then converted to JSON output entries.) 4.7.4 LINKED DATA EXTERNAL HOSTING Linked Data hosting is another option that can be considered by iDA if it aims to use third party platforms for data storage. Hosting in the cloud brings cost benefits, abstracts the complexity of maintaining servers & installing software and provides the capability for instantly creating simple REST42 based APIs for data insertion, update, deletion & retrieval activities. We evaluated the Kasabi43 platform (powered by Talis44) and found the interface to be intuitive. Screenshots are provided in Appendix Section 7. The LOGD TWC portal (http://logd.tw.rpi.edu/) is maintained by Rensselaer Polytechnic Institute and provides a Linked Data conversion service for US government data sets. iDA can consider this option based on the offerings from the hosting vendor. UK Linked Data also makes use of the Talis platform.
Important Note: Kasabi service has been shut down as of July 2012 39 http://www.w3.org/TR/rdf-sparql-query/ 40 http://code.google.com/p/linked-data-api/ 41 http://www.w3.org/2004/03/trix/ 42 http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm 43 http://kasabi.com/ 44 http://www.talis.com/
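Referring back to the Linked Data API workflow in Figure 7, the sketch below mimics the LDA translation step over a tiny in-memory graph: a REST-style path is mapped to the SPARQL query shown in the figure and the matching resources are returned as JSON. The namespace, path layout and mapping rule are assumptions for illustration only, not the behaviour of Puelia, Elda or the DGS platform.

# Illustrative sketch of the translation step in Figure 7: map a REST-style path to
# the figure's SPARQL query and return matches as JSON. All names are assumptions.
import json
from rdflib import Graph, Namespace, URIRef

DGS = Namespace("http://data.gov.sg/ontology/")       # assumed namespace
FACILITY_BASE = "http://data.gov.sg/facility/cc/"     # assumed instance base

# A tiny in-memory graph standing in for the real triple store.
g = Graph()
g.bind("dgs", DGS)
for name in ("name1", "name2"):
    cc = URIRef(FACILITY_BASE + name)
    g.add((cc, DGS.haszone, DGS.north))
    g.add((cc, DGS.facilitytype, DGS.childcare))

def lda_call(path):
    """Map a path like /lda/childcare/north to the SPARQL query of Figure 7 and return JSON."""
    _, facility, zone = path.strip("/").split("/")     # naive mapping rule (assumed)
    query = """
        PREFIX dgs: <http://data.gov.sg/ontology/>
        SELECT ?cc WHERE {{
            ?cc dgs:haszone dgs:{zone} .
            ?cc dgs:facilitytype dgs:{facility} .
        }} LIMIT 100""".format(zone=zone, facility=facility)
    results = [str(row.cc) for row in g.query(query)]
    return json.dumps({"entries": results}, indent=2)

print(lda_call("/lda/childcare/north"))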
  • 21. Page 21 Sub-step Input Process Output Actual data publishing  S2R RDF triples  D2R RDF triples  A2R RDF triples Data insertion through command line or in batch Data storage in triple store Metadata publishing  Vocabulary classes  Key Statistics from data set Modeling in VOID VOID triples SPARQL  SPARQL queries Data retrieval or graph construction Resultant RDF triples Linked Data API  REST based API Calls  Conversion of API call to corresponding SPARQL query  Conversion of resultant RDF triples to output format(ex: JSON) JSON/XML/HTML/TXT data Table 12. Overview of Dataset Publication step 4.7.5 IDENTIFIED ISSUES 1) SPARQL does not currently support sub-queries and many other database operations that application developer make use of. The only alternative is to build complex SPARQL queries that do the same operation. This could lead to a long- winded process of updating queries for even small change requests. A possible solution can be SPARQL query creation/update template (in excel) that could aid the developers in accelerating the query creation process. 2) Linked Data API with its simplicity actually eliminates the biggest advantage that Linked Data provides - inferencing through SPARQL CONSTRUCT query. Therefore, we recommend that a public SPARQL endpoint be maintained, to facilitate innovative querying. 3) A security lapse or an unknown exploit in Linked Data hosting environment can lead to serious implications if Linked Data is not open for public use. This risk always remains with external hosting. Evaluation of Vendor’s backup measures is important before agreeing on the contract. 4.7.6 RECOMMENDED TOOLS Virtuoso Universal Server is recommended for data publication and SPARQL endpoint. Puelia45 , PHP based Linked Data API library is recommended for API. If iDA plans to use Java for Linked Data API, Elda46 can be considered. Talis platform is recommended for Linked Data hosting. Apache Jena Framework47 (initially developed by HP Labs) provides Java based APIs for dealing with most of the elements in the Semantic Web stack (RDF, Ontology and SPARQL). We used the SPARQL API for querying publicly available SPARQL endpoints. The framework is secondarily recommended. The java files and usage steps are available in Section 6 of Companion Document 1. 4.8 STEP 8 - EXPLOITATION AND DISCOVERY iDA’s implementation of Linked Data over its existing data platform would be considered successful based on the factors such as progressive availability of new applications created using multi-agency data sets, number of API calls and number of new ideas implemented as applications. A government service provider would want the general public and enterprises to exploit its service offering and the same applies for its Linked Data offering. iDA had started the ideas4apps initiative 45 http://code.google.com/p/puelia-php/ 46 http://elda.googlecode.com/hg/deliver-elda/src/main/docs/index.html 47 http://incubator.apache.org/jena/
  • 22. Page 22 to look for new ideas that could be implemented as webapps or mobile apps on top of government data. Such initiatives will have more impact with the data sets released in Linked Data format. The two types of discovery are as follows:- 1.) Internal discovery within Singapore for local citizens 2.) External discovery for attracting usage of Singapore government data in international economic and political research Sub-step Input Process Output Exploitation & Discovery  Linked Data  Linked Data Existing Applications  Gamification  Crowdsourcing  Converting Ideas to Apps (Related idea Refer Appendix, Section 2, Figure 7)  Registering in public catalogues  New Applications(ex: Eagle Eye) Refer Appendix, Section 3, Figure 9  External reference from LOD cloud Table 13. Overview of Exploitation & Discovery Step Figure 8 is representative of the final end product of POC. The middle box containing the site details for Pasir Ris is from the parent data set of URA. The box in the right contains demographic details of Pasir Ris, retrieved from the DOS data sets and the box in the left contains area facilities available in Pasir Ris, provided by SLA through OneMap API. The display of these two boxes with data coming from different agencies is made possible by using a common URI for Pasir Ris and by re-using this URI across three different data sources. Figure 8. Post Linked Data Migration – A snap shot of an application (After interlinking urban planning and population domains) Another application prototype designed for Singapore Data sets is listedin Appendix, Section 7.2 4.8.1 IDENTIFIED ISSUES There are no technical issues related to this step as it is majorly related to publicity of the Linked Data system supported by iDA. Technical flaws, data quality issues and application level misrepresentations are the possible issues that could cascade from the earlier steps. Faceted browsing of Linked Data can cause serious performance issues as data retrieval is quite huge for this purpose. Public availability of this service can be restricted even though it is one of the best ways of slicing and dicing data for getting to the required point of satisfaction in knowledge discovery.
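One common way to keep the data volumes of faceted retrieval in check is to page SPARQL results with LIMIT and OFFSET rather than pulling everything at once. The sketch below, which uses the SPARQLWrapper library against DBpedia's public endpoint purely as an example, illustrates the pattern; a DGS endpoint is hypothetical at this stage and the query itself is not part of the framework.

# Illustrative paging of SPARQL results with LIMIT/OFFSET to bound retrieval cost.
# DBpedia's public endpoint is used only as an example of the pattern.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"
PAGE_SIZE = 50

def fetch_page(offset):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?place WHERE {{
            ?place a dbo:Place ;
                   dbo:country dbr:Singapore .
        }} LIMIT {limit} OFFSET {offset}""".format(limit=PAGE_SIZE, offset=offset))
    data = sparql.query().convert()
    return [b["place"]["value"] for b in data["results"]["bindings"]]

# Fetch the first two pages of Singapore places.
for page in range(2):
    for uri in fetch_page(page * PAGE_SIZE):
        print(uri)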
  • 23. Page 23 4.8.2 RECOMMENDED TOOLS Linked Data browsers Sig.ma48 , Tabulator49 and OpenLink Data Explorer50 can be used over Singapore data sets for faceted browsing. 5.0 CONCLUSION This report’s main contribution is its applicability to the Singapore Government data. iDA can use this report as a reference to plan their strategy for Linked Data implementation that seems imminent in the near future. The report shows the method of interlinking two data sets sourced by two different government agencies along with the steps involved in using the recommended tools (Google Refine, RDF Views and RDF Sponger) with these data sets. The identified issues and recommendations in each step will be highly useful during iDA’s actual implementation. The work is incomplete in the view that a POC application was not built using the framework. Therefore, we foresee the possibility of unidentified data maintenance issues propping up during implementation. The experience of learning about new technologies, new methods of data representation, working through the case studies of other countries and contemplating on selection of best tools, was enriching for us 6.0 REFERENCES Bauer,F., & Kaltenbock, M. (Eds.). (2012). Linked Open Data: The Essentials: A Quick Start Guide for Decision Makers. edition mono/monochrom. Retrieved from http://amazon.de/o/ASIN/3902796057/ Berners-Lee,T.,Hendler, J., & Lassila, O. (2001). THE SEMANTIC WEB. Scientific American,284(5), 34 Berners-Lee, T. (2006). Linked Data. Available: http://www.w3.org/DesignIssues/LinkedData.html. Last accessed 11th Jan 2012 Bizer , C., Heath, T., Idehen, K., & Berners-Lee, T. (2008). Linked Data: Evolving the Web into a Global Data Space. (J. Hendler & F. Van Harmelen, Eds.)Proceeding of the 17th international conference on World Wide Web WWW 08 (Vol. 1, p. 1265). ACM Press. Blumauer, S. (2012). Open Data for Enterprises. Available: http://www.slideshare.net/ABLVienna/open-data-for- enterprises-12126124. Last accessed 20th April 2012 Chee Hean, T. (2011). Keynote Address by Mr Teo Chee Hean, Deputy Prime Minister, Coordinating Minister for National Security and Minister for Home Affairs at the e-Gov Global Exchange 2011. Available: http://www.ida.gov.sg/News%20and%20Events/20110620114104.aspx?getPagetype=21. Last accessed 11th Jan 2012 Davidson, P. (2009). Designing URI Sets for the UK Public Sector.Available: http://www.cabinetoffice.gov.uk/resource- library/designing-uri-sets-uk-public-sector. Last accessed 20th April 2012. DiFranzo, D.,Graves, A., Erickson, J. S., Ding, L., Michaelis, J., Lebo, T., Patton,E., et al. (2011). The Web is My Back- end: Creating Mashups with Linked Open Government Data. In D. Wood (Ed.), Linking Government Data (pp. 205- 219). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-1767-5_10 Dodds, L., & Dodds, L. (2011). Linked Data PatternsA pattern catalogue formodelling , publishing , and consuming Linked Data and consuming Linked Data. Retrieved from http://patterns.dataincubator.org/book/ Haase, P., Rudolph, S., Wang, Y., Brockmans, S., Palma, R., Euzenat, J., & d’ Aquin, M. (2006, November). Networked Ontology Model.TechnicalReport, NeOn project deliverable D1.1.1 48 http://sig.ma/ 49 http://www.w3.org/2005/ajar/tab 50 http://ode.openlinksw.com/
  • 24. Page 24 Halb, W., Raimond,Y.,Hausenblas,M . (2008). Building Linked Data For Both Humans and Machines.. WWW 2008 workshop : Linked Data On the Web. Heath,T., & Goodwin, J. (2011). Linking Geographical Data for Government and Consumer Applications. In D. Wood (Ed.), Linking Government Data (pp. 73-92). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1- 4614-1767-5_4 Hyland, B. and Wood, D. (2011). The joy of data - a cookbook for publishing linked government data on the web linking government data. In Wood, D., editor, Linking Government Data, chapter 1, pages 3-26. Springer New York, New York, NY. Hyland, B. (2012). Linked Data Cookbook. Available: http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook. Last accessed 20th April 2012. Jim Hendler. (2012). URI Design Principles: Creating Unique URIs for Government Linked Data. Available: http://logd.tw.rpi.edu/instance-hub-uri-design. Last accessed 14th April 2012. Miller, A. (2011). Releasing Relational Data to the Semantic Web.Available: http://www.slideshare.net/alexmiller/releasing-relational-data-to-the-semantic-web-7634727. Last accessed 20th April 2012. Passant,A.,Laublet, P.,Breslin, J. G., & Decker,S. (2010). Enhancing Enterprise 2.0 Ecosystems Using Semantic Web and Linked Data Technologies:The SemSLATES Approach. In D. Wood (Ed.), Linking Enterprise Data (pp. 79- 102). Springer US. Retrieved from http://dx.doi.org/10.1007/978-1-4419-7665-9_5 Pollock, J.T. (2009). Semantic Web for dummies. Hoboken,NJ: John Wiley. Quarteroni, S. (2011). Question Answering, Semantic Search and Data Service Querying. Proceedings of the KRAQ11 workshop (pp. 10-17). Chiang Mai: Asian Federation of NaturalLanguage Processing. Retrieved from http://www.aclweb.org/anthology/W11-3102 Salas, P.,Viterbo, J.,Breitman, K., & Casanova,M. A. (2011). StdTrip: Promoting the Reuse of Standard Vocabularies in Open Government Data. In D. Wood (Ed.), Linking Government Data (pp. 113-133). Springer New York. Retrieved from http://dx.doi.org/10.1007/978-1-4614-1767-5_6 Shadbolt, N.,Hall, W., Berners-Lee,T. (2006). The Semantic Web Revisited . IEEE Computer Society. 21 (3), 96-101. Sheridan, J., Tennison, J.,. (2010). Linking uk government data.WWW2010 workshop: Linked Data on the Web (LDOW2010), ACM Sinclair, P. (2010). Building Linked Data Applications. Available: http://www.slideshare.net/metade/building-linked-data- applications. Last accessed 20th April 2012. Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., and Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data linking government data. In Wood, D., editor, Linking Government Data, chapter 2, pages 27-49. Springer New York, New York, NY. Vila-Suero, D. (2012). Best Practices URI Construction. Available: http://www.w3.org/2011/gld/wiki/223_Best_Practices_URI_Construction. Last accessed 20th April 2012. Tennison, J. (2010). Guest Post: A Developers' Guide to the Linked Data APIs - JeniTennison. Available: http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison. Last accessed 20th
  • 25. Page 25 Chaudhari, P., (2011, September). System Design Specification For Data.gov.sg, Infocomm Development Authority, Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11 Chaudhari, P., (2011, September). System Design Specification For Data.gov.sg, Infocomm Development Authority – Admin Portal,Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0,28-Sept-11 Chaudhari, P., (2011, September). System Design Specification For Data.gov.sg, Infocomm Development Authority – Database Specifications,Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11 Chaudhari, P., (2011, September). System Design Specification For Data.gov.sg, Infocomm Development Authority – SystemIntegration,Document ID: iDA/NIIT/DGS/SDD R1.0, Release 1.0, 28-Sept-11 iDA, (2011, September). SG-Data Singapore Government Data Management Specifications – Data file specifications, Version 2.2, July-11 7.0 APPENDIX 7.1 SINGAPORE DATA ECO SYSTEM – EXTENDED ANALYSIS 7.1.1 Data transferand storage in data.gov.sg (DGS) system The data and metadata is imported to Data.Gov.Sg (DGS) data store from SG DATA database through an XML interface every day. The Spatial data from different agencies are stored in OneMap database. Data.Gov.Sg (DGS) and OneMap database share the same physical location or data server. The data.gov.sg portal can access this spatial data through web service. In the database architecture, data is not stored in distinct tables as provided by the agencies. Instead the data is dumped into the destination perspective in tables representing X dimensions and Y dimensions (DGS_SG_DATA_X_DIM_TD and DGS_SG_DATA_Y_DIM_TD ) and Metadata id is the primary key joining the core tables and is used to retrieve the data sets in required formats (Chaudhari, 2011). 7.1.2 Data uploadand downloadspecifications The data in Data.Gov.Sg (DGS) tables are extracted and transformed to specific download specifications to output the data in a structured format. The input file from agencies also follow the same standards for upload with defined delimiters (^^ as starting point, | as separator for multiple dimensions of the same level and || as separator for multiple dimensions of the different levels) (iDA, 2011). 7.1.3 Destination data forms The Data.Gov.Sg (DGS) website publishes data for external consumption in the form of TXT, XML, CSV and XLS. The Geo-Spatial data from DGS are published in the forms – PGDB, SHP, and KML that are structured representation of geographical coordinates. DGS also directs the user to specific URLs - Websites of the specific data providers or agencies, HTML pages, PDF and Singstat publications that are mostly unstructured. Examples of Singstat publications of textual data visualizations are Singstat Time Series, Data visualizer and charts. Spatial data visualizations such as maps in Mytransport.sg, Integrated Land Information Service or URA site map are other examples of the availability of government data in agency websites. 7.2 Linked Data Application Development Life Cycle (Idea to App)
  • 26. Page 26 Figure 9. Idea to Application Flowchart (the flowchart walks an application idea through decision points such as: identify the agencies related to the required data (ex: LTA, DOS, SLA); is the required data published? is the license of the data of an open type? is the development based on the LOD API or SPARQL? are the data sets linked by meaningful relationships? Depending on the answers, the flow either builds the application based on the Jena framework or other LOD consumer clients, checks with the respective agencies to open up the data, identifies the API structure and works with agencies to create links between data sets by modifying ontologies, or ends the process.) 7.2 LINKED DATA APPLICATIONS Faceted browsing is another means of exploiting Linked Data. An implementation of faceted search on Wikipedia is available at http://dbpedia.neofonie.de. For example, a search on the term ‘Bukit Timah’ retrieves all types of results related to the term – Place, Building, Station, School, University, to name a few. The output of the search is provided in Figure 10. Faceted search differs from the normal search done through search engines such as Google and Bing: those are based on keywords, while faceted search is based on semantics. Faceted browsing over Singapore public data could be very useful for citizens, as they might be interested in a particular entity such as a public school and want to find out its history, nearby MRT stations, achievements and other facets (an illustrative query of this kind is sketched below).
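As a concrete illustration of the semantic retrieval behind such a faceted view, the sketch below asks DBpedia's public SPARQL endpoint for resources labelled "Bukit Timah" and the types attached to them. It is an example against public data only and is not part of the DGS system.

# Illustrative query behind a faceted view: resources labelled "Bukit Timah" on DBpedia
# and their types. Public-data example only.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?resource ?type WHERE {
        ?resource rdfs:label "Bukit Timah"@en ;
                  a ?type .
    } LIMIT 50""")

for b in sparql.query().convert()["results"]["bindings"]:
    print(b["resource"]["value"], "->", b["type"]["value"])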
  • 27. Page 27 Figure 10. Faceted Search with Linked Data We have come up with a research idea for an intelligent Question & Answer engine called Eagle Eye that works on top of Singapore Government Linked Data. This engine can answer natural language questions such as “Which recent year had a growth rate close to 50% for majority of Singapore based SME?” Our UI design of Eagle Eye is shown in Figure 11. Figure 11. An intelligent search engine idea for iDA post Linked Data migration TrueKnowledge.com, a similar Q&A engine that works on top of Wikipedia facts, is the inspiration for this idea. The proposed working of Eagle Eye can be explained in four steps. The first step is identifying the resources (Linked Data URIs) from the question. This type of semantic information extraction service is provided by DBpedia Spotlight through REST APIs. The REST call http://spotlight.dbpedia.org/rest/annotate?text=Which recent year had a growth rate close to
  • 28. Page 28 50% for majority of Singapore based SME&confidence=0.1&support=10 provides the result below: “Which recent year had a growth rate close to 50% for majority of Singapore based SME” The resources highlighted in hyperlinks are resources that have been spotted in DBpedia (Wikipedia’s datastore). Extending this service over iDA’s resource index would spot URIs in the domain http://data.gov.sg. The second step is to identify the relationships between the resources. For example, the relationships that can be inferred from this example are: SME is an instance of the Organization class, the Organization class comes under the country Singapore, Growth rate is a property of the Sales class, Year is a class and Majority is a subset of the Group class. These relationships can be deduced by running SPARQL queries with these resources as conditions in the WHERE clause. The third step in the process involves the use of NLP techniques such as Syntactic Analysis for splitting the sentence into MAIN and SUB trees. This technique is implemented using the Stanford Parser. The next NLP technique is Focus Extraction, for deducing the Access Pattern (AP) used to access the parse tree. The fourth step is the actual lookup of RDF triples that match the conditions (a minimal sketch of the first, resource-spotting step is given at the end of this appendix). The implementation of this engine is reliant on the setup of URIs for all government resources. Eagle Eye is an intelligent application that makes full use of the Linked Data capability. iDA’s overall service offering would benefit greatly from applications of this type. 7.3 OTHER RECOMMENDATIONS 7.3.1 Kasabi for Linked Data Hosting The Kasabi platform is cloud based and allows the upload of external data sets. We did not upload any government data set for security reasons. Geonames51, an existing data set that allows public subscription, was used as an alternative. The screenshot of the Geonames data set homepage is available in Figure 12. There is a SPARQL endpoint for creating queries, along with search APIs and data extraction APIs. iDA can use Kasabi to upload its annual data sets that have an open license. Suggested examples are data sets containing location and attribute-level data about public utility services in Singapore. Some of the government data sets that can be accessed via Kasabi are listed in this link52 Figure 12. GeoNames home page in Kasabi Platform for Linked Data hosting
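Referring back to the Eagle Eye engine described in Section 7.2, the snippet below illustrates its first step (resource spotting) by calling the DBpedia Spotlight annotate service with the parameters cited above. The endpoint URL is the one quoted in the report and may have moved since, and the JSON keys used here are assumptions about the service's response format.

# Illustrative first step of Eagle Eye: spot DBpedia resources in the question via the
# DBpedia Spotlight annotate service. Endpoint and response-key names are assumptions
# based on the public service as cited in the report and may have changed.
import requests

QUESTION = ("Which recent year had a growth rate close to 50% "
            "for majority of Singapore based SME?")

resp = requests.get(
    "http://spotlight.dbpedia.org/rest/annotate",      # URL as cited in the report
    params={"text": QUESTION, "confidence": 0.1, "support": 10},
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()

for res in resp.json().get("Resources", []):
    print(res.get("@surfaceForm"), "->", res.get("@URI"))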