Proposal for Designing a Linked Data Migrational Framework for Singapore Government Data sets


Published in: Technology, Education

  1. 1. NANYANG TECHNOLOGICAL UNIVERSITY
     Wee Kim Wee School of Communication & Information
     K6299 – Critical Inquiry in Knowledge Management
     Proposal for Designing a Linked Data Migrational Framework for Singapore Government Data Sets
     Under the guidance of Dr. Khoo Soo Guan, Christopher (Assoc Prof)
     Submitted by SESAGIRI RAAMKUMAR ARAVIND (G1101761F), THANGAVELU MUTHU KUMAAR (G1101765E), KALEESWARAN SUDARSAN (G1001065F)
  2. 2. Introduction
     “The Internet is becoming the town square for the global village of tomorrow.” This quote from Bill Gates, Chairman of Microsoft, aptly pictures today's business scene, in which the internet is the dominant medium for connecting with resources across geographies and enables voluminous transactions with ease. The challenge now lies in enabling machines to read and understand data on the internet, so that chains of intelligent transactions can replace steps that were previously manual because the traditional WWW presents content in a format meant for human understanding. This idea was formulated as the Semantic Web, in which content is defined with semantics (Berners-Lee, Hendler & Lassila, 2001). Based on this concept, the Linked Data principles were released to guide individuals, enterprises and public bodies in releasing their data in a common standard, RDF (Resource Description Framework), to form a web of data (Berners-Lee, 2006). Standardised data representation provides more scope for interlinking data sets across domains, creating avenues for multi-point usage and knowledge discovery through intelligent software applications built on top of it.
     The large-scale application of Linked Data explored here is the eGovernment (eGov) initiatives of the US, the UK and many other nations to publish their Open Government Data (OGD) pertaining to governance and public affairs, for transparency and value co-creation, to empower people with appropriate knowledge. The recent Open Government Partnership [1] mandates nations to publish their OGD in linked data format. Many nations have started to publish their data as linked data, the latest being the Brazil data portal [2]. The start of the Linked Data movement spurred the release of new data sets, highlighted by the LOD cloud [3] maintained by the CKAN [4] registry. The US and UK governments have realised the benefits by releasing selected data sets in linked data format on their respective portals, such as data.gov [5].
     Well-defined relationships between these datasets, together with ready-made applications, guide the public's daily activities related to transport, business and other needs. Some of the existing applications are Numberhood [7], FixMyTransport [8], BIS Research Funding Explorer [9], SemaPlorer [10] and the “Linking Wildland Fire and Government Budget” mashup [11].
     [1] Open Government Partnership
     [2] Brazil Data Portal
     [3] The LOD cloud diagram shows datasets that have been published in Linked Data format by contributors to the Linking Open Data community project and other individuals and organisations
     [4] Comprehensive Knowledge Archive Network
     [5] data.gov
     [7] http://www.Numberhood.net
  3. 3. The current OGD scenario in Singapore does not make use of Linked Data standards. This proposal aims to suggest a migrational framework from the existing system of data publishing. As a starting point, a study is being done on the current OGD ecosystem in Singapore. iDA [12] maintains the portal that handles data collated from different government agencies (Chee Hean, 2011). The data portal aims to meet the Singapore public's data needs and to establish a co-creative environment. The data is provided in different structured and unstructured formats such as txt, Excel, PDF, XML, web pages and maps, and also in the form of agency-specific Application Programming Interfaces (APIs) and web services. There are multiple endpoints for data consumption. Prominent examples include OneMap API [14], Singapore Statistics [15], mytransport.sg [16] and Integrated Land Information Services [17]. There is some redundancy in data across the different sources in the current OGD ecosystem, with limited interlinking and re-use capabilities. The vocabularies used by the agencies are specific to each agency, with limited standardisation of commonly used terms. As a result, building a mash-up application that leverages data across agencies is complex. The study has indicated scope for applying linked data, since it requires standardised data representation at the source level and a common interface at the publication level, with the data sets linked by interconnected vocabularies.
     Fig1: Linked Data implementation over current DGS (DATA.GOV.SG) Ecosystem
     [12] Infocomm Development Authority of Singapore (iDA)
     [14] http://www.onemap.sg
     [16] http://mytransport.sg
  4. 4. Objectives of the Proposal
     The current study aims to build a linked data migrational framework that iDA and Singapore Government agencies could use to publish their data sets to the public as linked data. A multi-step methodology would be devised, with clearly defined activities and deliverables at each step, based on the current ecosystem of OGD publishing portals in Singapore. Geographical and statistical data have been selected for describing each step in the framework. The framework build process is based on the metadata and specifications provided by iDA and government agencies. The current study focuses on linking the internal data sets. Additionally, it aims to provide recommendations on a few use cases that leverage external linked data. The holistic nature of the framework will be validated with geographical and statistical data provided by SLA and DOS.
     Other objectives of the study are as follows:
     1.) Explore case studies pertaining to the implementation of Linked Open Government Data.
     2.) Prepare an inventory by assessing different linked data tools, technical frameworks and processes.
     3.) Provide recommendations for linked data implementation according to the nature of the government agency.
     4.) Build an Ontology Network model (Haase, Rudolph, Wang et al., 2006) meant to unify vocabularies from different agency domains.
     5.) Build a POC application based on the devised methodology to validate its applicability. This objective is subject to the availability of sufficient time and infrastructure.
     The migrational framework will be useful for iDA in formulating its Linked Data implementation strategy in the near future, as the government body intends to make its portal a cornerstone portal for OGD publication.
     The common output interface suggested by the framework will showcase the potential of unifying the different endpoints provided by the agencies, thereby simplifying access and facilitating the creation of applications that integrate data from disparate sources. The ontology network suggested by the framework will help the agencies standardise vocabulary across domains, so that they better understand their own data and its relation to data from other agencies.
     The framework can also be used by enterprises and individuals to understand the steps, tools and processes involved in releasing their data to the WWW as linked data.
  5. 5. Literature Review
     The Semantic Web facilitates a web of data [18] that works on top of the URI [19], RDF [20], Ontology [21] and SPARQL [22] concepts. Resources and values are identified and described in a common standard, RDF, based on a modelled ontology that specifies the relationships (Berners-Lee, Hendler & Lassila, 2001). The LOD2 [23] initiative aims to build a LOD stack of products, frameworks and processes to accelerate the implementation of linked data across the globe. W3C has set up two committees [24] to provide best practices and recommendations for governments to publish their OGD in a standardised linked data format. Bizer, Heath, Idehen & Berners-Lee (2008), Villazón-Terrazas, Vilches-Blázquez, Corcho & Gómez-Pérez (2011) and Hyland & Wood (2011) provide cookbooks and guidelines for converting OGD to Linked Data format. They are helpful in understanding the general steps and tools required to convert and publish OGD as Linked Data. Governments that are new to a Linked Data publication strategy, however, need a tailored migrational framework specific to the local OGD ecosystem. Such a customised framework could be used by the government steering committee to expedite the migration to LOGD format.
     Methodology
     Prior to this proposal, the project team held discussions with iDA staff, SLA staff and NIIT staff (the IT vendor supporting the DGS [25] platform) to get a basic understanding of the current architecture and to identify the DGS components that could accommodate changes as part of this study. Primary data would be provided by iDA and SLA.
     The data sets selected for the study are listed in Table 1.1. These seemingly disparate datasets can be connected to give context-specific knowledge on each site, so that prospective tenderers can gain insights into consumer and locality trends based on the demographics.
     [18] Linked Data and Web of Data
     [19] Uniform Resource Identifiers (URIs) are short strings that identify resources in the web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way
     [20] RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed
     [21] Ontologies or vocabularies define the concepts and relationships (also referred to as “terms”) used to describe and represent an area of concern.
     [22] SPARQL is an RDF query language; its name is an acronym that stands for SPARQL Protocol and RDF Query Language.
     [23] LOD2 Project
     [25] DGS: data store
  6. 6. Data set | Agency | Category | Data type
     Resident Population by DGP Zone/Subzone and Age Group, Type of Dwelling, Ethnic Group | Department of Statistics | Population and Household Characteristics | Textual
     Sites Sold by URA - Details | Urban Redevelopment Authority (URA) | Housing and Urban Planning | Textual
     Table 1.1: Primary datasets used for the study
     Only the latest year's data, rather than the entire data sets, would be used for the study. The secondary data for the research study would be extracted from LOGD statistical and geospatial data sets on the portal thedatahub.org for building the framework. The migrational framework will be customised to the current architecture of DGS, because its steps will be devised from an understanding of the different layers in DGS; the framework will nevertheless be generic enough to apply to other cases. The project team would conduct interviews with iDA support staff to collect specification documents and insights relevant to the current architecture of DGS.
     The framework formulation would be based on a context-specific integration of the different approaches put forth by LOGD activists, researchers and practitioners. The steps in the framework will be sequential, each comprising sub-steps that cover its intrinsic activities. For example, object modelling of the different data objects in the selected data sets is a step that precedes the RDF modelling and ontology/vocabulary building steps. The steps will be substantiated with sample implementations using the primary data. Suggestions from the W3C LOGD steering groups [10] will be taken into account in formulating the framework. The tools identified as part of the inventory will be used for framework activities such as RDF creation, RDF storage and ontology re-use/modelling.
     Difficulties and Issues
     Agencies do not provide raw data to iDA.
     Aggregated report data is split into X dimensions representing columns, Y dimensions representing rows, and data points representing cells. These fields are provided in an XML file that is sent to iDA on a periodic basis. There is no separate master data file, and the hierarchy in the master data dimensions is not explicitly set or provided. Therefore, a mechanism needs to be devised to identify the master data and the relationships between the different levels in the master data dimensions. This mechanism may not serve as a generic transformation applicable to all agencies, because the data representation in the files is implicit.
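As a rough illustration of the column/row/cell structure described above, the sketch below parses a small report and emits one observation per cell as subject-predicate-object triples. The XML element names (report, xdim, ydim, cell), the example.org namespace, the labels and the cell values are all invented for illustration; the actual file layout sent to iDA is not specified in this proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout and made-up values; the real schema is not given.
SAMPLE = """
<report id="sample-report">
  <xdim id="x1" label="Column A"/>
  <xdim id="x2" label="Column B"/>
  <ydim id="y1" label="Row A"/>
  <ydim id="y2" label="Row B"/>
  <cell x="x1" y="y1" value="10"/>
  <cell x="x2" y="y1" value="20"/>
  <cell x="x1" y="y2" value="30"/>
  <cell x="x2" y="y2" value="40"/>
</report>
"""

BASE = "http://example.org/dgs/"  # placeholder namespace, not a real DGS URI


def report_to_triples(xml_text):
    """Turn each cell into three triples: its column, its row and its value."""
    root = ET.fromstring(xml_text.strip())
    report_id = root.get("id")
    # Look-up tables from dimension id to human-readable label.
    columns = {d.get("id"): d.get("label") for d in root.findall("xdim")}
    rows = {d.get("id"): d.get("label") for d in root.findall("ydim")}

    triples = []
    for cell in root.findall("cell"):
        x, y = cell.get("x"), cell.get("y")
        # One URI per observation, built from the row and column ids.
        obs = f"{BASE}{report_id}/obs/{y}-{x}"
        triples.append((obs, BASE + "column", columns[x]))
        triples.append((obs, BASE + "row", rows[y]))
        triples.append((obs, BASE + "value", int(cell.get("value"))))
    return triples


triples = report_to_triples(SAMPLE)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

The sketch treats every dimension as flat. Recovering the implicit hierarchy between dimension levels, which the paragraph above identifies as the harder problem, is not attempted here.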
  7. 7. The data conversion to RDF will not be done at the agency level; instead, it will be done on top of the data model in the iDA data store. This leads to data duplication, because the data is converted to RDF for the Linked Data implementation.
     There is no master data management system in place right now that standardises the dimension values across agencies. Standardisation is required to link common data in the data sets used in the study. This might be a complex task because different versions of master data values occur within a single data set and also across data sets.
     The current OGD ecosystem of Singapore provides multiple endpoints to users, such as APIs, web services and files. A common endpoint in the form of a Linked Data API would mean building different wrappers over these endpoints. The diagram below, from Bizer, Heath, Idehen & Berners-Lee (2008), illustrates the different approaches to implementing linked data over existing systems.
     Fig2: Different Linked Data Implementation Approaches
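The standardisation issue noted above, where the same master data value appears in different versions, can be sketched as a simple normalisation step that groups spelling variants under one key. This is only an assumed approach, not a method prescribed by the proposal: the function names are hypothetical, the sample variants are invented, and real master data would likely need curated mappings in addition to string clean-up.

```python
import re
from collections import defaultdict


def canonical_key(label):
    """Lowercase the label, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", label.lower())
    return cleaned.strip()


def group_variants(labels):
    """Group raw dimension values that share the same canonical key."""
    groups = defaultdict(set)
    for label in labels:
        groups[canonical_key(label)].add(label)
    return dict(groups)


# Invented variants of the kind that could appear across agency data sets.
labels = ["Ang Mo Kio", "ANG MO KIO", "Ang-Mo-Kio ", "Bedok"]
groups = group_variants(labels)
for key, variants in groups.items():
    print(key, sorted(variants))
```

Each canonical key could then back a single URI, so that values from different data sets link to the same resource.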
  8. 8. Schedule
     The schedule for the study is covered in the embedded Gantt chart.
     Gantt Chart-iDALinked Data Project.xlsx
     Proposed Report Outline
     The proposed final report will be structured in the following format.
     1. Abstract
     2. Introduction
        a. Introduction to Linked Data and its relevance to Open Government Data and eGov
        b. Overview of SG OGD Ecosystem
     3. Literature Review
        a. Government Linked Data Implementation Cookbooks, Guidelines and Recommendations
           i. URI formulation
           ii. RDF creation
           iii. Ontology Formulation
           iv. Publication and Exploitation
     4. Migrational Framework
        a. Multi-step methodology
           i. Formulation and Description
           ii. Examples
     5. Implementation Results and Observations
        a. POC details
        b. Description of issues faced in implementation
     6. Limitations
     7. Conclusion and Recommendations
     A few new sections and sub-sections might be added in the final report.
     Dissemination of Results
     The migrational framework will be published in the form of a report, subject to review by the NTU Supervisor, followed by submission to iDA. The researchers plan to publish the report as a conference paper in the later part of the year.
  9. 9. References
     Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34.
     Berners-Lee, T. (2006). Linked Data. Available: Last accessed 11th Jan 2012.
     Chee Hean, T. (2011). Keynote Address by Mr Teo Chee Hean, Deputy Prime Minister, Coordinating Minister for National Security and Minister for Home Affairs at the e-Gov Global Exchange 2011. Available: Last accessed 11th Jan 2012.
     Bizer, C., Heath, T., Idehen, K., & Berners-Lee, T. (2008). Linked Data: Evolving the Web into a Global Data Space. (J. Hendler & F. Van Harmelen, Eds.) Proceeding of the 17th international conference on World Wide Web WWW 08 (Vol. 1, p. 1265). ACM Press.
     Villazón-Terrazas, B., Vilches-Blázquez, L., Corcho, O., & Gómez-Pérez, A. (2011). Methodological guidelines for publishing government linked data. In Wood, D., editor, Linking Government Data, chapter 2, pages 27-49. Springer New York, New York, NY.
     Hyland, B., & Wood, D. (2011). The joy of data - a cookbook for publishing linked government data on the web. In Wood, D., editor, Linking Government Data, chapter 1, pages 3-26. Springer New York, New York, NY.
     Haase, P., Rudolph, S., Wang, Y., Brockmans, S., Palma, R., Euzenat, J., & d'Aquin, M. (2006, November). Networked Ontology Model. Technical Report, NeOn project deliverable D1.1.1.