Proposed Linked Data Migration Framework for Singapore Government Datasets



Critical Inquiry Presentation on 'Designing a Linked Data Migrational Framework for Singapore Government Datasets'

Published in: Technology, Business

  • DBpedia: places and events; CIA and World Bank: economic analysis; Flickr: places; FAO: export and import commodities; Supreme Court: facts

    1. • Basics of Linked Data
       • Purpose of this project
       • Migrational Framework
       • Eight Steps
       • Use Cases
       • Conclusion
    2. Opportunity of linking data across various domains and types
       • Domains: Governments, Enterprises, Entertainment, Libraries & Museums, Social Media Data (Blogs, Facebook), Business
       • Types of Data: Factual Data, Transactional Data, Textual Data, Spatial Data, Multimedia, Files & Database
    3. Mr. Brendan Luyt’s associated publication search… (Traditional Approach) (Linked Data Approach) Mr. Lee Kuan Yew! An exploration! Others…
    4. Linked Open Data Cloud (Web of Data)
    5. Linked Open Data Cloud (Web of Data)
    6. Data.Gov.Sg
       • IDA Singapore launched the portal and mGov@SG public services during June
       • Provides 5000+ public data sets from 50 government agencies
       • Purpose: research and building applications using the data
    7. SG Government Data Eco System (diagram)
       • Data forms shown in the DGS Eco System: structured data (XLS), unstructured data (textual: HTML, PDF), spatial data, APIs, agency websites, Singstat publications
       • Ministries: Ministry of Community Development, Youth & Sports; Ministry of Education; Ministry of Foreign Affairs; Ministry of Health; Ministry of Law – Community Mediation Unit; Ministry of Manpower; Ministry of Transport
       • Statutory boards and other agencies: Accountant-General's Department; Accounting and Corporate Regulatory Authority; Agency For Science, Technology & Research; Attorney-General’s Chambers; Building & Construction Authority; Central Narcotics Bureau; Central Provident Fund Board; Civil Aviation Authority of Singapore; Department of Statistics; Economic Development Board; Energy Market Authority; Health Sciences Authority; Housing & Development Board; Immigration & Checkpoints Authority; Infocomm Development Authority of Singapore; Inland Revenue Authority of Singapore; Institute of Technical Education; Intellectual Property Office of Singapore; JTC Corporation; Judiciary, Subordinate Courts; Judiciary, Supreme Court; Land Transport Authority; Majlis Ugama Islam Singapura; Maritime & Port Authority of Singapore; Media Development Authority; Monetary Authority of Singapore; Nanyang Polytechnic; National Environment Agency; National Heritage Board; National Library Board; National Parks Board; Ngee Ann Polytechnic; People's Association; Public Service Division; Public Transport Council; Public Utilities Board; Republic Polytechnic; Sentosa Development Corporation; Singapore Civil Defence Force; Singapore Customs; Singapore Land Authority; Singapore Police Force; Singapore Polytechnic; Singapore Sports Council; Singapore Workforce Development Agency; Spring Singapore; Temasek Polytechnic; Urban Redevelopment Authority
       • APIs:
         • Map-related APIs from various agencies
         • Traffic-related APIs from Land Transport Authority
         • Tourism-related APIs from the Singapore Tourism Board
         • Environment-related APIs from the National Environment Agency
         • Library-related data feeds & web services from National Library Board
       • THEMES, grouped by CATEGORIES:
         • C - Community: Infocomm Access, Silver infocomm, Comm Mediation Center, BFA Buildings, CD Councils, Community Clubs, Constituency offices, Other facilities, Other Pan networks, PA head quarters, Residents Committee, Water Venture
         • Cul - Culture: Heritage sites, Monuments, Museums, Libraries, Streets and Places
         • E - Environment: Green Building, After Death Facilities, Funeral Parlours, Hawker Center, NEA Offices, Skyrise greenery, Recycling Bins, Waste Disposal Site, Waste Treatment
         • Emp - Employment: CET Centers, WDA Service points
         • Edu - Education: Kindergartens
         • H - Health: Breast Screen, Cervical Screen, Healthier Dining, Quit Centers, Dengue Cluster
         • F - Family: Child care, Disability, Elder care, Family, Family Friendly Estab, Student Care
         • R - Recreation: Wireless Hotspots, ABC Water Proj, National Parks
         • S - Sports: Sports clubs
       • OPERATIONS: Get Token, Address Search, Agency Data Search, Static Map, Get Layer Info, Mashup, Get Related Data, Get Directions, Public Transportation, Reverse Geocode
    8. Drawbacks of Existing Data Ecosystem
       • Siloed architecture
       • Absence of vocabulary standardization (common language)
       • Multiple data consumption end points
       • Steep learning curve for developers during the application development process
       • Absence of interlinking between data sets
       Linked Data addresses the drawbacks identified above at multiple levels:
       • Data Storage: can support distributed storage
       • Data Representation: common format (RDF) for both data and metadata
       • Data Consumption: via a single output terminal (SPARQL)
       • Data Interlinking: use of ontologies (vocabularies)
       IDA can use Linked Data on top of their traditional systems instead of going for a complete overhaul.
    9. UK Linked Data Implementation
    10. Linked Data Representation Format: RDF (Subject-Predicate-Object)
        • Example: Jurong (subject) belongs to (predicate) the West Zone (object)
        • Values shown in the diagram: 0.21222, 12.5555
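The subject-predicate-object statement on this slide can be written down as a small sketch. It assumes a placeholder namespace (`http://example.org/dgs/`) and a hypothetical predicate name `belongsTo`; the slides do not give the real URIs.

```python
from typing import NamedTuple

# Hypothetical namespace: the slides do not give the real DGS base URI.
BASE = "http://example.org/dgs/"


class Triple(NamedTuple):
    """One RDF statement whose three terms are all URIs."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as a single N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# "Jurong belongs to the West Zone" as subject, predicate, object.
jurong_zone = Triple(BASE + "Jurong", BASE + "belongsTo", BASE + "WestZone")
line = to_ntriples(jurong_zone)
print(line)
```

Keeping all three terms as URIs is what lets other data sets refer to the same Jurong or West Zone resource later on.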
    11. Why are we doing this project?
        • To prescribe a Linked Data migrational framework for DGS data sets
          • First-hand view of the required migration activities
          • Issues anticipated at each step
          • Evaluation and recommendation of Linked Data tools
        • To help IDA in understanding the benefits of Linked Data
    12. Framework Formulation Process
        • Based on a study of Linked Data migration research papers and cookbooks published by the World Wide Web Consortium (W3C)
        • Analysis of Linked Data implementations in the UK, US and Brazil
        • Evaluation of Linked Data tools with Singapore data sets, to recommend tools for each step of the framework
        • Consideration of probable issues that could be faced during implementation
    13. Datasets Used for Framework Evaluation
        • URA Sites for Sales dataset (Urban Planning)
        • DOS Population and Household Characteristics dataset (Population Demographics)
          • Age Pyramid of Resident Population
          • Old Age Support Ratio
    14. Proposed Linked Data Migrational Framework for DGS (diagram)
        • Eight steps, each drawn with input, process and output: Specification (Identification, Analysis), Object Modeling, Ontology Modeling, URI Naming, RDF Creation, External Linking, Datasets Publication, Discovery & Exploitation
        • Resource allocation values shown: 10, 15, 15, 5, 20, 5, 15, 15
        • Resources shown include Govt Agencies and IDA, Domain Experts, Ontology Modelers, Web Architects and Developers
        • Other labels in the diagram: Objectives, Project Duration, Dataset Prioritization, Dataset License Setting, Impln Mode Selection, Roadmap Drawing; Relational Model, ER Model, Dataset Overview, Architecture Overview; Conceptual View, Class and Properties; Re-use of Existing Vocabularies, Creation of New Vocabularies; OWL, RDFS, RDF vocabulary files; URI Administration, URI Lifecycle; S2R, D2R, A2R; Spreadsheets, DBMS, API; Conversion to RDF triples using Mapping files; Linking based on Similarity Algorithms, Outbound Links; Data Insertion, VOID Modeling, Data Retrieval, API to SPARQL conversion; SPARQL, API, JSON data; Visualization, Gamification, Crowdsourcing, Catalog Registration, External Reference; Existing Apps, New Apps
    15. Specification (Key Points)
        • Identification: addressing security concerns with licenses
          • The Open Database License (ODbL)
          • Open Data Commons Attribution License
          • The Creative Commons Licenses
        • Analysis
          • Understand database specifications (relational model & ER model)
          • Seven issues identified at data storage and consumption level
        • Implementation modes
          • Linked Data only (just URI linking): 1st level. Ideal for testing the URI lifecycle. Decision on URI Administration: Centralized (DGS) vs. Decentralized (Agency)
          • Linked Data + RDF: 2nd level. Complete realization of Linked Data and Semantic Web standards. Decision to use this mode can be taken after evaluation of POC
          • Linked Data for files only (URIs for files): Optional. To improve the discovery of files in DGS through semantic annotation
    16. Object Modeling
        • This is modeling without usage context
        • Requires normalization of the database model to third normal form (3NF)
        • Key Learning
          • Ease in identifying the use of common objects across data sets
          • Facilitates brainstorming of relationships between objects
        • Issues
          • Possibility of applying high abstraction and high granularity to objects
    17. Ontology Modeling
        • Takes the output conceptual diagram from Object Modeling as input
        • Key Impetus
          • Re-use of popular vocabularies (table below)
          • Use of the StdTrip methodology for arriving at ontologies for relational databases
        • Issues
          • Conflicting vocabulary in … and OneMap
          • Different levels of granularity in datasets (ex: Location in URA ‘Site for Sales’ dataset)
        • Use Case Problem Statement: consider an industrial entrepreneur intending to buy a site from the Urban Redevelopment Authority (URA)

        Predicate/Vocabularies          | Purpose
        rdfs:label and skos:prefLabel   | Naming things
        Geonames                        | Model spatial data
        VoID Description                | Describe RDF schema or vocabulary
        vCard                           | Describing address
        RDF, RDFS                       | Model simple data
    18. Ontology Modeling
        • Date fields, location fields and fields related to measurements in DGS have scope for vocabulary re-use
        • Vocabulary for the identified data sets (developed using Protégé) with screenshots
        • List of vocabularies required for LOGD implementation
        • List of tools used for ontology modeling
        • Output? Allocation percentage? Personnel involved?
    19. URI Naming (Key Points)
        • A Uniform Resource Identifier (URI) is analogous to the assignment of an IP address to every computer
        • Identified URI Administration Modes
          1.) Maintained centrally in the DGS platform (resultant URIs will start with …): RECOMMENDED
          2.) Maintained by individual agencies (resultant URIs will start with … or …)
          3.) Maintained externally by third-party platforms such as Kasabi (resultant URIs will start with …)
        • Dataset URIs (ABOX, TBOX): Dataset ID URAstaticfile001; Dataset URAstaticfile001/; Class; Property 1; 1 - A generic column
        • Issues
          • Usage of different Linked Data tools can hamper URI naming
          • Possibility of dead links
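As a minimal sketch of the centralized and agency-maintained modes, the helper below mints dataset, class and resource URIs from a base prefix and the dataset ID `URAstaticfile001` shown on the slide. The base prefixes, the `Site` class name and the `mint_uri` helper are all assumptions for illustration, since the real prefixes are missing from the source.

```python
from urllib.parse import quote

# Placeholder prefixes: the source elides the real ones for each mode.
CENTRAL_BASE = "http://example.org/dgs/"   # mode 1: maintained centrally
AGENCY_BASE = "http://ura.example.org/"    # mode 2: maintained by an agency


def mint_uri(base: str, dataset_id: str, *segments: str) -> str:
    """Join base, dataset ID and optional path segments, percent-encoding each part."""
    parts = [dataset_id, *segments]
    return base + "/".join(quote(part.strip(), safe="") for part in parts)


dataset_uri = mint_uri(CENTRAL_BASE, "URAstaticfile001")
class_uri = mint_uri(CENTRAL_BASE, "URAstaticfile001", "Site")
resource_uri = mint_uri(CENTRAL_BASE, "URAstaticfile001", "Site", "Pasir Ris")
agency_uri = mint_uri(AGENCY_BASE, "URAstaticfile001")

print(dataset_uri)
print(class_uri)
print(resource_uri)
print(agency_uri)
```

Routing every tool through one minting function is one way to limit the risk, noted on the slide, that different Linked Data tools produce inconsistent URIs.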
    20. RDF Creation
        • Evaluated 3 tools for the modes of conversion: Google Refine, RDF Views and RDF Sponger
        • Issues
          • Absence of notification about API outages can cause the system to return null or invalid results
          • Google Refine doesn’t create URIs for each row in the static file
          • Changes to tables or API output made without corresponding changes to the mapping files will affect RDF conversion
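To illustrate what per-row URIs for a static file look like, here is a sketch that converts CSV rows to N-Triples using only the Python standard library. It is not how Google Refine, RDF Views or RDF Sponger work internally; the column names, sample rows and base URI are invented for illustration.

```python
import csv
import io

# Placeholder base URI and invented sample rows, for illustration only.
BASE = "http://example.org/dgs/URAstaticfile001/"
SAMPLE_CSV = "site_id,location,zone\nS001,Pasir Ris,East\nS002,Jurong,West\n"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text: str, key_column: str) -> list[str]:
    """Give each row its own subject URI and emit one triple per remaining column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}row/{row[key_column]}>"
        for column, value in row.items():
            if column == key_column:
                continue  # the key is already encoded in the subject URI
            predicate = f"<{BASE}property/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


triples = rows_to_ntriples(SAMPLE_CSV, key_column="site_id")
for triple in triples:
    print(triple)
```

Here the column-to-predicate rule plays the role of the mapping file: if a column is renamed in the source table, the generated predicates change with it, which is the kind of drift the last issue on the slide warns about.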
    21. External Linking (Key Points)
        • External Linking is connecting with other data sets in the web of data
        • Data sets shown: CIA World Factbook, World Bank, DBpedia, Flickr, FAO, Geonames, Supreme Court
        • Link pattern shown: <> <owl:sameAs> <> and <> <owl:sameAs> <>
        • Issues
          • The outbound links made to data sets outside of IDA’s purview can be risky
          • Dead links are a real possibility when resource URIs change or during system downtime
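A minimal sketch of similarity-based linking follows. It matches local labels against external labels with `difflib.SequenceMatcher` and emits `owl:sameAs` triples above a threshold. The resource URIs, labels and the 0.7 threshold are hypothetical; dedicated link-discovery tools such as SILK and LIMES use richer, configurable comparisons.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical local and external resources (URI -> label).
local = {
    "http://example.org/dgs/place/PasirRis": "Pasir Ris",
    "http://example.org/dgs/place/Jurong": "Jurong",
}
external = {
    "http://external.example.org/resource/Pasir_Ris": "Pasir Ris Town",
    "http://external.example.org/resource/Jurong": "Jurong",
    "http://external.example.org/resource/Changi": "Changi",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two labels, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_as_links(local: dict, external: dict, threshold: float = 0.7) -> list[str]:
    """Link each local resource to its best external match if it clears the threshold."""
    links = []
    for local_uri, local_label in local.items():
        best_uri, best_score = None, 0.0
        for ext_uri, ext_label in external.items():
            score = similarity(local_label, ext_label)
            if score > best_score:
                best_uri, best_score = ext_uri, score
        if best_uri is not None and best_score >= threshold:
            links.append(f"<{local_uri}> <{OWL_SAME_AS}> <{best_uri}> .")
    return links


links = same_as_links(local, external)
for link in links:
    print(link)
```

Because the targets sit outside the publisher's control, generated links would need periodic re-checking to catch the dead links mentioned on the slide.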
    22. Datasets Publication (Key Points)
        • Panels: Metadata Publication; Datasets Publication
        • Flow: RDF Triples → Triple Store → Linked Data API hosting → conversion from RDF to JSON → JSON output
        • The Linked Data API call childcare/north is translated through the LDA-SPARQL mapping file into the SPARQL query:
            SELECT ?cc WHERE { ?cc dgs:haszone dgs:north . ?cc dgs:facilitytype dgs:childcare . } LIMIT 100
        • JSON output: up to 100 entries (Entry: name1, Entry: name2, … Entry: name100), each with its Http:// URI
        • Recommendations
          • Linked Data hosting platforms are best suited for open license datasets (ex: Singstat publications)
          • Use of APIs for updating RDF triples instead of a SPARQL Update document
          • Use of VOID generators for creating statistics triples
        • Issues
          • Difficulty for application developers: SPARQL 1.0 does not support sub-queries, views, stored procedures etc.
          • Inferencing is not possible with the Linked Data API
          • Security implementation with 3rd party Linked Data hosting platforms
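The API-to-SPARQL translation on this slide can be sketched as a small function that turns a path such as `childcare/north` into the query shown, and a second one that wraps result URIs as JSON. This is an assumed, simplified stand-in for a mapping file, not the Linked Data API's actual configuration format; the `dgs:` prefix is used as on the slide and would still need a `PREFIX` declaration, whose namespace the source does not give.

```python
import json
import re

# Only simple names are accepted, so a path cannot inject extra SPARQL.
SAFE_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def path_to_sparql(path: str, limit: int = 100) -> str:
    """Translate '<facility type>/<zone>' into the SELECT query from the slide."""
    facility, zone = path.strip("/").split("/")
    for name in (facility, zone):
        if not SAFE_NAME.match(name):
            raise ValueError(f"unsupported path segment: {name!r}")
    return (
        "SELECT ?cc WHERE { "
        f"?cc dgs:haszone dgs:{zone} . "
        f"?cc dgs:facilitytype dgs:{facility} . "
        f"}} LIMIT {limit}"
    )


def results_to_json(uris: list[str]) -> str:
    """Wrap the URIs bound to ?cc in a simple JSON document (layout is an assumption)."""
    return json.dumps({"entries": [{"uri": uri} for uri in uris]})


query = path_to_sparql("childcare/north")
print(query)
print(results_to_json(["http://example.org/dgs/childcare/name1"]))
```

Application developers would then call the short path and never see SPARQL, which addresses the learning-curve issue, though inferencing remains outside what such a translation layer offers.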
    23. Discovery & Exploitation
        • Key Theme
          1.) Internal discovery within Singapore for local citizens
          2.) External discovery, to attract usage of Singapore government data in international economic and political research and on global issues (water scarcity, carbon footprint etc.)
        • Internal discovery can be improved by offering different end points (SPARQL, API, apps, RDF dumps), running awareness programs on the availability of these data sets, and employing crowdsourcing and gamification techniques to enhance their visibility and utility
        • External discovery is optional if IDA wishes to keep the DGS system limited to Singapore's purview. It can be initiated by registering the datasets in open government dataset portals (potential candidates are datasets with an open license)
    24. Interlinked Datasets Post-Migration
        • Original data provided by URA
        • Similarly, location-based data from the OneMap API is retrieved for Pasir Ris
        • Possible because of the re-use of the common resource URI for Pasir Ris across data sets
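The effect of re-using one resource URI across data sets can be sketched with two tiny triple lists. Every URI, the `locatedIn` predicate and the records below are hypothetical; only the idea of joining on a shared Pasir Ris URI comes from the slide.

```python
from collections import defaultdict

# Hypothetical shared resource URI re-used by both data sets.
PASIR_RIS = "http://example.org/dgs/place/PasirRis"
LOCATED_IN = "http://example.org/dgs/locatedIn"

# Invented sample triples standing in for URA and OneMap data.
ura_triples = [
    ("http://example.org/dgs/site/S001", LOCATED_IN, PASIR_RIS),
    ("http://example.org/dgs/site/S002", LOCATED_IN, "http://example.org/dgs/place/Jurong"),
]
onemap_triples = [
    ("http://example.org/dgs/childcare/C042", LOCATED_IN, PASIR_RIS),
]


def resources_linked_to(uri: str, datasets: dict) -> dict:
    """Collect, per data set, the subjects of all triples whose object is the given URI."""
    found = defaultdict(list)
    for name, triples in datasets.items():
        for subject, _predicate, obj in triples:
            if obj == uri:
                found[name].append(subject)
    return dict(found)


linked = resources_linked_to(PASIR_RIS, {"URA": ura_triples, "OneMap": onemap_triples})
print(linked)
```

No lookup table between the two sources is needed here: the join works only because both data sets name the place with the same URI.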
    25. Other Interesting Use Cases
        • Q & A engine that works on top of government linked data. Inspired by … Definitely not science fiction!
    26. Sense-Making
        • Question: Which recent year had a growth rate close to 50% for the majority of Singapore-based SMEs?
        • Step 1: Spot the resources in this query. DBpedia Spotlight does just that (semantic information extraction)
        • Step 2: Identify the relationships between the resources
          • SME is an instance of the Organization class
          • The Organization class comes under the country Singapore
          • Growth rate is a property of the Sales class
          • Year is a class by itself
          • Majority is a subset of the Group class
        • Step 3: Use an NLP technique, syntactic analysis (Stanford Parser) followed by focus extraction, to understand the question. A syntactic parse tree is generated, followed by the access pattern
        • Step 4: Look for RDF triples that meet the criteria. 2010 is returned as the result!
    27. Summary
        • Four in-person discussion sessions with IDA, NIIT and SLA
        • Analysis of system specifications
        • Evaluation of four existing migration frameworks
        • Prototyping with six core Linked Data tools
        • Tools by step
          • Object Modeling: Concept Map
          • Ontology Modeling: Protégé
          • URI Naming: Pubby
          • RDF Creation: Google Refine, RDF Views, RDF Sponger
          • External Linking: SILK, LIMES
          • Dataset Publication: Virtuoso Universal Server, Linked Data API
    28. Summary
        • Applicability of the framework to Singapore Government Data
        • Issues identified in the existing Data Eco System
        • Recommended tools and best practices for each step
        • Launchpad for SG Linked Data implementation
        Final Thoughts…
        • ROI is not a key metric for Linked Data implementation
        • The benefits of moving to Linked Data are intangible and may not be immediately realizable
        • The volume of work is huge compared to traditional systems