Exploiting the Canadian Health
     Census data as LOD




             Ahmad Chan
                3449014
         Topics in Web Science
                 CS3773
           Winter Term 2013
Motivation
● Government-collected statistical data (census data) contains important information.
● It can be exploited for needs assessment, to shape new policies, and for accountability.
● There is an emerging worldwide trend of releasing government information.
● Inspired by www.data.gov, www.data.gov.uk and http://opendata.ie/

Available open data in Canada
http://www.opendata.ca
http://www.opendatabc.ca
http://www.openhalton.ca
http://openhamilton.ca
http://www.opendatalondon.ca
http://www.opendataottawa.ca
http://www.opendatawr.ca
http://gatineauouverte.org
http://montrealouvert.net
http://capitaleouverte.org
http://opendatask.ca
http://civicaccess.ca/

Problem Statement
● Available government data is unstructured and redundant, and is published as text files, Excel sheets, etc.

● Analyzing the data and extracting comparative information is quite challenging.

● Valuable information for critical decision making can be derived from health census data.

● Instantly consumable datasets are needed to encourage data use.

● Interoperability, scalability and usability cannot be achieved with conventional data formats.

Detailed Goals for Project
● Acquiring and refining the public health census data

● Transforming it into RDF (Resource Description Framework), the flexible and interoperable standard format recommended by the W3C

● Integrating publicly available, well-known semantic vocabularies

● Tuning the RDFized data according to LOD standards

● Providing a graphical front end for querying (SPARQL endpoint)

● Configuring the linked open data explorer

● Linking the dataset to the LOD cloud

What is Open Government Data (OGD), actually?

Some concepts and definitions
   Open data
   Open data is data which you can use more or less freely. It is generally available on the web and uses non-proprietary formats such as XML and CSV.
   Linked Data
   Linked data is data which contains links to other datasets. Generally these links use URIs which can be resolved to discover more facts.
   RDF (Resource Description Framework)
   RDF is a useful data structure for creating interoperable data. It has a number of file formats for exchanging this data; the most common is RDF/XML.
   Linked Open Data (LOD)
   Linked Open Data is a common term for open data that is also linked, and it is usually published in RDF too. The key thing is not to be put off by the linking: add links when they provide value to your data and help people using it (yourself included) do more with it.

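To make the RDF and Linked Data definitions concrete, here is a minimal sketch in Python (standard library only) that writes two statements as N-Triples lines. The `example.org` subject URI is a made-up placeholder and not a URI from the actual dataset; the choice of `dcterms:spatial` and of the DBpedia resource is likewise only an illustration of how a link to another dataset looks.

```python
# Minimal sketch: an RDF statement is a (subject, predicate, object) triple.
# The example.org subject URI is a hypothetical placeholder, not a URI from the real dataset.

def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"

def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def ntriple(subject, predicate, obj):
    """Serialize one triple as a single N-Triples line."""
    return " ".join([subject, predicate, obj, "."])

observation = uri("http://example.org/hc2lod/observation/1")

# A plain-literal statement: the observation is about females.
gender = ntriple(observation, uri("http://xmlns.com/foaf/0.1/gender"), literal("Female"))

# A link to another dataset: the object is a resolvable DBpedia URI instead of a string.
place = ntriple(
    observation,
    uri("http://purl.org/dc/terms/spatial"),
    uri("http://dbpedia.org/resource/New_Brunswick"),
)

print(gender)
print(place)
```

The second statement is what turns plain RDF into linked data: a consumer can dereference the DBpedia URI to discover more facts about the province.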
Methodology




Data Acquisition Resources
Dataset                      Detail                                                                        Source

Breastfeeding Practices      Breastfeeding practices, by age group of mothers, recent mothers aged 15 to   http://www.data.gc.ca
                             49, Canada and provinces

Breast Cancer Survival       Five-year survival estimates for breast cancer cases, by age group and sex,   http://www.statcan.gc.ca/
                             population aged 15 to 99

Treatable Diseases Death     Deaths due to medically treatable diseases, by selected causes of death,      http://www.statcan.gc.ca/
                             selected age groups and sex

Smoking Practices            Changes in smoking between 1994/1995 and 2010/2011, household                 http://www.data.gc.ca
                             population aged 12 and over who reported on smoking every 2 years

Family Doctor Satisfaction   Patient satisfaction with most recent family doctor or other physician care   http://www.statcan.gc.ca/
                             received in past 12 months

Kids Physical Activities     Children's participation in physical activities, in hours per week, by sex,   http://www.statcan.gc.ca/
                             household population aged 6 to 11

Health Indicator             Health indicator profile, annual estimates, by age group and sex, Canada,     http://www.data.gc.ca
                             provinces

Data Manipulation
● Data prescreening
    ● Manual cleaning to obtain quality data
● Deep data cleaning
    ● Deleting/merging columns
● Initial transformation
    ● From unstructured to relational
● Tool used
    ● Google Refine (desktop version)

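The cleaning steps listed for this stage were done in Google Refine. Purely as an illustration of the same ideas (prescreening, deleting a column, merging columns, ending up with relational rows), here is a small Python sketch. The column names and all values in `RAW` are invented for the example and do not come from the real census files.

```python
import csv
import io

# Hypothetical raw extract: a footnote column to delete and a low/high age pair to merge.
RAW = """Geography,Age low,Age high,Sex,Value,Footnote
New Brunswick,20,24,Females,1200,E
New Brunswick,25,29,Females,,F
Canada,20,24,Females,45000,
"""

def clean_rows(text):
    """Drop rows without a value, delete the footnote column, merge the age columns."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        if not row["Value"].strip():
            continue  # prescreening: skip incomplete observations
        cleaned.append({
            "geography": row["Geography"].strip(),
            "age_group": "{} to {} years".format(row["Age low"], row["Age high"]),
            "sex": row["Sex"].strip(),
            "value": int(row["Value"]),
        })
    return cleaned

rows = clean_rows(RAW)
for r in rows:
    print(r)
```

Each resulting dictionary corresponds to one row of a relational table, which is the shape the later RDF mapping step expects.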
RDF Foundry
● Transformation of the relational database to RDF
    ● Choosing the appropriate vocabularies
    ● Defining your own vocabularies
    ● Programmatic mapping (D2R does not map according to your requirements)
    ● I tried D2R, Triplify and Virtuoso (all three well-known tools); all have limitations

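As a rough idea of what a programmatic mapping from relational rows to RDF can look like, here is a minimal Python sketch. It is not the project's actual mapping code: the base URI, the table name and the example row values are hypothetical, and the column-to-property choices simply mirror the properties that appear in the sample queries of this deck.

```python
# Sketch of a programmatic row-to-RDF mapping; not the project's actual mapping code.
FOAF = "http://xmlns.com/foaf/0.1/"
DBO = "http://dbpedia.org/ontology/"
BASE = "http://example.org/hc2lod/"  # hypothetical base URI

# Column name -> predicate URI. Property choices mirror the sample queries in this deck.
COLUMN_TO_PREDICATE = {
    "gender": FOAF + "gender",
    "age": FOAF + "age",
    "year": DBO + "year",
    "value": DBO + "statisticValue",
}

def row_to_triples(table, row_id, row):
    """Return N-Triples lines for one relational row (all values kept as plain literals)."""
    subject = "<{}{}/{}>".format(BASE, table, row_id)
    lines = []
    for column, predicate in COLUMN_TO_PREDICATE.items():
        if column in row and row[column] != "":
            text = str(row[column]).replace("\\", "\\\\").replace('"', '\\"')
            lines.append('{} <{}> "{}" .'.format(subject, predicate, text))
    return lines

# Invented example row; the numbers are illustrative only.
example_row = {"gender": "Female", "age": "40 to 49 years", "year": "2004", "value": "87"}
triples = row_to_triples("breast_cancer_survival", 1, example_row)
print("\n".join(triples))
```

Writing the mapping by hand like this gives full control over subject URIs and vocabulary choice, which is the kind of flexibility the generic tools were found to lack.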
Semantic Vocabularies Used
Ontology / Vocabulary      Prefix     Namespace

FOAF (Friend of a Friend)  foaf       http://xmlns.com/foaf/0.1/
DBpedia Ontology           dbpedia    http://dbpedia.org/ontology/
Dublin Core                dc         http://purl.org/dc/elements/1.1/
Dublin Core Terms          dcterms    http://purl.org/dc/terms/
SIOC Ontology              sioc       http://rdfs.org/sioc/ns#
SKOS Ontology              skos       http://www.w3.org/2004/02/skos/core#
Time Ontology              time       http://www.w3.org/TR/owl-time/
Relationship Ontology      rel        http://vocab.org/relationship/
Biography Ontology         bio        http://vocab.org/bio/0.1/
Hc2lod Ontology            hc2lod     http://cbakerlab.unbsj.ca/ontologies/hc2lod.owl

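A prefix table like the one above is typically used in two ways: to generate the `PREFIX` header of a SPARQL query and to expand a short name such as `foaf:gender` into a full URI. The sketch below shows both in Python. The helper names `prefix_header` and `expand` are made up for this example, and only part of the table is copied; the `rdf` and `akt` prefixes used in the sample queries are not listed in the table and are therefore left out.

```python
# Part of the prefix table from the slide, copied as-is.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/ontology/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

def prefix_header(prefixes, used=None):
    """Build SPARQL PREFIX declarations, for all prefixes or only the ones in `used`."""
    names = sorted(prefixes) if used is None else used
    return "\n".join("PREFIX {}: <{}>".format(name, prefixes[name]) for name in names)

def expand(curie, prefixes):
    """Expand a prefixed name such as 'foaf:gender' into a full URI."""
    prefix, _, local = curie.partition(":")
    return prefixes[prefix] + local

header = prefix_header(PREFIXES, used=["foaf", "dcterms"])
print(header)
print(expand("foaf:gender", PREFIXES))
```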
RDBMS to RDF Mapping (a view)




Exposing & Integration
● At this stage, I configured Pubby and Snorql.
● Pubby is a well-known LOD explorer.
● Snorql provides the front end to the SPARQL endpoint for querying.
● I uploaded the data files to CKAN, the registry for LOD datasets, after getting permission from the LOD cloud admins.
● I set up a GUI dashboard.

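Besides the Snorql web form, a SPARQL endpoint can usually be queried by any HTTP client, because the SPARQL Protocol accepts a query in the `query` parameter of a GET request. The Python sketch below only builds such a request and does not send it. The endpoint URL is a hypothetical placeholder, since the slides give the Snorql page address but not the raw service URL.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/hc2lod/sparql"  # hypothetical endpoint URL

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request for a SELECT query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON results format via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

QUERY = 'SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/gender> "Female" } LIMIT 5'
request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# To run it against a live endpoint: urllib.request.urlopen(request).read()
```

Whether a given server honours the `Accept` header or needs a server-specific output parameter instead depends on its configuration, so that part should be checked against the endpoint actually deployed.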
Some Sample queries

● SPARQL Query 1: Show the years and the number of breast cancer patients reported as survivors among females aged 40 to 49 years.

      # Prefixes (foaf, dbpedia, rdf, akt) are assumed to be declared by the query front end.
      SELECT DISTINCT ?year ?value WHERE {
        ?patient foaf:gender "Female" .
        ?patient dbpedia:unitCost "Number of cases" .
        ?patient dbpedia:statisticValue ?value .
        ?patient dbpedia:year ?year .
        ?patient foaf:age "40 to 49 years" .
        ?patient rdf:type akt:person-being-visited .
      }
      ORDER BY DESC(?value)

Some Sample queries
● SPARQL Query 2: Give the total number of breastfeeding mothers from the province of New Brunswick who are aged 20 to 24.
    # Standard SPARQL 1.1 requires the aggregate to be bound to a variable.
    SELECT (COUNT(*) AS ?total)
    WHERE {
      ?person dcterms:Location "New Brunswick" .
      ?person rdf:type bio:immediatelyPrecedingEvent .
      ?person foaf:age "20 to 24 years" .
    }

Some Sample queries
● SPARQL Query 3: Show the number of deaths reported due to gallbladder and prostate cancer among male patients across Canada during 2001 to 2003.

    SELECT DISTINCT ?death ?cancerType
    WHERE {
      ?person foaf:gender "Male" .
      # Match either cancer type (a record is assumed to carry one dbpedia:part value).
      ?person dbpedia:part ?cancerType .
      FILTER (?cancerType IN ("Gallbladder", "Prostate"))
      ?person dbpedia:statisticValue ?death .
      # Bind the period to the same record rather than to an unrelated variable.
      ?person dbpedia:year "2001 to 2003" .
      ?person rdf:type akt:Knowledge-Lifecycle .
    }
    LIMIT 50

Some Sample queries
● SPARQL Query 4: Display the age group of female individuals from the province of New Brunswick with the highest reported smoking value.

SELECT DISTINCT ?AgeGroup
WHERE {
  ?person rdf:type dbpedia:Activity .
  ?person foaf:gender "Female" .
  ?person dcterms:Location "New Brunswick" .
  # Subquery: the highest statistic value among the matching records.
  {
    SELECT ?statval
    WHERE {
      ?person rdf:type dbpedia:Activity .
      ?person foaf:gender "Female" .
      ?person dcterms:Location "New Brunswick" .
      ?person dbpedia:statisticValue ?statval .
    }
    ORDER BY DESC(?statval)
    LIMIT 1
  }
  ?person dbpedia:statisticValue ?statval .
  ?person foaf:age ?AgeGroup .
}

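
The pattern in Query 4 (find the top value in a subquery, then join back on that value to get the age group) can be mirrored in a few lines of Python, which may help when checking what the query should return. The records below are invented and only shaped like the smoking dataset; the function name `top_age_groups` is hypothetical.

```python
# Made-up observations shaped like the smoking dataset; values are illustrative only.
observations = [
    {"gender": "Female", "location": "New Brunswick", "age": "12 to 19 years", "value": 9.5},
    {"gender": "Female", "location": "New Brunswick", "age": "20 to 34 years", "value": 24.1},
    {"gender": "Female", "location": "New Brunswick", "age": "35 to 44 years", "value": 24.1},
    {"gender": "Male", "location": "New Brunswick", "age": "20 to 34 years", "value": 30.2},
    {"gender": "Female", "location": "Ontario", "age": "20 to 34 years", "value": 27.0},
]

def top_age_groups(rows, gender, location):
    """Return the age group(s) whose value equals the maximum for the given filters."""
    matching = [r for r in rows if r["gender"] == gender and r["location"] == location]
    if not matching:
        return []
    # Inner step (the subquery): ORDER BY DESC(?statval) LIMIT 1
    top_value = max(r["value"] for r in matching)
    # Outer step: join back on the value and return the distinct age groups
    return sorted({r["age"] for r in matching if r["value"] == top_value})

print(top_age_groups(observations, "Female", "New Brunswick"))
```

Because the outer step joins on the value rather than on a single record, tied age groups are all returned, which is also how the SPARQL version behaves.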
Demo Screenshots




Tools used

vocabulary publishing platform for the Web of Data


● SNORQL
● Pubby
● Joseki
● Jena
● JSP

Tips, Tricks and Resources
● Finding an appropriate available RDF vocabulary: http://ws.nju.edu.cn/falcons/objectsearch/index.jsp (Falcons semantic engine)
● http://lov.okfn.org/dataset/lov/ (Linked Open Vocabularies)
● http://datacatalogs.org/ (worldwide open data sets)
● http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSRDF (easy tool for LOD and open data)

Conclusions
● The goal was to transform the raw health census data into LOD and link it with the LOD cloud.
● Demo page is available at http://cbakerlab.unbsj.ca:8080/hc2lod/index1.jsp
● SPARQL endpoint: http://cbakerlab.unbsj.ca:8080/hc2lod/snorql/
● CH2LOD explorer: http://cbakerlab.unbsj.ca:8080/hc2lod/
● CKAN data hub entry for the LOD dataset: http://datahub.io/dataset/ch2lod

Next Steps / Future Work
● Extend the work with more datasets from the health domain
● Try to define LOD quality metrics
● Integrate the visualization tool with the SPARQL endpoint
● Add an additional layer for LODD

Critical Commentary
● Availability of open data
● Most of the available health census data is redundant and incomplete
● No logical schema builder is available for LOD
● There are no hard-and-fast criteria for measuring data quality (provenance issue)
● Lack of well-known vocabularies that match your domain

Interesting Facts




           Facts derived from Health census data
Thanks
Any questions?

AN Intelligent Realtime multiple vessel collision risk assessment system
 
Type-2 Fuzzy Ontology
Type-2 Fuzzy OntologyType-2 Fuzzy Ontology
Type-2 Fuzzy Ontology
 

Canadian health census to lod

  • 1. Exploiting the Canadian Health Census Data as LOD. Ahmad Chan, 3449014. Topics in Web Science (CS3773), Winter Term 2013
  • 2. Motivation ● Government-collected statistical data (census data) contains important information. ● It can be exploited for needs assessment, to shape new policies, and for accountability. ● There is an emerging worldwide trend of releasing government information. ● Inspired by www.data.gov, www.data.gov.uk and http://opendata.ie/
  • 7. Available open data in Canada: http://www.opendata.ca http://www.opendatabc.ca http://www.openhalton.ca http://openhamilton.ca http://www.opendatalondon.ca http://www.opendataottawa.ca http://www.opendatawr.ca http://gatineauouverte.org http://montrealouvert.net http://capitaleouverte.org http://opendatask.ca http://civicaccess.ca/
  • 8. Problem Statement ● Available government data is unstructured and redundant, and is published as text files, Excel sheets, etc. ● Analysing the data and extracting comparative information is quite challenging. ● Valuable information for critical decision making can be derived from health census data. ● Instantly consumable datasets are needed to encourage data use. ● Interoperability, scalability and usability cannot be achieved with conventional data formats.
  • 9. Detailed Goals for Project ● Acquiring and refining the public health census data ● Transforming it into RDF (Resource Description Framework), the flexible and interoperable standard format recommended by the W3C ● Integrating publicly available, well-known semantic vocabularies ● Tuning the RDFized data according to LOD standards ● Providing a graphical front end for querying (SPARQL endpoint) ● Configuring the linked open data explorer ● Hooking it up with the LOD cloud
  • 10. What is Open Government Data (OGD), actually?
  • 11. Some concepts and definitions ● Open data: data which you can use more or less freely. It is generally available on the web and uses non-proprietary formats such as XML and CSV. ● Linked data: data which contains links to other datasets. Generally these links use URIs which can be resolved to discover more facts. ● RDF (Resource Description Framework): a useful data structure for creating interoperable data. It has a number of file formats for exchanging this data; the most common is RDF/XML. ● Linked Open Data (LOD): a common term for data that is both open and linked, and which is usually in RDF too. The key thing is not to be put off by the linking: add links when they provide value to your data and will help people using your data (yourself included) do more with it.
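The definitions on slide 11 can be made concrete with a minimal sketch. It is an illustration only, not the project's code: it uses the line-based N-Triples syntax rather than RDF/XML because it is easy to produce by hand, and the `http://example.org/hc2lod/` namespace and the record and province identifiers are hypothetical. The second triple shows the "linked" part, where a URI points into another dataset (DBpedia).

```python
# Hypothetical namespace and resource names, used only for illustration.
EX = "http://example.org/hc2lod/"
FOAF = "http://xmlns.com/foaf/0.1/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # A literal-valued fact about a record in our own dataset.
    (iri(EX + "record/1"), iri(FOAF + "gender"), literal("Female")),
    # A link to another dataset: the same province as described by DBpedia.
    (iri(EX + "province/NB"), iri(OWL_SAME_AS),
     iri("http://dbpedia.org/resource/New_Brunswick")),
]

print(to_ntriples(triples))
```

A consumer that resolves the DBpedia URI can discover further facts about the province without this dataset having to repeat them.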
  • 12. Some concepts and definitions
  • 14. Data Acquisition Resources (dataset, detail, source)
      ● Breastfeeding Practices: breastfeeding practices, by age group of mothers, recent mothers aged 15 to 49, Canada and provinces. Source: http://www.data.gc.ca
      ● Breast Cancer Survival: five-year survival estimates for breast cancer cases, by age group and sex, population aged 15 to 99. Source: http://www.statcan.gc.ca/
      ● Treatable Diseases Death: deaths due to medically treatable diseases, by selected causes of death, selected age groups and sex. Source: http://www.statcan.gc.ca/
      ● Smoking Practices: changes in smoking between 1994/1995 and 2010/2011, household population aged 12 and over who reported on smoking every 2 years. Source: http://www.data.gc.ca
      ● Family Doctor Satisfaction: patient satisfaction with most recent family doctor or other physician care received in past 12 months. Source: http://www.statcan.gc.ca/
      ● Kids Physical Activities: children's participation in physical activities, in hours per week, by sex, household population aged 6 to 11. Source: http://www.statcan.gc.ca/
      ● Health Indicator: health indicator profile, annual estimates, by age group and sex, Canada, provinces. Source: http://www.data.gc.ca
  • 15. Data Manipulation ● Data prescreening: manual cleaning to obtain quality data ● Deep data cleaning: deleting/merging columns ● Initial transformation: from unstructured to relational ● Tool used: Google Refine (desktop version)
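The cleaning steps on slide 15 were carried out in Google Refine. The sketch below is only a stand-in that shows the same kind of operations (trimming, deleting a column, merging columns, skipping incomplete rows) with Python's standard `csv` module. The column names and values are invented for illustration and do not come from the actual census files.

```python
import csv
import io

# Hypothetical raw extract; column names and values are invented for illustration.
RAW = """Geography,Age group,Sex,Footnote,Value
New Brunswick, 20 to 24 years ,Females,1,
New Brunswick,20 to 24 years,Females,1,61.5
Canada,25 to 29 years,Females,,70.2
"""


def clean_rows(text):
    """Trim whitespace, drop an unwanted column, and skip rows with no value."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        row = {key: value.strip() for key, value in row.items()}
        row.pop("Footnote", None)  # delete a column that carries no data
        if not row["Value"]:       # skip incomplete rows
            continue
        # Merge two columns into a single descriptive field.
        row["Group"] = row.pop("Sex") + ", " + row.pop("Age group")
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

The cleaned rows have a fixed set of columns, which is what makes the later step from relational data to RDF mechanical.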
  • 16. RDF Foundry ● Transformation of the relational database to RDF ● Choosing the appropriate vocabularies ● Defining your own vocabularies ● Programmatic mapping (D2R does not map according to your requirements) ● I tried D2R, Triplify and Virtuoso (all three are well-known tools); all have limitations
  • 17. Semantic Vocabularies Used (ontology/vocabulary, prefix, namespace)
      ● FOAF (Friend of a Friend): foaf, http://xmlns.com/foaf/0.1/
      ● DBpedia Ontology: dbpedia, http://dbpedia.org/ontology/
      ● Dublin Core: dc, http://purl.org/dc/elements/1.1/
      ● Dublin Core Terms: dcterms, http://purl.org/dc/terms/
      ● SIOC Ontology: sioc, http://rdfs.org/sioc/ns#
      ● SKOS Ontology: skos, http://www.w3.org/2004/02/skos/core#
      ● Time Ontology: time, http://www.w3.org/TR/owl-time/
      ● Relationship Ontology: rel, http://vocab.org/relationship/
      ● Biography Ontology: bio, http://vocab.org/bio/0.1/
      ● Hc2lod Ontology: hc2lod, http://cbakerlab.unbsj.ca/ontologies/hc2lod.owl
  • 18. RDBMS to RDF Mapping (a view)
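The mapping view on slide 18 is an image that did not survive in this transcript, so the following is only a sketch of what a programmatic relational-to-RDF mapping can look like. The namespaces are those listed on slide 17 and the property names are the ones that appear in the sample queries; the column names, the base URI and the function names are assumptions made for illustration, not the author's actual mapping.

```python
# Namespaces as listed on slide 17.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/ontology/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical column-to-property mapping; the column names are invented.
COLUMN_MAP = {
    "sex": "foaf:gender",
    "age_group": "foaf:age",
    "province": "dcterms:Location",
    "value": "dbpedia:statisticValue",
}

BASE = "http://example.org/hc2lod/record/"  # hypothetical base URI


def expand(curie):
    """Turn a prefixed name such as foaf:age into a full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def row_to_triples(row_id, row):
    """Emit one (subject, predicate, literal) triple per mapped, non-empty column."""
    subject = BASE + str(row_id)
    return [
        (subject, expand(prop), row[column])
        for column, prop in COLUMN_MAP.items()
        if row.get(column)
    ]


row = {"sex": "Female", "age_group": "20 to 24 years",
       "province": "New Brunswick", "value": "61.5"}
for triple in row_to_triples(1, row):
    print(triple)
```

Keeping the column-to-property table as data rather than code is one way to retain the control that generic tools were reported not to offer, since each dataset can supply its own table.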
  • 19. Exposing & Integration ● At this stage, I configured Pubby and Snorql. ● Pubby is a well-known LOD explorer. ● Snorql is the SPARQL endpoint for querying. ● I uploaded the data files to CKAN, a registry of LOD, after getting permission from the LOD cloud admins. ● I set up a GUI dashboard.
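Besides the browser front end, a SPARQL endpoint can be queried by any HTTP client. The sketch below shows how a query is encoded as a GET request under the SPARQL Protocol, which passes the query text in a `query` parameter. The endpoint address is hypothetical, since the address of the underlying service is not given on the slide, and the query is a generic example rather than one of the project's own.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address; the real service URL is not given on the slide.
ENDPOINT = "http://example.org/hc2lod/sparql"

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?age WHERE { ?record foaf:age ?age . } LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode a query as a SPARQL Protocol GET request (the `query` parameter)."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching such a URL with an `Accept: application/sparql-results+json` header would typically return the result bindings as JSON, provided the server supports that result format.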
  • 20. Some Sample queries ● SPARQL Query 1: Show the years and the number of breast cancer cases reported as survivors among females aged 40 to 49 years.
      SELECT DISTINCT ?year ?value
      WHERE {
        ?patient foaf:gender "Female" .
        ?patient dbpedia:unitCost "Number of cases" .
        ?patient dbpedia:statisticValue ?value .
        ?patient dbpedia:year ?year .
        ?patient foaf:age "40 to 49 years" .
        ?patient rdf:type akt:person-being-visited .
      }
      ORDER BY DESC(?value)
  • 21. Some Sample queries ● SPARQL Query 2: Give the total number of breastfeeding mothers from the province of New Brunswick who are aged 20 to 24.
      SELECT (COUNT(*) AS ?total)
      WHERE {
        ?person dcterms:Location "New Brunswick" .
        ?person rdf:type bio:immediatelyPrecedingEvent .
        ?person foaf:age "20 to 24 years" .
      }
  • 22. Some Sample queries ● SPARQL Query 3: Show the number of deaths reported due to gallbladder and prostate cancer among male patients across Canada during 2001 to 2003.
      SELECT DISTINCT ?death ?cancerType
      WHERE {
        ?person foaf:gender "Male" .
        ?person dbpedia:part ?cancerType .
        ?person dbpedia:statisticValue ?death .
        ?person dbpedia:year "2001 to 2003" .
        ?person rdf:type akt:Knowledge-Lifecycle .
        FILTER (?cancerType IN ("Gallbladder", "Prostate"))
      }
      LIMIT 50
  • 23. Some Sample queries ● SPARQL Query 4: Display the age group with the highest smoking rate among female individuals from the province of New Brunswick.
      SELECT DISTINCT ?AgeGroup
      WHERE {
        ?person rdf:type dbpedia:Activity .
        ?person foaf:gender "Female" .
        ?person dcterms:Location "New Brunswick" .
        {
          SELECT ?statval
          WHERE {
            ?person rdf:type dbpedia:Activity .
            ?person foaf:gender "Female" .
            ?person dcterms:Location "New Brunswick" .
            ?person dbpedia:statisticValue ?statval .
          }
          ORDER BY DESC(?statval)
          LIMIT 1
        }
        ?person dbpedia:statisticValue ?statval .
        ?person foaf:age ?AgeGroup .
      }
  • 25. Tools used ● vocabulary publishing platform for the Web of Data ● SNORQL ● Pubby ● Joseki ● Jena ● JSP
  • 26. Tips, Tricks and Resources ● Finding an appropriate available RDF vocabulary: http://ws.nju.edu.cn/falcons/objectsearch/index.jsp (Falcons semantic search engine) ● http://lov.okfn.org/dataset/lov/ (Linked Open Vocabularies) ● http://datacatalogs.org/ (worldwide open data sets) ● http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSRDF (easy tool for LOD and open data)
  • 27. Conclusions ● The goal was to transform the raw health census data to LOD and link it with the LOD cloud. ● Demo page is available at http://cbakerlab.unbsj.ca:8080/hc2lod/index1.jsp ● SPARQL endpoint: http://cbakerlab.unbsj.ca:8080/hc2lod/snorql/ ● CH2LOD explorer: http://cbakerlab.unbsj.ca:8080/hc2lod/ ● CKAN data hub of LOD: http://datahub.io/dataset/ch2lod
  • 28. Next Steps / Future Work ● Extend with more data sets relating to the health domain ● Try to define LOD quality metrics ● Integrate the visualization tool with the SPARQL endpoint ● Add an additional layer for LODD
  • 29. Critical Commentary ● Availability of open data ● Most of the available health census data is redundant and incomplete ● No logical schema builder is available for LOD ● There are no hard and fast criteria for measuring the quality of data (provenance issue) ● There is a lack of well-known vocabularies that match your domain
  • 30. Interesting Facts: facts derived from health census data
  • 34. References
      1. Improving access to government through better use of the web (2009). URL http://www.w3.org/TR/egov-improving/
      2. C. Bizer, R. Cyganiak, T. Heath, How to publish linked data on the web. Retrieved February 11, 2013 from http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
      3. S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, D. Aumueller, Triplify: light-weight linked data publication from relational databases. In: WWW '09: Proceedings of the 18th international conference on World wide web, ACM, New York, NY, USA, (2009): pp. 621–630.
      4. C. Bizer, A. Seaborne, D2RQ: treating non-RDF databases as virtual RDF graphs (2004).
      5. O. Erling, I. Mikhailov, RDF support in the Virtuoso DBMS. Networked Knowledge - Networked Media, (2009): pp. 7–24.
      6. J. Hendler, J. Holm, C. Musialek, G. Thomas, US Government Linked Open Data: Semantic.data.gov, Intelligent Systems, (2012). 27 (3): pp. 25–31.
      7. F. Zhichun, P. Christen, M. Boot, Automatic Cleaning and Linking of Historical Census Data Using Household Information, IEEE 11th International Conference on Data Mining Workshops (ICDMW), (2011): pp. 413–420.
      8. J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, Publishing open statistical data: the Spanish census, Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times, (2011): pp. 20–25.