1. Exploiting the Canadian Health Census Data as LOD
Ahmad Chan
3449014
Topics in Web Science
CS3773
Winter Term 2013
2. Motivation
●Government-collected statistical data (census
data) contains important information.
●It can be exploited for needs assessment, for
shaping new policies, and for accountability.
●There is an emerging worldwide trend of releasing
government information.
●Inspired by www.data.gov, www.data.gov.uk, and
http://opendata.ie/
*
7. Available open data in Canada
http://www.opendata.ca
http://www.opendatabc.ca
http://www.openhalton.ca
http://openhamilton.ca
http://www.opendatalondon.ca
http://www.opendataottawa.ca
http://www.opendatawr.ca
http://gatineauouverte.org
http://montrealouvert.net
http://capitaleouverte.org
http://opendatask.ca
http://civicaccess.ca/
*
8. Problem Statement
●Available government data is unstructured and redundant, and is
published as text files, Excel sheets, etc.
●Analyzing the data and extracting comparative information is quite
challenging.
●Valuable information can be derived from health census data for
critical decision making.
●There is a need for instantly consumable datasets to encourage
data usability.
●Interoperability, scalability, and usability cannot be
achieved with conventional data formats.
*
9. Detailed Goals for Project
●Acquiring and refining the public health census data
●Transforming it into RDF (Resource Description Framework), the
flexible and interoperable standard format recommended by the W3C
●Integrating publicly available, well-known semantic vocabularies
●Tuning the RDFized data according to LOD standards
●Providing a graphical front end for querying (SPARQL endpoint)
●Configuring the linked open data explorer
●Hooking it up with the LOD cloud
*
10. What is Open Government Data (OGD), actually?
*
11. Some concepts and definitions
Open data
Open data is data which you can use more or less freely. It is generally
available on the web and uses non-proprietary formats such as XML and
CSV.
Linked Data
Linked data is data which contains links to other datasets. Generally
these will use URIs which are resolvable to discover more facts.
RDF (Resource Description Framework)
RDF is a data model for creating interoperable data. It has a
number of serialization formats for exchanging this data; the most
common is RDF/XML.
Linked Open Data (LOD) is a common term for open data that is also
linked, and it is usually published in RDF too. The key thing is not to be
put off by the linking. Add links when they provide value to your data and
help people using your data (yourself included) do more with it.
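As a rough illustration of the definitions above, the sketch below writes two statements about one resource as RDF triples in N-Triples syntax, one of them a link to an external dataset. The `example.org` URI and the choice of properties are assumptions made for illustration; they are not taken from the project data.

```python
# Minimal sketch: RDF statements as (subject, predicate, object) triples,
# serialized by hand as N-Triples. All example.org URIs are hypothetical.

def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"

def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def to_ntriples(triples):
    """Render already-formatted terms as one N-Triples line per triple."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

province = iri("http://example.org/province/NB")  # hypothetical resource
triples = [
    # A literal-valued statement (plain data).
    (province, iri("http://www.w3.org/2000/01/rdf-schema#label"),
     literal("New Brunswick")),
    # A link to another dataset: this is what makes the data "linked".
    (province, iri("http://www.w3.org/2002/07/owl#sameAs"),
     iri("http://dbpedia.org/resource/New_Brunswick")),
]

print(to_ntriples(triples))
```

The second triple is the only part that makes this "linked" data; without it the file would still be valid RDF and still open data.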
*
14. Data Acquisition Resources
Dataset | Detail | Source
Breastfeeding Practices | Breastfeeding practices, by age group of mothers, recent mothers aged 15 to 49, Canada and provinces | http://www.data.gc.ca
Breast Cancer Survival | Five-year survival estimates for breast cancer cases, by age group and sex, population aged 15 to 99 | http://www.statcan.gc.ca/
Treatable Diseases Death | Deaths due to medically treatable diseases, by selected causes of death, selected age groups and sex | http://www.statcan.gc.ca/
Smoking Practices | Changes in smoking between 1994/1995 and 2010/2011, household population aged 12 and over who reported on smoking every 2 years | http://www.data.gc.ca
Family Doctor Satisfaction | Patient satisfaction with most recent family doctor or other physician care received in past 12 months | http://www.statcan.gc.ca/
Kids Physical Activities | Children's participation in physical activities, in hours per week, by sex, household population aged 6 to 11 | http://www.statcan.gc.ca/
Health Indicator | Health indicator profile, annual estimates, by age group and sex, Canada, provinces | http://www.data.gc.ca
*
15. Data Manipulation
●Data prescreening: manual cleaning to obtain quality data
●Deep data cleaning: deleting/merging columns
●Initial transformation: from unstructured to relational
●Tool used: Google Refine (desktop-based version)
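The project did this cleaning interactively in Google Refine. Purely as an illustration of the same kinds of steps (trimming cells, merging two columns, dropping an empty column), here is a minimal Python sketch. The column names and sample rows are invented for the example and do not come from the actual datasets.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, an empty column, split age bounds.
RAW = """Province , Age from, Age to, Notes, Value
 New Brunswick ,20,24,, 1200
Ontario,25,29,,  5400 
"""

def clean(raw_text):
    """Return cleaned rows as a list of dicts."""
    reader = csv.DictReader(io.StringIO(raw_text))
    rows = []
    for record in reader:
        # Prescreening: trim whitespace in headers and cells.
        row = {k.strip(): (v or "").strip() for k, v in record.items()}
        # Merging columns: combine the two age bounds into one age group.
        row["Age group"] = f'{row.pop("Age from")} to {row.pop("Age to")} years'
        rows.append(row)
    # Deleting columns: drop any column that is empty in every row.
    empty = {k for k in rows[0] if all(not r[k] for r in rows)} if rows else set()
    return [{k: v for k, v in r.items() if k not in empty} for r in rows]

for row in clean(RAW):
    print(row)
```

The result is a uniform, relational-style table (one dict per row with the same keys), which is the shape the later RDF mapping step needs.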
*
16. RDF Foundry
●Transformation of the relational database to RDF
●Choosing the appropriate vocabularies
●Defining your own vocabularies
●Programmatic mapping (D2R does not map according to
your requirements)
●I tried D2R, Triplify, and Virtuoso (all three well-known
tools); all have limitations
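One way to read "programmatic mapping" is a small script that turns each relational row into a set of triples using the chosen vocabularies. The sketch below is only an assumption about what such a mapping could look like, not the project's actual code: the base URI, the column names, the sample row, and the column-to-property table are hypothetical, while the property names mirror those used in the sample queries of this deck.

```python
# Hypothetical base URI for minted resources; not the project's real one.
BASE = "http://example.org/hc2lod/record/"

NAMESPACES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/ontology/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumed mapping from relational column names to prefixed properties.
COLUMN_TO_PROPERTY = {
    "sex": "foaf:gender",
    "age_group": "foaf:age",
    "province": "dcterms:Location",
    "year": "dbpedia:year",
    "value": "dbpedia:statisticValue",
}

def expand(prefixed):
    """Expand a prefixed name such as foaf:age to a full IRI."""
    prefix, local = prefixed.split(":", 1)
    return NAMESPACES[prefix] + local

def row_to_triples(row_id, row):
    """Yield one N-Triples line per non-empty mapped column of a row."""
    subject = f"<{BASE}{row_id}>"
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value in (None, ""):
            continue  # skip missing cells rather than emit empty literals
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        yield f'{subject} <{expand(prop)}> "{escaped}" .'

# Invented sample row, for illustration only.
row = {"sex": "Female", "age_group": "40 to 49 years", "year": "2004", "value": "88"}
for line in row_to_triples(1, row):
    print(line)
```

Keeping the mapping in a plain table like `COLUMN_TO_PROPERTY` makes the vocabulary choices explicit and easy to change, which is the kind of control the generic tools did not offer here.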
*
17. Semantic Vocabularies Used
Ontology/Vocabulary | Prefix | Namespace
FOAF (Friend of a Friend) | foaf | http://xmlns.com/foaf/0.1/
DBpedia Ontology | dbpedia | http://dbpedia.org/ontology/
Dublin Core | dc | http://purl.org/dc/elements/1.1/
Dublin Core Terms | dcterms | http://purl.org/dc/terms/
SIOC Ontology | sioc | http://rdfs.org/sioc/ns#
SKOS Ontology | skos | http://www.w3.org/2004/02/skos/core#
Time Ontology | time | http://www.w3.org/TR/owl-time/
Relationship Ontology | rel | http://vocab.org/relationship/
Biography Ontology | bio | http://vocab.org/bio/0.1/
Hc2lod Ontology | hc2lod | http://cbakerlab.unbsj.ca/ontologies/hc2lod.owl
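The sample queries in this deck use these prefixes without declaring them. Query front ends such as Snorql typically prepend configured prefixes, but a bare SPARQL endpoint needs explicit `PREFIX` lines. As a small sketch, the prefix table above can be turned into a header for a given query. The `rdf` entry is the standard RDF namespace and is added here only because the queries use `rdf:type`; the helper name `prefix_header` is made up for this example.

```python
import re

# Prefixes as listed in the table, plus the standard rdf namespace.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",  # not in the table
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/ontology/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "time": "http://www.w3.org/TR/owl-time/",  # as listed in the table
    "rel": "http://vocab.org/relationship/",
    "bio": "http://vocab.org/bio/0.1/",
    "hc2lod": "http://cbakerlab.unbsj.ca/ontologies/hc2lod.owl",
}

def prefix_header(query, prefixes=PREFIXES):
    """Return PREFIX lines for only the prefixes that appear in the query."""
    lines = [
        f"PREFIX {name}: <{namespace}>"
        for name, namespace in prefixes.items()
        # Match "name:" only when it is not the tail of a longer token.
        if re.search(rf"(?<![\w:]){re.escape(name)}:", query)
    ]
    return "\n".join(lines)

query = 'SELECT ?year WHERE { ?s foaf:gender "Female" . ?s dbpedia:year ?year . }'
print(prefix_header(query) + "\n" + query)
```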
*
19. Exposing & Integration
●At this stage, I configured Pubby and Snorql.
●Pubby is a widely used LOD explorer.
●Snorql is the SPARQL endpoint front end for querying.
●I uploaded the data files to CKAN, a registry of
LOD datasets, after getting permission from the
LOD cloud admins.
●Set up a GUI dashboard.
*
20. Some Sample queries
●SPARQL Query 1: Show the years and the number of breast cancer cases
reported as survivals among females aged 40 to 49 years.
SELECT DISTINCT ?year ?value
WHERE {
  ?patient foaf:gender "Female" .
  ?patient dbpedia:unitCost "Number of cases" .
  ?patient dbpedia:statisticValue ?value .
  ?patient dbpedia:year ?year .
  ?patient foaf:age "40 to 49 years" .
  ?patient rdf:type akt:person-being-visited .
}
ORDER BY DESC(?value)
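Outside the Snorql form, a query like this can also be sent over the SPARQL protocol as an HTTP GET request with a `query` parameter. The sketch below only builds such a request URL and makes no network call. The endpoint address is a placeholder, the query is a shortened variant, and the `PREFIX` line is added so the query is self-contained; all of these are assumptions, as is the helper name `build_request_url`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint; the real service URL of a deployment may differ.
ENDPOINT = "http://example.org/sparql"

QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?patient WHERE {
  ?patient foaf:gender "Female" .
  ?patient foaf:age "40 to 49 years" .
}
LIMIT 10"""

def build_request_url(endpoint, query):
    """Build a SPARQL protocol GET URL carrying the query text."""
    params = urlencode({"query": query})  # percent-encodes the whole query
    separator = "&" if "?" in endpoint else "?"
    return f"{endpoint}{separator}{params}"

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

A client would then fetch this URL, usually with an `Accept` header such as `application/sparql-results+json` to ask for JSON results.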
*
21. Some Sample queries
●SPARQL Query 2: Give the total number of breastfeeding mothers
from New Brunswick aged 20 to 24.
SELECT (COUNT(*) AS ?total)
WHERE {
  ?person dcterms:Location "New Brunswick" .
  ?person rdf:type bio:immediatelyPrecedingEvent .
  ?person foaf:age "20 to 24 years" .
}
*
22. Some Sample queries
●SPARQL Query 3: Show the number of deaths reported due to gallbladder and prostate cancer among
male patients across Canada during 2001 to 2003.
SELECT DISTINCT ?death ?cancerType
WHERE {
  ?person foaf:gender "Male" .
  ?person dbpedia:part ?cancerType .
  ?person dbpedia:statisticValue ?death .
  ?person dbpedia:year "2001 to 2003" .
  ?person rdf:type akt:Knowledge-Lifecycle .
  FILTER (?cancerType IN ("Gallbladder", "Prostate"))
}
LIMIT 50
*
23. Some Sample queries
●SPARQL Query 4: Display the age group of female individuals from New Brunswick with the highest reported
smoking value.
SELECT DISTINCT ?AgeGroup
WHERE {
  ?person rdf:type dbpedia:Activity .
  ?person foaf:gender "Female" .
  ?person dcterms:Location "New Brunswick" .
  {
    SELECT ?statval
    WHERE {
      ?person rdf:type dbpedia:Activity .
      ?person foaf:gender "Female" .
      ?person dcterms:Location "New Brunswick" .
      ?person dbpedia:statisticValue ?statval .
    }
    ORDER BY DESC(?statval)
    LIMIT 1
  }
  ?person dbpedia:statisticValue ?statval .
  ?person foaf:age ?AgeGroup .
}
*
26. Tips, Tricks and Resources
●Finding an appropriate existing RDF vocabulary:
http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
(Falcons semantic search engine)
●http://lov.okfn.org/dataset/lov/ (Linked Open
Vocabularies)
●http://datacatalogs.org/ (worldwide open data
sets)
●http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSRDF
(easy tool for LOD and open data)
*
27. Conclusions
●The goal was to transform raw health census data
into LOD and link it to the LOD cloud.
●Demo page: http://cbakerlab.unbsj.ca:8080/hc2lod/index1.jsp
●SPARQL endpoint: http://cbakerlab.unbsj.ca:8080/hc2lod/snorql/
●CH2LOD explorer: http://cbakerlab.unbsj.ca:8080/hc2lod/
●CKAN Data Hub entry: http://datahub.io/dataset/ch2lod
*
28. Next Steps / Future Work
●Extend with more datasets from the
health domain
●Try to define LOD quality metrics
●Integrate a visualization tool with the SPARQL
endpoint
●Add an additional layer for LODD
*
29. Critical Commentary
●Availability of open data
●Most available health census data is
redundant and incomplete
●No logical schema builder is available for LOD
●There are no hard-and-fast criteria for measuring
data quality (provenance issue)
●Lack of well-known vocabularies that
match your domain
*
34. References
1. Improving access to government through better use of the web (2009). URL
http://www.w3.org/TR/egov-improving/
2. C. Bizer, R. Cyganiak, T. Heath, How to publish linked data on the web. Retrieved February
11, 2013 from http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
3. S. Auer, S. Dietzold, J. Lehmann, S. Hellmann, D. Aumueller, Triplify: light-weight linked
data publication from relational databases. In: WWW '09: Proceedings of the 18th International
Conference on World Wide Web, ACM, New York, NY, USA (2009): pp. 621–630.
4. C. Bizer, A. Seaborne, D2RQ: treating non-RDF databases as virtual RDF graphs (2004).
5. O. Erling, I. Mikhailov, RDF support in the Virtuoso DBMS. Networked Knowledge -
Networked Media (2009): pp. 7–24.
6. J. Hendler, J. Holm, C. Musialek, G. Thomas, US Government Linked Open Data:
Semantic.data.gov, Intelligent Systems (2012), 27 (3): pp. 25–31.
7. F. Zhichun, P. Christen, M. Boot, Automatic Cleaning and Linking of Historical Census Data
Using Household Information, IEEE 11th International Conference on Data Mining
Workshops (ICDMW) (2011): pp. 413–420.
8. J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, Publishing open statistical data: the
Spanish census, Proceedings of the 12th Annual International Digital Government Research
Conference: Digital Government Innovation in Challenging Times (2011): pp. 20–25.
*