A brief introduction to ontology-based data access (OBDA for short) and its core implementation. I also present a recent, simple benchmark of -ontop- against Semantika---two readily available OBDA implementations---in terms of query performance (details are in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons Attribution 3.0
Ontology-based data access: why it is so cool!
1. Ontology-Based Data Access: Why It Is So Cool!
Josef Hardi
josef.hardi@stanford.edu
September 4, 2015
Ontology-Based Data Access is a concept developed by Diego Calvanese and Mariano Rodriguez-Muro at the KRDB Research Centre of the Free University of Bozen-Bolzano.
2. Outline
● What is Ontology-based Data Access, or OBDA?
○ Motivation
○ System Black Box
○ Process Illustration
● Project -ontop- and Quest
● Experiment
○ Query Answering Performance
○ -ontop- vs Semantika
● Conclusion
● Q&A
3. Acknowledgement
Parts of the slides in this presentation are taken from
tutorial or lecture slides by:
Diego Calvanese,
Mariano Rodriguez-Muro, and
Martin Rezk
5. Consider a scenario
Diagram: data layer, data service, conceptual view.
Image source: (various sources)
What is Ontology-based Data Access?
6. Data Access Bottleneck
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
What is Ontology-based Data Access?
7. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC, corr. cStage: 1 - 6,
● Limited and Extensive for
SCLC, corr. cStage: 7 and 8.
There is “hidden logic” inside the table that is specific to the application and is not meant for querying the data!
8. Query Answering
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Name cStage
John 6
Bill 4
RESULT
select Name, cStage
from tbl_patient+2015
where Cell_type = false
and cStage >= 4;
9. Can we do it better?
Show me the name and stage status of all patients who have a large tumor at stage IIIa or higher.
Query Answering
10. Bridge the semantics
tbl_patient+2015
PatientId Name Cell_type cStage
1 Mary true 7
2 John false 6
3 Bill false 4
Cancer type is:
● NSCLC is when Cell_type is
false,
● SCLC is when Cell_type is
true.
Cancer stage is:
● I, II, III, IIIa, IIIb, IV for
NSCLC,
● Limited and Extensive for
SCLC.
Ontology diagram: an ISA hierarchy with the properties name, hasStage, and hasNeoplasm, linked to SNOMED-CT.
*SCLC = Small Cell Lung Cancer, NSCLC = Non-Small Cell Lung Cancer
Query Answering
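To make this bridge concrete, here is a minimal sketch of the decoding a mapping has to capture, written as a plain SQL view over the slide's example table; the view name is illustrative, and in a virtual OBDA setting this logic would live in the mapping rather than as a real view in the database:

-- Illustrative only: decode the application-specific Cell_type/cStage encoding
-- into the terms used by the ontology (NSCLC/SCLC and the stage labels).
CREATE VIEW v_patient_semantics AS
SELECT PatientId,
       Name,
       CASE WHEN Cell_type = false THEN 'NSCLC' ELSE 'SCLC' END AS cancer_type,
       CASE cStage
            WHEN 1 THEN 'I'    WHEN 2 THEN 'II'   WHEN 3 THEN 'III'
            WHEN 4 THEN 'IIIa' WHEN 5 THEN 'IIIb' WHEN 6 THEN 'IV'
            WHEN 7 THEN 'Limited' WHEN 8 THEN 'Extensive'
       END AS cancer_stage
FROM `tbl_patient+2015`;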
11. OBDA Answering
● (Data) Sources: the external and independent resources; existing organizational assets.
● Ontology: provides a unified common vocabulary; the conceptual view of the underlying data.
● Mappings: relate the terms in the ontology to a set of SQL views.
Image source: Rezk, Martin. Ontologies Ontop Databases http://www.slideshare.net/MartnRezk/slides-swat4-ls
Query Answering
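As a rough sketch (assumed view names; in -ontop- these views are virtual and are declared in the mapping file rather than created in the database), each ontology class could be backed by an SQL view over the source tables used in the unfolding illustration below:

-- Illustrative only: one SQL view per ontology term.
CREATE VIEW v_nurse          AS SELECT NurseId   AS id FROM tbl_nurse;            -- instances of Nurse
CREATE VIEW v_doctor         AS SELECT doc_id    AS id FROM tbl_doctor;           -- instances of Doctor
CREATE VIEW v_patient        AS SELECT pid       AS id FROM tbl_patient;          -- instances of Patient
CREATE VIEW v_cancer_patient AS SELECT PatientId AS id FROM `tbl_patient+2015`;   -- anyone who has a Neoplasm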
12. OBDA Answering Black Box
● Rewriting: Create a new query which is the expanded
version of the original query, using all the defined
inclusion assertions in the ontology.
● Unfolding: Substitute each part in the expanded query
with corresponding SQL views from the given mappings.
● Evaluation: Execute the complete SQL query on the target RDBMS.
Image source: Kontchakov, Roman, et.al. Ontology-based Data Access: Ontop of Databases. http://www.dcs.bbk.ac.uk/~roman/papers/ISWC13.pdf
Query Answering
13. OBDA Answering Illustration
Q: Show me all the Person in the hospital?
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm in the hospital?
Rewritten
14. Look where the source(s) are
(No source)
Q’: Show me
all the Person UNION
all the Nurse UNION
all the Doctor UNION
all the Patient UNION
anyone who has
Neoplasm
in the hospital?
Get the list from table Nurse
Get the list from table Doctor
Get the list from table Patient
Get the list from table Cancer Patient 2015
OBDA Answering Illustration
15. Substitute with SQL views
Q’: Show me
all the Person UNION
select NurseId from tbl_nurse UNION
select doc_id from tbl_doctor UNION
select pid from tbl_patient UNION
select PatientId from tbl_patient+2015
in the hospital?
OBDA Answering Illustration
Unfolded
16. Execute the SQL
select NurseId from tbl_nurse
UNION
select doc_id from tbl_doctor
UNION
select pid from tbl_patient
UNION
select PatientId from tbl_patient+2015
OBDA Answering Illustration
Evaluated
17. 42!
(Computational) Price to Pay
Query answering in the OBDA setting is:
● PTIME in the size of the ontology (efficiently tractable)
● AC⁰ in the size of the data (very efficiently tractable)
● NP-complete in the size of the query (exponential)
*Tractable problem: there exists an algorithm that will eventually terminate in a reasonable amount of time and return the result.
OBDA Answering Illustration
18.
19. -ontop- Project
● A platform to query relational databases using the SPARQL language,
● The implementation started in 2010,
● Supports several database systems, such as MySQL, PostgreSQL, H2, SQL Server, Oracle, and IBM DB2.
● Distributed under an open-source license.
● It is currently being developed within the context of the EU Optique project.
● Fantastic add-ons: efficient rewriting, query optimization, transitive queries, rule entailment, cross-linked datasets.
-ontop-
23. Berlin SPARQL Benchmark (BSBM)
● A benchmark suite built around an e-commerce domain.
○ A set of products is offered by different vendors and
customers are posting product reviews.
● Consists of 12 different queries, emulating
the search and navigation pattern of a
consumer looking for a product.
● A Query-Mix consists of 25 querying actions
that simulate a product search scenario.
● No inference.
Experiment
24. BSBM-100
● Dataset of 100 million triples,
● Transformed into relational db schema:
offer > 5.7 million rows
person > 147 thousand rows
producer > 5 thousand rows
product > 288 thousand rows
productfeature > 47 thousand rows
productfeatureproduct > 5.5 million rows
producttype > 2 thousand rows
producttypeproduct > 1.4 million rows
review > 2.8 million rows
vendor > 2 thousand rows
Experiment
25. Test Databases
● MySQL - v5.6
○ Vanilla
○ Optimized
■ CREATE INDEX
■ OPTIMIZE TABLE - ANALYZE
● PostgreSQL - v9.4.4
○ Vanilla
○ Optimized
■ CREATE INDEX
■ VACUUM TABLE - ANALYZE
Experiment
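For reference, a minimal sketch of what the "Optimized" variants amount to; the indexed column is illustrative, since the slides do not list which columns were indexed:

-- MySQL 5.6 (illustrative index; the actual indexed columns are not listed in the slides)
CREATE INDEX idx_review_product ON review (product);
OPTIMIZE TABLE review;      -- rebuilds the table storage
ANALYZE TABLE review;       -- refreshes key-distribution statistics for the optimizer

-- PostgreSQL 9.4
CREATE INDEX idx_review_product ON review (product);
VACUUM ANALYZE review;      -- reclaims dead space and updates planner statistics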
26. Test Machine
● MacBook Pro
○ OS X Yosemite 64-bit
○ Java 8 (build 1.8.0_51-b16)
○ Intel Core i7 3 GHz
○ Memory 16 GB
○ Flash storage
○ Direct connection - no network cost
Experiment
27. Benchmark Flow
for each obda-endpoint do:
for each dbms do:
for each dbms-variant do:
start endpoint;
start dbms;
repeat 2 times:
run ‘benchmark -runs 100 -w 10’;
stop dbms;
stop endpoint;
Experiment
29. Conclusion
● OBDA offers a non-invasive solution for existing (legacy) database systems, enabling a better data access service.
● A lot of interesting topics can be harvested from OBDA use case scenarios.
○ The health and clinical domain, perhaps?
● OBDA performance relies heavily on the
efficiency of the underlying data
infrastructure (both HW and SW).
32. Query Answering over Database
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
34. Query Answering over Ontology
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
36. Query Answering via Rewriting
Image source: Calvanese, Diego. Ontology-Based Data Access and Integration. https://www.essi.upc.edu/docs/slides-obda-2010-02-08
52. Ontop SQL Creation
SELECT
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST
(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`
FROM review QVIEW1
WHERE
(QVIEW1.`product` = '62033') AND
(QVIEW1.`producer` = '1245') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL
53. Semantika SQL Creation
SELECT `OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`
FROM `bsbm100`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245
55. Ontop SQL Creation
SELECT
1 AS `reviewQuestType`, NULL AS `reviewLang`, CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(CAST(QVIEW1.`publisher` AS CHAR
(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'), ')', '%29'), '[', '%5B'), ']', '%5D'),
',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F'), '/Review', REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(CAST(QVIEW1.`nr` AS CHAR(8000) CHARACTER SET utf8),' ', '%20'),'!', '%21'),'@', '%40'),'#', '%23'),'$', '%24'),'&', '%26'),'*', '%42'), '(', '%28'),
')', '%29'), '[', '%5B'), ']', '%5D'), ',', '%2C'), ';', '%3B'), ':', '%3A'), '?', '%3F'), '=', '%3D'), '+', '%2B'), '''', '%22'), '/', '%2F')) AS `review`,
3 AS `titleQuestType`, NULL AS `titleLang`, QVIEW1.`title` AS `title`,
10 AS `publishDateQuestType`, NULL AS `publishDateLang`, CAST(QVIEW1.`publishDate` AS CHAR(8000) CHARACTER SET utf8) AS
`publishDate`,
4 AS `rating1QuestType`, NULL AS `rating1Lang`, CAST(QVIEW1.`rating1` AS CHAR(8000) CHARACTER SET utf8) AS `rating1`,
4 AS `rating2QuestType`, NULL AS `rating2Lang`, CAST(QVIEW2.`rating2` AS CHAR(8000) CHARACTER SET utf8) AS `rating2`
FROM (
review QVIEW1
LEFT OUTER JOIN review QVIEW2
ON (QVIEW1.`nr` = QVIEW2.`nr`) AND
(QVIEW1.`publisher` = QVIEW2.`publisher`) AND
QVIEW2.`rating2` IS NOT NULL AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL
)
WHERE
QVIEW1.`title` IS NOT NULL AND
QVIEW1.`nr` IS NOT NULL AND
QVIEW1.`publishDate` IS NOT NULL AND
(QVIEW1.`product` = '62033') AND
QVIEW1.`publisher` IS NOT NULL AND
QVIEW1.`rating1` IS NOT NULL AND
(QVIEW1.`producer` = '1245')
56. Semantika SQL Creation
SELECT CONCAT('http://www4.wiwiss.fu-berlin.
de/bizer/bsbm/v01/instances/dataFromRatingSite{1}/Review{2}',' : ','"',
`OBDA_VIEW1`.`publisher`,'" "',`OBDA_VIEW1`.`nr`,'"') AS `review`,
`OBDA_VIEW1`.`title` AS `title`,
`OBDA_VIEW1`.`publishDate` AS `publishDate`,
`OBDA_VIEW1`.`rating1` AS `rating1`,
`OBDA_VIEW1`.`rating2` AS `rating2`
FROM `bsbm100_optimized`.`review` AS `OBDA_VIEW1`
WHERE `OBDA_VIEW1`.`publisher` IS NOT NULL AND
`OBDA_VIEW1`.`product` = 62033 AND
`OBDA_VIEW1`.`publishDate` IS NOT NULL AND
`OBDA_VIEW1`.`nr` IS NOT NULL AND
`OBDA_VIEW1`.`title` IS NOT NULL AND
`OBDA_VIEW1`.`rating1` IS NOT NULL AND
`OBDA_VIEW1`.`producer` = 1245