Big data, Semantic Web and Ontologies
Mélanie Courtot, PhD
Nov 12th 2014
mcourtot@sfu.ca

About me

Overview
• Big Data
  – Big Data is BIG
  – Issues in research
• Semantic Web
  – Standards: URIs, RDF, SPARQL, OWL
  – Linked data
• Ontologies
  – Definition and reasoning
  – OBO Foundry
  – Example of existing ontologies
  – Pharmacovigilance
  – Publishing ontologies on the Semantic Web
• IRIDA
  – The IRIDA platform
  – Adding standards to IRIDA
• Take home message

Big data
Big data is data that is too large and complex for conventional data tools to process.

[Image slide: 2005]
[Image slide: 2013]

What is a zettabyte?
1,000,000,000,000 gigabytes
= 1,000,000,000 terabytes
= 1,000,000 petabytes
= 1,000 exabytes
= 1 zettabyte

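The unit arithmetic behind a zettabyte can be checked with a short sketch. It assumes decimal (SI) prefixes, where each unit is 1,000 times the previous one; the names `UNITS` and `convert` are illustrative only.

```python
# Decimal (SI) byte units, each 1,000 times the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal byte units."""
    exponent = 3 * (UNITS.index(from_unit) - UNITS.index(to_unit))
    # Stay in exact integer arithmetic when converting to a smaller unit.
    if exponent >= 0:
        return value * 10 ** exponent
    return value / 10 ** (-exponent)


for unit in ["gigabyte", "terabyte", "petabyte", "exabyte"]:
    print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,} {unit}s")
```

With binary prefixes (factors of 1,024) the numbers would differ slightly; the slide uses round decimal figures, so the sketch does too.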
How big is big?
• Facebook: 25 terabytes of logged data per day; Google (2008): 20 petabytes per day
• Over 90% of all the data in the world was created in the past 2 years [1]
• Today: 3.2 zettabytes. 2020: 40 zettabytes. [2]
• Good news: jobs! [3]
1. http://www-01.ibm.com/software/data/bigdata/
2. http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3. http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Issues with research data (1): data availability
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Issues with research data (2): data reproducibility
http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU

A solution: the Semantic Web
"The Semantic Web is an ... extension of the current web in which ... information is given well-defined meaning, ... better enabling computers and people to work in cooperation."
The Semantic Web. Tim Berners-Lee, James Hendler and Ora Lassila. Scientific American, May 2001
http://www.scientificamerican.com/article/the-semantic-web/

The Semantic Web in a nutshell
Adds to Web standards and practices (currently only for documents and services), encouraging:
• Unambiguous names for things, classes, and relationships
• Well organized and documented in ontologies
• With data expressed using uniform knowledge representation languages (e.g. OWL)
• To enable computationally assisted exploitation of information
• That can be easily integrated from different sources

Some Semantic Web successes
• In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!
• A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.
• The Speech Interpretation and Recognition Interface (Siri), launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, draws heavily from work on ontologies, knowledge representation, and reasoning.
http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf

Uniform Resource Identifiers (URIs)
• Two different uses:
  – Unambiguous name for something
  – Location of a document
• Examples:
  – http://example.org/wiki/Main_Page
  – ftp://example.org/resource.txt
  – mailto:someone@example.com

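A small sketch of how the three example URIs break down into components, using only Python's standard library. The helper name `describe` is illustrative; the point is that the scheme (`http`, `ftp`, `mailto`) tells you how to interpret the rest.

```python
from urllib.parse import urlsplit

EXAMPLES = [
    "http://example.org/wiki/Main_Page",
    "ftp://example.org/resource.txt",
    "mailto:someone@example.com",
]


def describe(uri):
    """Return the scheme, authority and path components of a URI."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


for uri in EXAMPLES:
    scheme, authority, path = describe(uri)
    print(f"{uri}\n  scheme={scheme!r} authority={authority!r} path={path!r}")
```

Note that the `mailto:` URI has no authority part at all: it names a mailbox rather than locating a document, which matches the "name versus location" distinction on the slide.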
Resource Description Framework (RDF)
• Resources (= nodes)
  – Identified by a Uniform Resource Identifier (URI)
• Properties (= edges)
  – Identified by a Uniform Resource Identifier (URI)
  – Binary relations between 2 resources
http://elmonline.ca/sw/sparql/social.ttl

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.linkedin.com/in/mcourtot>
    a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .

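The Turtle statements above describe four edges leaving one node. As a rough sketch, assuming no RDF library is available, the same graph can be held in plain Python as (subject, predicate, object) tuples. The names `GRAPH` and `objects` are illustrative, and this toy model does not distinguish literals from URIs the way real RDF does.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # Turtle's "a"

ME = "http://www.linkedin.com/in/mcourtot"

# Each triple is (subject, predicate, object).
GRAPH = {
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "name", "Melanie Courtot"),
    (ME, FOAF + "knows", "http://elmonline.ca/luke"),
    (ME, FOAF + "knows", "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def objects(graph, subject, predicate):
    """Follow one kind of edge out of a node; return the targets, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(GRAPH, ME, FOAF + "knows"))
```

In practice one would normally load the Turtle file with an RDF library rather than retype the triples by hand.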
SPARQL: a query language for RDF

SELECT ?person
WHERE {
    <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}

----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------

• An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/

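What the SELECT query does can be sketched as pattern matching over triples: fixed terms must match exactly and `?person` binds to whatever sits in the object position. This is a toy matcher for a single triple pattern, not a SPARQL engine; `match` and `TRIPLES` are names made up for the illustration.

```python
KNOWS = "http://xmlns.com/foaf/0.1/knows"
ME = "http://www.linkedin.com/in/mcourtot"

TRIPLES = [
    (ME, KNOWS, "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
    (ME, KNOWS, "http://elmonline.ca/luke"),
    (ME, "http://xmlns.com/foaf/0.1/name", "Melanie Courtot"),
]


def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable seen twice must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# SELECT ?person WHERE { <ME> foaf:knows ?person . }
for row in match((ME, KNOWS, "?person"), TRIPLES):
    print(row["?person"])
```

A real query engine adds joins across several patterns, filters, and indexing, but the binding idea is the same.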
The Web Ontology Language (OWL)
• Knowledge representation language
• Based on Description Logics: fragments of first-order logic with decidable and well-defined computational properties
• Sound, complete, terminating reasoners available

Linked open data cloud

Biological resources in LOD

Examples of issues when data is linked incorrectly
• <http://dbpedia.org/resource/Welsh> owl:sameAs
  – <http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
  – <http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
  – <http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
  – <http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>

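Why these links are a problem: owl:sameAs is symmetric and transitive, so declaring one DBpedia resource identical to four Cyc concepts also makes the four concepts identical to each other. The sketch below computes the resulting identity sets with a small union-find; the function name `same_as_closure` is an assumption for illustration.

```python
def same_as_closure(links):
    """Group resources into the identity sets implied by owl:sameAs links.

    owl:sameAs is reflexive, symmetric and transitive, so the identity sets
    are the connected components of the link graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


CYC = "http://sw.cyc.com/2006/07/27/cyc/"
WELSH = "http://dbpedia.org/resource/Welsh"
LINKS = [(WELSH, CYC + name) for name in
         ("EthnicGroupOfWelsh", "Welsh-TheWord", "WelshLanguage", "Welshing-Cheating")]

(group,) = same_as_closure(LINKS)
# The language and the act of cheating end up in the same identity set.
print(CYC + "WelshLanguage" in group and CYC + "Welshing-Cheating" in group)
```

Any reasoner that honours owl:sameAs would therefore treat an ethnic group, a word, a language and an act of cheating as one and the same thing.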
Ontologies
• Representation of important things in a specific domain
  – Describes types of entities (e.g. cells), relations between them (e.g. prokaryotic cells and eukaryotic cells are cells) and their instances (e.g. the specific cells in my sample)
• An active computational artifact
  – A mathematical model based on a subset of first-order logic
  – Tools can automatically process ontologies
• A communication tool
  – Provides a dictionary for collaborators, a shared understanding
  – Allows data sharing

Reasoning is critical
• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
⇒ Unsatisfiability
⇒ Solution: distinguish spore (sensu Mycetozoa) from actinomycete-type spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

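The spore problem can be reproduced with a very small check: collect every superclass of a class and see whether two of them were declared disjoint. This is only a sketch covering named classes, subclass links and disjointness; an OWL reasoner handles far more expressive axioms. The function names are illustrative.

```python
def ancestors(cls, subclass_of):
    """Return cls together with all of its direct and inherited superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def unsatisfiable(subclass_of, disjoint_pairs):
    """List classes that inherit from two classes declared disjoint."""
    result = []
    for cls in subclass_of:
        supers = ancestors(cls, subclass_of)
        if any(a in supers and b in supers for a, b in disjoint_pairs):
            result.append(cls)
    return sorted(result)


SUBCLASS_OF = {
    "prokaryotic cell": ["cell"],
    "eukaryotic cell": ["cell"],
    "fungal cell": ["eukaryotic cell"],
    "spore": ["fungal cell", "prokaryotic cell"],
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]

print(unsatisfiable(SUBCLASS_OF, DISJOINT))
```

Nothing can be both a prokaryotic and a eukaryotic cell, so no instance of "spore" as modelled here can exist; splitting it into two more precise classes removes the conflict.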
Logics
• Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf
• Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl
• Manipulation done using Protégé: http://protege.stanford.edu

Family ontology

Logics of a grandfather

Reasoning

Inferred class hierarchy

Explanations

A wrong assertion

Unsatisfiability

OBO Foundry
A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:
• tight connection to the biomedical basic sciences
• compatibility
• interoperability, common relations
• formal robustness
• support for logic-based reasoning
http://www.obofoundry.org

OBO Foundry coverage, by granularity and relation to time (continuant: independent or dependent; occurrent)
• Organ and organism
  – Independent continuant: Organism (NCBI Taxonomy?), Anatomical Entity (FMA, CARO)
  – Dependent continuant: Organ Function (FMP, CPRO), Phenotypic Quality (PaTO)
  – Occurrent: Organism-Level Process (GO)
• Cell and cellular component
  – Independent continuant: Cell (CL), Cellular Component (FMA, GO)
  – Dependent continuant: Cellular Function (GO)
  – Occurrent: Cellular Process (GO)
• Molecule
  – Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
  – Dependent continuant: Molecular Function (GO)
  – Occurrent: Molecular Process (GO)
Slide credit: Barry Smith

Minimum Information to Reuse an External Ontology Term (MIREOT)
• OBO and the Semantic Web promote reuse of resources
• Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support
• MIREOT is used across the OBO library
  – OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
  – PR (Protein Ontology): 23,000 mireoted terms
• http://ontofox.hegroup.org

Examples of OBO ontologies
• OBI, the Ontology for Biomedical Investigations
• VO, the Vaccine Ontology
• AERO, the Adverse Event Reporting Ontology

Ontology for Biomedical Investigations (OBI)
• OBI is a multi-community project driven by the practical needs of its members, with the goal of building a high-quality, interoperable reference ontology
• OBI's high-level classes, solidified over several years, are in place and cover all aspects of biomedical investigations
• OBI is expanded to enable member applications, based on term requests

High level class hierarchy (partial)
Slide credit: OBI Consortium

Slide credit: Alan Ruttenberg

Slide credit: OBI Consortium

Representing vaccine data: the Vaccine Ontology (VO)
Picture credit: Yongqun He

Representing pharmacovigilance data
• The Adverse Event Reporting Ontology (AERO)
• Encodes existing clinical guidelines (Brighton Collaboration)
[Diagram labels: patient examination, clinician, patient, clinical report (exam report of June 7), clinical finding (finding of rash), medically relevant entity (rash), dermatological system, anatomical system; relations: has participant, has specified input, has specified output, is about, involves, located in, part of, found to exhibit]
• 'found to exhibit' some 'generalized urticaria or generalized erythema finding': inferred to be of type major dermatological criterion for anaphylaxis according to Brighton
• 'found to exhibit' some 'measured hypotension finding': inferred to be of type major cardiovascular criterion for anaphylaxis according to Brighton
• Level 1 of certainty of anaphylaxis according to Brighton: has component the two criteria above

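The inference chain on this slide (finding, then criterion, then level of certainty) can be mimicked in a few lines. This is a deliberately simplified sketch using only the two axioms visible on the slide; the actual Brighton case definition has more criteria and levels, and AERO expresses them in OWL rather than Python. All names below are illustrative.

```python
# Finding -> criterion, taken from the two axioms shown on the slide.
CRITERION_FOR = {
    "generalized urticaria or generalized erythema finding":
        "major dermatological criterion",
    "measured hypotension finding":
        "major cardiovascular criterion",
}

# Components of the one level shown on the slide (simplified assumption).
LEVEL_1_COMPONENTS = {
    "major dermatological criterion",
    "major cardiovascular criterion",
}


def criteria_met(findings):
    """Map the findings in a report to the criteria they satisfy."""
    return {CRITERION_FOR[f] for f in findings if f in CRITERION_FOR}


def is_level_1(findings):
    """True when every Level 1 component is covered by some finding."""
    return LEVEL_1_COMPONENTS <= criteria_met(findings)


report = ["generalized urticaria or generalized erythema finding",
          "measured hypotension finding"]
print(is_level_1(report))
```

The benefit of doing this in an ontology instead is that the reasoner derives the classification from the definitions, so the guideline is encoded once and stays inspectable.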
Background and problem statement
• Surveillance of Adverse Events Following Immunization is important
  – Detection of issues with vaccines
  – Importance of vaccine-risk communication
• Analysis of AE reports is a subjective, time-consuming and costly process
  – Manual review of the textual reports

Workflow
• Hypothesis: the AERO I developed can be used to annotate and classify a dataset
• VAERS dataset
  – Vaccine Adverse Event Reporting System
  – 6032 reports: ~5800 negative, ~230 positive
  – Post H1N1 immunization 2009/2010
  – Manually classified for anaphylaxis
• MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings

Automated diagnosis workflow
[Workflow diagram, steps A to D. Components: manually curated dataset; VAERS dataset (ASCII files, loaded into MySQL); Adverse Event Reporting Ontology (AERO); Brighton annotations (~800 MedDRA terms mapped to 32 Brighton terms); OWL/RDF export; reasoner]

Results
At best cut-off point: sensitivity 57%, specificity 97%

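For readers new to these two measures, a minimal sketch of how sensitivity and specificity are computed from a confusion matrix. The counts below are hypothetical, chosen only so that the formulas yield round percentages; they are not the study's actual numbers.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly positive reports that the classifier flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of truly negative reports that the classifier clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, chosen only to illustrate the formulas.
tp, fn = 57, 43
tn, fp = 970, 30

print(f"sensitivity = {sensitivity(tp, fn):.0%}")
print(f"specificity = {specificity(tn, fp):.0%}")
```

High specificity with moderate sensitivity means few false alarms but a share of true cases missed, which is worth keeping in mind when the positive class is rare, as it is in this dataset.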
AE classification can be improved through the use of ontologies
[Figure: ability to detect signal over time (November 2009, December 2009, January 2010) for 6000 reports; legend: manual analysis, ontology-based analysis; time gain: 3 months manual vs. 2h automated]
• Manual analysis: 3 months for 12 medical officers
• Ontology-based analysis: once data collected (2 months), almost instantaneous (2h on laptop)
=> Could allow for earlier detection of safety issues and better understanding of adverse events
http://dx.doi.org/10.1371/journal.pone.0092632

IRI dereferencing

Ontobee: publishing biomedical resources on the Semantic Web
• HTML for humans
• RDF for machines

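Serving HTML to humans and RDF to machines from the same IRI is usually done through HTTP content negotiation: the client states what it wants in the Accept header. The sketch below only builds the two requests and does not send them. The term IRI is an illustrative example in the OBO PURL style, and whether a given server honours the header is an assumption to verify.

```python
from urllib.request import Request

# Illustrative term IRI in the OBO PURL style (assumed for this sketch).
TERM_IRI = "http://purl.obolibrary.org/obo/OBI_0000070"


def build_request(iri, want_rdf):
    """Build (but do not send) a GET request for the chosen representation."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(iri, headers={"Accept": accept})


for_humans = build_request(TERM_IRI, want_rdf=False)
for_machines = build_request(TERM_IRI, want_rdf=True)
print(for_humans.get_header("Accept"), for_machines.get_header("Accept"))

# To actually dereference the IRI (needs network access):
#   from urllib.request import urlopen
#   with urlopen(for_machines) as response:
#       print(response.headers.get("Content-Type"))
```

The design point is that one identifier serves both audiences, so links in papers, databases and RDF data can all use the same IRI.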
The Integrated Rapid Infectious Disease Analysis (IRIDA) project
• Goal: automate infectious disease outbreak detection and investigation
• Issues:
  – Integrate WGS, clinical and lab info
  – Provide relevant tools and validate pipeline
• Methods:
  – Data standards for information exchange
  – Analysis pipeline (Galaxy based)
  – User interface
  – Additional tools:
    • IslandViewer
    • GenGIS

Building the IRIDA data standards
• Interview with key personnel at BCCDC
• Review of existing resources
• Identify "holes", i.e., missing bits
• Collect existing data
• Liaise with implementation team
• Generate cohesive resource
• Validate

Relevant data standards
• TypON, the typing ontology
• OBI, the Ontology for Biomedical Investigations
• NGSOnto, Next Generation Sequencing Ontology
• NIAIS-GS-BRC core metadata
• TRANS, Pathogen Transmission ontology
• ExO, Exposure Ontology
• EPO, Epidemiology Ontology
• IDO, Infectious Disease Ontology
• Food: USDA, EFSA?

Relevant international efforts
• MIxS standard
• Global Microbial Identifier
• Global Alliance for Genomics and Health
• NCBI BioSample
• European Nucleotide Archive
• …

Remaining challenges
• Trust, provenance
  – Ability to track origin of data to assess whether it is trustworthy
• Data sharing, reuse, policy
  – Social and legal issues in getting access to data
• Confidentiality
  – Privacy concerns when linking data

Take home message
Big data is a big challenge, but we can deal with it if we handle it properly: that will be your responsibility
• DO NOT build a black box
• DO annotate and describe your data
• DO make your data openly available

Acknowledgements
• Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
• The Brinkman^2 labs
• Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
• Colleagues at Public Health Agency Canada (Ms Lafleche, Dr Law)
• The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)

Mélanie Courtot, PhD
mcourtot@sfu.ca
@mcourtot
http://purl.org/net/mcourtot
