The document discusses contextualized knowledge graphs from the perspectives of semantic web and graph databases. It describes different models for representing facts and associated metadata in RDF, including reification, singleton properties, PaCE, and named graphs. Experiments show that the singleton property model provides the most compact representation and reasonable query performance. The document also discusses using a contextualized knowledge graph to represent chemical similarity data from PubChem in a more scalable way than the current approach.
Don't like RDF Reification? Making Statements about Statements Using Singleton Property (Vinh Nguyen)
Statements about RDF statements, or meta triples, provide additional information about individual triples, such as the source, the occurring time or place, or the certainty. Integrating such meta triples into semantic knowledge bases would enable the querying and reasoning mechanisms to be aware of the provenance, time, location, or certainty of triples. However, an efficient RDF representation for such meta knowledge of triples remains challenging. The existing reification approach allows such meta knowledge of RDF triples to be expressed in RDF in two steps. The first step represents the triple by a Statement instance whose subject, predicate, and object are indicated separately in three different triples. The second step creates assertions about that instance as if it were a statement. While reification is simple and intuitive, this approach does not have formal semantics and is not commonly used in practice, as described in the RDF Primer.
In this paper, we propose a novel approach called Singleton Property for representing meta triples and provide a formal semantics for it. We explain how this singleton property approach fits well with the existing syntax and formal semantics of RDF, and with the syntax of the SPARQL query language. We also demonstrate the use of singleton properties in the representation and querying of meta knowledge in two examples of Semantic Web knowledge bases: YAGO2 and BKR. This approach, which is also simple and intuitive, can be easily adopted for representing and querying statements about statements in other knowledge bases.
KnowledgeWiki: An OpenSource Tool for Creating Community-Curated Vocabulary, ... (Nishita Jaykumar)
Resource Description Framework (RDF) datasets can be created by transforming structured databases, extracting triples from semi-structured and unstructured sources, crowd-sourcing, or integrating existing datasets. The reliability and quality of these datasets can be improved by the participation of domain experts via a special-purpose tool or a crowd-sourced application. Wikidata and Semantic MediaWiki are platforms which facilitate this kind of crowd-sourced data curation.
We present our system, KnowledgeWiki, which is built upon the existing Semantic MediaWiki. We develop a novel extension by adopting the singleton property data model in our KnowledgeWiki. This extension allows various kinds of metadata about the RDF triples to be created in the Wiki. We combine this extension with other extensions such as semantic forms to provide a user-friendly, Wiki-like interface for domain experts with no prior technical expertise to easily curate data. We also present our new enhancement to Semantic MediaWiki, which facilitates importing existing RDF datasets into the wiki-based curating platform based on the singleton property approach, preserving the provenance of individual triples. We also describe how it is being used by the materials science community to create and curate consolidated vocabularies.
Semantic Web technologies such as RDF and OWL have become World Wide Web Consortium (W3C) standards for knowledge representation and reasoning. RDF triples about triples, or meta triples, form the basis for a contextualized knowledge graph. They represent the contextual information about individual triples such as the source, the occurring time or place, or the certainty.
However, the lack of an efficient RDF representation for such meta-knowledge of triples remains a major limitation of the RDF data model. The existing reification approach allows such meta-knowledge of RDF triples to be expressed in RDF by using four triples per reified triple. While reification is simple and intuitive, this approach does not have a formal foundation and is not commonly used in practice, as described in the RDF Primer.
This dissertation presents the foundations for representing, querying, reasoning over, and traversing contextualized knowledge graphs (CKGs) using Semantic Web technologies.
A triple-based compact representation for CKGs. We propose a principled approach to constructing RDF triples about triples by extending the current RDF data model with a new concept, called the singleton property (SP), which serves as a triple identifier. The SP representation requires only two triples per statement and can be queried with SPARQL.
A formal model-theoretic semantics for CKGs. We formalize the semantics of the singleton property and its relationship with the triple it represents. We extend the current RDF model-theoretic semantics to capture the semantics of singleton properties and provide interpretations at three levels: simple, RDF, and RDFS. This provides a single interpretation of the singleton property semantics across applications and systems.
A sound and complete inference mechanism for CKGs. Based on the semantics we propose, we develop a set of inference rules for validating and inferring new triples based on the SP syntax. We also develop different sets of context-based inference rules for provenance, time, and uncertainty.
A graph-based formalism for CKGs. We propose a formal contextualized graph model for the SP representation. We formalize RDF triples as a mathematical graph by combining model theory and graph theory into a hybrid RDF formal semantics. The unified semantics allows the RDF formal semantics to be leveraged in graph-based algorithms.
Data Provenance and its role in Data Science (Paolo Missier)
Invited talk at the April 18th-20th Data Science workshop in Islamabad, Pakistan
How provenance may help Data Science. State of the art and open challenges
Learning Multilingual Semantics from Big Data on the Web (Gerard de Melo)
This document summarizes Gerard de Melo's presentation on learning multilingual semantics from big data on the web. It discusses how lexical and taxonomic knowledge can be extracted at large scale from online resources like Wiktionary, Wikipedia, and WordNet. Methods are presented for merging structured data like knowledge graphs and integrating taxonomies across languages using techniques like linear program relaxation and belief propagation. The goal is to build large yet reasonably clean multilingual knowledge bases to power applications in areas like semantic search and the digital humanities.
The document discusses improving information retrieval by structuring data in records, mapping data to common standards like CRM, and publishing as RDF. This would allow linking data across systems, more comprehensive searches, and more representative conclusions by processing more data. Key recommendations include using existing databases, structuring records without free text, mapping to CRM, and publishing as RDF.
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a... (Julien PLU)
Julien Plu defended his PhD thesis on knowledge extraction from web media. His research addressed three main challenges: 1) extracting entities from different types of texts and languages, 2) linking entities to multiple knowledge bases, and 3) adapting entity linking pipelines for different contexts. To extract entities, he evaluated various natural language processing techniques including phrase matching, sequence labeling using neural networks, and coreference resolution. He found that combining multiple named entity recognition models improved performance over using a single model. Plu's research provided methods for extracting and linking entities from diverse textual sources in an adaptable manner.
The document discusses a webinar presented by NISO and DCMI on Schema.org and Linked Data. The webinar provides an overview of Schema.org and Linked Data, examines the advantages and challenges of using RDF and Linked Data, looks at Schema.org in more detail, and discusses how Schema.org and Linked Data can be combined. The goals of the webinar are to illustrate the different design choices for identifying entities and describing structured data, integrating vocabularies, and incentives for publishing accurate data, as well as to help guide adoption of Schema.org and Linked Data approaches.
Radically Open Cultural Heritage Data on the Web (Julie Allinson)
What happens when tens of thousands of archival photos are shared with open licenses, then mashed up with geolocation data and current photos? Or when app developers can freely utilize information and images from millions of books? On this panel, we'll explore the fundamental elements of Linked Open Data and discover how rapidly growing access to metadata within the world's libraries, archives and museums is opening exciting new possibilities for understanding our past, and may help in predicting our future. Our panelists will look into the technological underpinnings of Linked Open Data, demonstrate use cases and applications, and consider the possibilities of such data for scholarly research, preservation, commercial interests, and the future of cultural heritage data.
CEDAR & PRELIDA: Preservation of Linked Socio-Historical Data (PRELIDA Project)
by Albert Meroño, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
These slides were presented at the "graph databases in life sciences workshop". There is an accompanying Neo4j guide that will walk you through importing data into Neo4j using web services from a number of databases at EMBL-EBI.
https://github.com/simonjupp/importing-lifesci-data-into-neo4j
Presentation at ELAG 2011, European Library Automation Group Conference, Prague, Czech Republic. 25th May 2011
http://elag2011.techlib.cz/en/815-lifting-the-lid-on-linked-data/
This document summarizes work being done to express the Data Documentation Initiative (DDI) metadata standard in Resource Description Framework (RDF) format to improve discovery and linking of microdata on the Web of Linked Data. It describes background on the DDI to RDF mapping effort, the goals of making microdata more accessible and interoperable online, and examples of how the RDF representation would support common discovery use cases. It also provides information on tools and next steps for the ongoing work, acknowledging contributions from participants in workshops where this effort was discussed.
This document provides an overview of linked data and the SPARQL query language. It defines linked data as a method of publishing structured data on the web so that it can be interlinked and queried. The key aspects covered include linked data principles of using URIs to identify things and including links to other related data. SPARQL is introduced as the query language for retrieving and manipulating linked data.
Lecture at the advanced course on Data Science of the SIKS research school, May 20, 2016, Vught, The Netherlands.
Contents
-Why do we create Linked Open Data? Example questions from the Humanities and Social Sciences
-Introduction into Linked Open Data
-Lessons learned about the creation of Linked Open Data (link discovery, knowledge representation, evaluation).
-Accessing Linked Open Data
RDF presentation at DrupalCon San Francisco 2010 (scorlosquet)
The document discusses RDF and the Semantic Web in Drupal 7. It introduces RDF, how resources can be described as relationships between properties and values, and how this turns the web into a giant linked database. It describes Drupal 7's new RDF and RDFa support which exposes entity relationships and allows for machine-readable semantic data. Future improvements discussed include custom RDF mappings, SPARQL querying of site data, and connecting to external RDF sources.
Perspectives on mining knowledge graphs from text (Jennifer D'Souza)
A survey presented at the International Winter School on Knowledge Graphs and Semantic Web 2021 http://www.kgswc.org/winter-school/; November 2021; DOI: 10.13140/RG.2.2.24482.56005
The Semantic Web - Interacting with the Unknown (Steffen Staab)
When developing user interfaces for interacting with data and content one typically assumes that one knows the type of data and one knows how to interact with such type of data. The core idea of the Semantic Web is that data is self-describing, which implies that its semantics is not designed and described at an initial point in time, but it rather emerges by its use. This flexibility is one of the greatest assets of the Semantic Web, but it also severely handicaps intelligent interaction with its data.
In this talk, we will sketch the principal problem as well as first steps to deal with the problem of interacting with the unknown.
The web of interlinked data and knowledge stripped (Sören Auer)
Linked Data approaches can help solve enterprise information integration (EII) challenges by complementing text on web pages with structured, linked open data from different sources. This allows for intelligently combining, integrating, and joining structured information across heterogeneous systems. A distributed, iterative, bottom-up integration approach using Linked Data may help solve the EII problem in large companies by taking a pay-as-you-go approach.
With the advent of Facebook’s Open Graph, HTML5 and Google’s Rich Snippets, the web has begun a rapid transformation to being understandable for computers. This understanding comes from data that is embedded in webpages and, perhaps more importantly, a new kind of hyperlink that connects concepts instead of documents. Information architects and interaction designers are needed now more than ever to make sense of all this data and to visualize it in new and interesting ways. In this presentation, you will learn how to take advantage of the Semantic Web’s foundational technology called Linked Data, which allows you to both produce and consume the data that is making up this new web.
Prateek Jain dissertation defense, Kno.e.sis, Wright State University (Prateek Jain)
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize much of the benefits promised. If this limitation is left unaddressed, then the LOD Cloud will merely be more data that suffers from the same kinds of problems, which plague the Web of Documents, and hence the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to address the issue of alignment and relationship identification using a bootstrapping based approach. By alignment we mean the process of determining correspondences between classes and properties of ontologies. We identify subsumption, equivalence and part-of relationship between classes. The work identifies part-of relationship between instances. Between properties we will establish subsumption and equivalence relationship. By bootstrapping we mean the process of being able to utilize the information which is contained within the datasets for improving the data within them. The work showcases use of bootstrapping based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, have provided evidence to the feasibility and the applicability of the solution.
Providing open data is of interest for its societal and commercial value, for transparency, and because more people can do fun things with data. There is a growing number of initiatives to provide open data, from, for example, the UK government and the World Bank. However, much of this data is provided in formats such as Excel files, or even PDF files. This raises the question of
- How best to provide access to data so it can be most easily reused?
- How to enable the discovery of relevant data within the multitude of available data sets?
- How to enable applications to integrate data from large numbers of formerly unknown data sources?
One way to address these issues is to use the design principles of linked data (http://www.w3.org/DesignIssues/LinkedData.html), which suggest best practices for how to publish and connect structured data on the Web. This presentation gives an overview of linked data technologies (such as RDF and SPARQL), examples of how they can be used, as well as some starting points for people who want to provide and use linked data.
The presentation was given on August 8, at the Hacknight event (http://hacknight.se/) of Forskningsavdelningen (http://forskningsavd.se/) (Swedish: “Research Department”) a hackerspace in Malmö.
This document discusses semantic search over big linked data. It describes the technical challenges of searching large linked datasets including issues related to volume, velocity, and variety of data. It presents the author's previous and current work on acquiring, organizing, analyzing, and searching linked data. This includes developing indexes and algorithms for efficient keyword search and top-k query processing over large, heterogeneous linked datasets. The author discusses achievements and opportunities for improving search over hybrid and heterogeneous big data in the future.
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge Graphs (Jeff Z. Pan)
Tutorial on "Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge Graphs" presented at the 4th Joint International Conference on Semantic Technologies (JIST2014)
RDA implementation is scheduled for March 31, 2013. Testers of RDA recommended improvements like rewriting instructions in plain English and ensuring community involvement. Differences from AACR2 include lack of abbreviations, more transcription of what is seen, and new fields in MARC like 336, 337, 338 for content/media/carrier types. Linked data and semantic web approaches may make relationships between works more explicit over time. Preparing for RDA involves decisions about cataloging workflows and training.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
6. What is a Contextualized Knowledge Graph?
A contextualized knowledge graph is a knowledge graph in which every fact is qualified with a set of contextual properties.
7. Motivation Scenario
Facts:
Subject   | Predicate | Object
Bob Dylan | marriedTo | Sarah Lownds
Bob Dylan | marriedTo | Carolyn Dennis

Facts with temporal context:
Subject   | Predicate | Object         | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds   | 1965-11-22 | 1977-06-29
Bob Dylan | marriedTo | Carolyn Dennis | 1986-06-## | 1992-10-##

Meta Queries:
Query type | Sample query
Provenance | P1. Where is this fact from?
           | P2. When was it created?
           | P3. Who created this fact?
Time       | T1. When did this fact occur?
           | T2. What is the time span of this fact?
           | T3. Which events happened in the same year?
Location   | L1. What is the location associated with this fact?
           | L2. Which events happened at the same place?
Certainty  | C1. What is the author confidence of this fact?
9. Linked Open Data: 2973 datasets with 149 billion triples
Linked Data principles:
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
Include links to other URIs so that they can discover more things.
11. Form of Triples: RDF Reification
Time-aware Fact:
Subject   | Predicate | Object       | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds | 1965-11-22 | 1977-06-29

RDF Reification:
Subject   | Predicate   | Object
#stmt1    | type        | Statement
#stmt1    | hasSubject  | BobDylan
#stmt1    | hasProperty | marriedTo
#stmt1    | hasObject   | SarahLownds
Bob Dylan | marriedTo   | Sarah Lownds
#stmt1    | starts      | 1965-11-22
#stmt1    | ends        | 1977-06-29

Pros:
1. Intuitive, easy to understand
Cons:
1. Takes 3N triples (4N if including Statement typing) to represent N statements => not scalable
2. No formal semantics defined => semantics is unclear
3. Discouraged in LOD!
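For concreteness, the reified fact above can be written as the following SPARQL update (a minimal sketch using the standard rdf: reification vocabulary; the ex: namespace, ex:stmt1, and the starts/ends property names are illustrative assumptions):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  INSERT DATA {
    # Four triples just to stand in for the original statement ...
    ex:stmt1 rdf:type      rdf:Statement ;
             rdf:subject   ex:BobDylan ;
             rdf:predicate ex:marriedTo ;
             rdf:object    ex:SarahLownds ;
    # ... plus the meta triples attached to the statement resource.
             ex:starts     "1965-11-22" ;
             ex:ends       "1977-06-29" .
    # The original triple itself must still be asserted separately.
    ex:BobDylan ex:marriedTo ex:SarahLownds .
  }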
12. RDF Reification vs. Singleton Property
Time-aware Fact:
Subject   | Predicate | Object       | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds | 1965-11-22 | 1977-06-29

RDF Reification:
Subject   | Predicate   | Object
#stmt1    | type        | Statement
#stmt1    | hasSubject  | BobDylan
#stmt1    | hasProperty | marriedTo
#stmt1    | hasObject   | SarahLownds
Bob Dylan | marriedTo   | Sarah Lownds
#stmt1    | starts      | 1965-11-22
#stmt1    | ends        | 1977-06-29

Singleton Property:
Subject     | Predicate   | Object
marriedTo#1 | rdf:sp      | marriedTo
BobDylan    | marriedTo#1 | Sarah Lownds
marriedTo#1 | starts      | 1965-11-22
marriedTo#1 | ends        | 1977-06-29

Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. "Don't like RDF reification? Making statements about statements using singleton property." In Proceedings of the 23rd International Conference on World Wide Web, pp. 759-770. ACM, 2014.
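The same fact under the singleton property model, as a SPARQL update (a minimal sketch; rdf:sp in the table above abbreviates the paper's rdf:singletonPropertyOf, and ex:marriedTo_1 stands in for the singleton property URI marriedTo#1):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  INSERT DATA {
    # One triple asserts the fact through its singleton property ...
    ex:BobDylan    ex:marriedTo_1          ex:SarahLownds .
    # ... one links the singleton property to the generic one ...
    ex:marriedTo_1 rdf:singletonPropertyOf ex:marriedTo .
    # ... and the meta triples attach directly to it.
    ex:marriedTo_1 ex:starts "1965-11-22" ;
                   ex:ends   "1977-06-29" .
  }

A meta query such as T2 from the motivation scenario (the time span of the fact) then runs in plain SPARQL:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  SELECT ?starts ?ends WHERE {
    ?sp rdf:singletonPropertyOf ex:marriedTo .
    ex:BobDylan ?sp ex:SarahLownds .
    ?sp ex:starts ?starts ;
        ex:ends   ?ends .
  }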
13. Form of Triples: PaCE (Provenance-aware Context Entity)
Provenance-aware Fact:
Subject   | Predicate | Object       | Source             | DateExtracted
Bob Dylan | marriedTo | Sarah Lownds | wikipage:Bob_Dylan | 2009-06-07

PaCE:
Subject       | Predicate  | Object
BobDylan_wp   | rdf:type   | Bob Dylan
SaraLownds_wp | rdf:type   | Sara Lownds
BobDylan_wp   | marriedTo  | SaraLownds_wp
BobDylan_wp   | hasSource  | wiki:Bob_Dylan
BobDylan_wp   | hasDateExt | 2009-06-07

Pros:
1. Saves roughly 50% of the triples compared to reification, since the subject, predicate, and object are reused directly rather than restated in separate triples.
Cons:
1. Not intuitive, hard to understand
2. Limited expressiveness

Satya S. Sahoo, Olivier Bodenreider, Pascal Hitzler, Amit Sheth, and Krishnaprasad Thirunarayan. "Provenance context entity (PaCE): scalable provenance tracking for scientific RDF data." In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM'10), 2010.
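A minimal sketch of the PaCE triples from the table above as a SPARQL update (the ex: namespace, the _wp suffix, and the Wikipedia URL are illustrative assumptions; typing a context entity with rdf:type pointing to the generic entity mirrors the slide's table rather than standard RDF practice):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  INSERT DATA {
    # One provenance-specific context entity per (entity, source) pair,
    # linked to the generic entity (mirroring the slide's table).
    ex:BobDylan_wp   rdf:type ex:BobDylan .
    ex:SaraLownds_wp rdf:type ex:SaraLownds .
    # The fact is asserted between the context entities ...
    ex:BobDylan_wp ex:marriedTo ex:SaraLownds_wp .
    # ... and the provenance attaches once to the context entity,
    # covering every triple that uses it.
    ex:BobDylan_wp ex:hasSource  <https://en.wikipedia.org/wiki/Bob_Dylan> ;
                   ex:hasDateExt "2009-06-07" .
  }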
14. PaCE vs. Singleton Property
Provenance-aware Fact:
Subject   | Predicate | Object       | Source             | DateExtracted
Bob Dylan | marriedTo | Sarah Lownds | wikipage:Bob_Dylan | 2009-06-07

PaCE (Provenance-aware Context Entity):
Subject       | Predicate  | Object
BobDylan_wp   | rdf:type   | Bob Dylan
SaraLownds_wp | rdf:type   | Sara Lownds
BobDylan_wp   | marriedTo  | SaraLownds_wp
BobDylan_wp   | hasSource  | wiki:Bob_Dylan
BobDylan_wp   | hasDateExt | 2009-06-07

Singleton Property:
Subject     | Predicate   | Object
marriedTo#1 | rdf:sp      | marriedTo
BobDylan    | marriedTo#1 | Sarah Lownds
marriedTo#1 | hasSource   | wp:Bob_Dylan
marriedTo#1 | hasDateExt  | 2009-06-07
15. Form of Quadruples: Named Graph
Time-aware Fact:
Subject   | Predicate | Object       | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds | 1965-11-22 | 1977-06-29

Named Graph:
Subject   | Predicate | Object       | NG
Bob Dylan | marriedTo | Sarah Lownds | ng_1
ng_1      | starts    | 1965-11-22   | Prov_graph
ng_1      | ends      | 1977-06-29   | Prov_graph

Pros:
1. Intuitive: create one named graph per source
2. Can attach metadata to a set of triples
3. Supported by SPARQL
Cons:
1. Defined for provenance only
2. Ambiguous semantics when associating different types of metadata at the triple level

* Carroll, Jeremy J., et al. "Named graphs, provenance and trust." Proceedings of the 14th International Conference on World Wide Web. ACM, 2005.
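The named graph version can be loaded with a SPARQL update and queried with GRAPH patterns (a sketch; the graph names ex:ng_1 and ex:Prov_graph follow the table above, and all ex: names are illustrative):

  PREFIX ex: <http://example.org/>

  INSERT DATA {
    # The fact lives in its own (singleton) named graph ng_1 ...
    GRAPH ex:ng_1 {
      ex:BobDylan ex:marriedTo ex:SarahLownds .
    }
    # ... and the metadata about ng_1 lives in a separate provenance graph.
    GRAPH ex:Prov_graph {
      ex:ng_1 ex:starts "1965-11-22" ;
              ex:ends   "1977-06-29" .
    }
  }

The time-span query then joins across the two graphs:

  PREFIX ex: <http://example.org/>

  SELECT ?starts ?ends WHERE {
    GRAPH ?g { ex:BobDylan ex:marriedTo ex:SarahLownds . }
    GRAPH ex:Prov_graph { ?g ex:starts ?starts ; ex:ends ?ends . }
  }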
16. Named Graph vs. Singleton Property
Time-aware Fact:
Subject   | Predicate | Object       | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds | 1965-11-22 | 1977-06-29

Named Graph:
Subject   | Predicate | Object       | NG
Bob Dylan | marriedTo | Sarah Lownds | ng_1
ng_1      | starts    | 1965-11-22   | Prov_graph
ng_1      | ends      | 1977-06-29   | Prov_graph

Singleton Property:
Subject     | Predicate   | Object
marriedTo#1 | rdf:sp      | marriedTo
Bob Dylan   | marriedTo#1 | Sarah Lownds
marriedTo#1 | starts      | 1965-11-22
marriedTo#1 | ends        | 1977-06-29
17. Form of Quintuples: RDF+
Fact with Temporal Information:
Subject   | Predicate | Object       | Starts     | Ends
Bob Dylan | marriedTo | Sarah Lownds | 1965-11-22 | 1977-06-29

RDF+:
Subject   | Predicate | Object       | Meta Property | Meta Value
Bob Dylan | marriedTo | Sarah Lownds | starts        | 1965-11-22
Bob Dylan | marriedTo | Sarah Lownds | ends          | 1977-06-29

Cons:
1. The representation is not in the form of RDF; statement identifiers are used internally, requiring mappings from RDF to RDF+ and vice versa.
2. The SPARQL query syntax and semantics need to be extended to support RDF+.

* Dividino, Renata, et al. "Querying for provenance, trust, uncertainty and other meta knowledge in RDF." Web Semantics: Science, Services and Agents on the World Wide Web 7.3 (2009): 204-219.
18. Experiment: BKR with Provenance
• Five data sets generated from the same seed BKR:
Singleton Property (SP)
Reification (R)
PaCE C1 (C1)
PaCE C2 (C2)
PaCE C3 (C3)
All datasets are available at http://wiki.knoesis.org/index.php/Singleton_Property
20. External Evaluation
• Gang Fu, Evan Bolton, Núria Queralt Rosinach, Laura I Furlong, Vinh Nguyen, Amit Sheth, Olivier Bodenreider, Michel Dumontier. "Exposing provenance metadata using different RDF models." In Proceedings of Semantic Web Applications and Tools for Life Science (SWAT4LS), 2016. https://pubchem.ncbi.nlm.nih.gov/
• Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What works well with Wikidata?" SSWS@ISWC 1457 (2015): 32-47.
• Frey, Johannes, Kay Müller, Sebastian Hellmann, Erhard Rahm, and Maria-Esther Vidal. "Evaluation of Metadata Representations in RDF stores."
• Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas, Enzo Zerega. "Querying Wikidata: Comparing SPARQL, Relational and Graph Databases." International Semantic Web Conference (2) 2016: 88-103.
21. Exposing provenance metadata using different RDF models
Gang Fu, Evan Bolton, Núria Queralt Rosinach, Laura I Furlong, Vinh Nguyen, Amit Sheth, Olivier Bodenreider, Michel Dumontier

Subject                | Predicate | Object         | Source       | FromDataset | Confidence
CID5280961 (Genistein) | inhibits  | GID2100 (ESR2) | PMID12502307 | ChEMBL      |
CID5757 (Estradiol)    | activates | GID2100 (ESR2) | PMID19128016 | ChEMBL      |
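Under the singleton property model, the first row of this table could be encoded as follows (a hedged sketch; the ex: names are illustrative rather than the vocabulary actually used in BKR or PubChemRDF, and the empty Confidence cell is simply omitted):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/>

  INSERT DATA {
    # The assertion itself, made through a singleton property ...
    ex:CID5280961 ex:inhibits_1 ex:GID2100 .
    # ... and its provenance, attached to that property.
    ex:inhibits_1 rdf:singletonPropertyOf ex:inhibits ;
                  ex:hasSource   ex:PMID12502307 ;
                  ex:fromDataset ex:ChEMBL .
  }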
22. PubChem
• Five data sets generated from the same seed, with triple counts:
N-ary with cardinal assertion (Model I): 22,787,218
N-ary without cardinal assertion (Model II): 21,445,348
Singleton property with cardinal assertion (Model III): 19,575,298
Singleton property without cardinal assertion (Model IV): 17,239,427
NanoPublication (Model V): 27,605,782
• Comparing sizes of generated datasets:
The SP datasets are the most compact ones.
Gang Fu, Evan Bolton, Núria Queralt Rosinach, Laura I Furlong, Vinh Nguyen, Amit Sheth, Olivier Bodenreider, Michel Dumontier. "Exposing provenance metadata using different RDF models." In Proceedings of Semantic Web Applications and Tools for Life Science (SWAT4LS), 2016.
25. Wikidata
• Four data sets generated from the same seed:
Standard Reification (SR)
N-ary relation (NR)
Singleton property (SP)
Named Graph (NG)
• Comparing sizes of generated datasets:
The SP dataset is the most compact one.
Hernández, Daniel, Aidan Hogan, and Markus Krötzsch. "Reifying RDF: What works well with Wikidata?" SSWS@ISWC 1457 (2015): 32-47.
26. Wikidata
• Query performance in 4store and GraphDB:
SP models are not supported by 4store and GraphDB.
• Query performance in Virtuoso and BlazeGraph:
Reification and NG are well supported by Virtuoso and BlazeGraph.
SP is slightly faster than NR in Virtuoso, and slower in BlazeGraph.
27. Wikidata
• Six data sets generated from the same seed:
Standard Reification (stdreif)
N-ary relation (naryrel)
Singleton property (sgprop)
Companion property (cpprop)
Named Graph (ngraphs)
RDF* (rdr)
• Comparing sizes of generated datasets:
SP is the most compact triple representation.
Fastest loading time for Wikidata.
Best query performance in Stardog in all cases.
Slowest in Virtuoso for Wikidata queries, though not by much.
No performance issues encountered with SP.
Frey, Johannes, Kay Müller, Sebastian Hellmann, Erhard Rahm, and Maria-Esther Vidal. "Evaluation of Metadata Representations in RDF stores."
28. Experimental Comparison
• Dataset size:
SP offers the most concise representation in all cases.
• Query performance:
SP performs reasonably well in Virtuoso, best in Stardog, and acceptably in BlazeGraph.
SP has the potential for further performance gains if natively supported and optimized by query engines.
Is the SP representation optimal?
34. Current PubChem Neighbor
• Number of links:
92,000,000 * 92,000,000 / 2 = 4.232 * 10^15, about 4 quadrillion (this approximates the number of unordered compound pairs, n(n-1)/2)
• Challenges:
⨯ The number of triples grows into the quadrillions
⨯ SPARQL query processing over quadrillions of triples
• Is it worth it?
Chemical similarity is one of the most important concepts in chemoinformatics.
Similar compounds have similar properties.
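If the neighbor links were published as contextualized facts, a single similarity link under the singleton property model might look like this (a sketch; ex:similarTo, ex:hasScore, and the score 0.87 are illustrative assumptions, not PubChem's actual vocabulary or data):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  PREFIX ex:  <http://example.org/>

  INSERT DATA {
    # One of the roughly 4 quadrillion potential compound pairs ...
    ex:CID5280961  ex:similarTo_1          ex:CID5757 .
    # ... qualified with its similarity score via the singleton property.
    ex:similarTo_1 rdf:singletonPropertyOf ex:similarTo ;
                   ex:hasScore             "0.87"^^xsd:decimal .
  }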
Semantic Web technology, enhanced by a massive use of linked open data, plays a crucial role in the overall Deep QA architecture.
CEO Sundar Pichai led the charge here, noting that Google's Knowledge Graph (the easily accessible information that pops up under the search bar for certain queries) now encompasses 70 billion facts.