Building a Knowledge Graph @ Graph Day 2018
1. REUTERS / Danish Ismail
BUILDING A KNOWLEDGE GRAPH
DAN BENNETT - GRAPH DAY 2018
@nonodename
SEPTEMBER, 2018
2. AGENDA
• A little on TR
• What’s a knowledge graph?
• Quick reset on RDF - if needed
• Data engineering for our knowledge graph
• Lessons learned
• Q&A
4. THOMSON REUTERS - THE ANSWER COMPANY
• Information, technology and expertise for professionals
• Focus on finance, risk, media, legal, tax and accounting markets
• 87% recurring revenue, 93% electronic, global footprint
• My role: big data & NLP within central technology group supporting market aligned business units
REUTERS/Amit Dave
7. WHAT IS A KNOWLEDGE GRAPH?
• Open world representation of information
• Every entry point is equal cost
• Underpin Cortana, Google Assistant, Siri, Alexa
• Typically (but doesn’t have to be) expressed in RDF
[Figure: example knowledge graph of a football match. Nodes: Game, Team (England), Team (Panama), Venue (Nizhny Novgorod Stadium), final Score (6-1), individual Scores (e.g. Stones, 8'). Edges: hasName, hasLogo, played, hasFinalScore, playedAt, hasQuarter, atTime, byPlayer.]
10. SCHEMA ON WRITE
• Fixed data model
• Slow to change
• Strong enforcement
11. SCHEMA ON READ
• Capture everything
• Apply logic (schema) on read
• No standards
12. RDF: SCHEMA ON READ, OPTIONAL ON WRITE
[Figure: spectrum from Schema on Read to Schema on Write, with RDF in between]
Schema on Read
• Anything goes
• Capture everything
• Super flexible
• Referential integrity on read
RDF
• Federated
• Standards
• Flexible
• (potentially) verbose
Schema on Write
• Accuracy
• Difficult & slow to change
• Triggers/Stored Procs/IDs
• Referential integrity on write
13. HOW CAN THAT BE? (SIMPLIFIED!)
Orders
ID  Date         Amount  Customer
1   30-Aug-2016  56.84   1
2   31-Aug-2016  42.36   2
3   1-Sep-2016   98.45   1
4   1-Sep-2016   23.54   3
Customers
ID  Name
1   Barack Obama
2   Richard Nixon
3   Ronald Reagan
4   Bill Clinton
Subject Predicate Object
http://tr.com/orders/1 http://ont.tr.com/orders/order_date 20160830
http://tr.com/orders/1 http://ont.tr.com/orders/order_amount 56.84
http://tr.com/orders/1 http://ont.tr.com/orders/order_customer http://tr.com/customers/1
http://tr.com/orders/1 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://ont.tr.com/order
http://tr.com/orders/2 http://ont.tr.com/orders/order_date 20160831
http://tr.com/orders/2 http://ont.tr.com/orders/order_amount 42.36
http://tr.com/orders/2 http://ont.tr.com/orders/order_customer http://tr.com/customers/2
http://tr.com/orders/2 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://ont.tr.com/order
… … …
http://tr.com/customers/1 http://ont.tr.com/customers/name Barack Obama
http://tr.com/customers/1 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://ont.tr.com/customer
http://tr.com/customers/2 http://ont.tr.com/customers/name Richard Nixon
http://tr.com/customers/2 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://ont.tr.com/customer
… … …
Relational → RDF
• URI = primary key
• New column = new rows
• Sparse if row missing
• Object a relation or literal
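The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the talk: the function name `rows_to_triples` and its parameters are made up for this example, while the URI patterns follow the sample rows on the slide.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# URI patterns copied from the slide's example rows.
DATA_NS = "http://tr.com/"
ONT_NS = "http://ont.tr.com/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = Tuple[str, str, str]


def rows_to_triples(
    table: str,
    cls: str,
    key: str,
    rows: Iterable[dict],
    foreign_keys: Optional[Dict[str, str]] = None,
) -> List[Triple]:
    """Hypothetical helper: turn relational rows into (subject, predicate, object) triples."""
    foreign_keys = foreign_keys or {}
    triples: List[Triple] = []
    for row in rows:
        # URI = primary key
        subject = f"{DATA_NS}{table}/{row[key]}"
        triples.append((subject, RDF_TYPE, f"{ONT_NS}{cls}"))
        for column, value in row.items():
            if column == key or value is None:
                continue  # sparse: a missing value simply produces no triple
            predicate = f"{ONT_NS}{table}/{column}"
            if column in foreign_keys:
                # Object is a relation: point at the other table's URI.
                obj = f"{DATA_NS}{foreign_keys[column]}/{value}"
            else:
                # Object is a literal.
                obj = str(value)
            triples.append((subject, predicate, obj))
    return triples


orders = [
    {"id": 1, "order_date": "20160830", "order_amount": 56.84, "order_customer": 1},
    {"id": 2, "order_date": "20160831", "order_amount": 42.36, "order_customer": 2},
]
order_triples = rows_to_triples(
    "orders", "order", "id", orders, foreign_keys={"order_customer": "customers"}
)
```

Adding a column to the source table only adds more triples per subject; nothing about the three-column shape changes.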
14. SCHEMA, QUERY & FEDERATION
Subject Predicate Object
http://tr.com/orders/1 http://ont.tr.com/orders/order_date 20160830
http://tr.com/orders/1 http://ont.tr.com/orders/order_amount 56.84
http://tr.com/orders/1 http://ont.tr.com/orders/order_customer http://tr.com/customers/1
http://tr.com/orders/1 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://ont.tr.com/order
http://tr.com/orders/1 http://ont.salesforce.com/crm/customer_spend 9856.45
http://tr.com/customers/2 http://www.w3.org/2002/07/owl#sameAs http://en.wikipedia.org/wiki/Richard_Nixon
http://en.wikipedia.org/wiki/Richard_Nixon http://owl.wikipedia.org/born 19130109
[Callouts: Federated data (spend from CRM) · Relation to external data · Schema (Ontology): more than one can apply to a subject]
• SPARQL - like SQL. Sum all orders:
SELECT (SUM(?amount) AS ?total)
WHERE {
?order <http://ont.tr.com/orders/order_amount> ?amount .
?order <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ont.tr.com/order> .
}
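To make the two triple patterns in that query concrete, here is a rough Python equivalent over an in-memory list of triples. It is only a sketch of what the query computes, not how a SPARQL engine evaluates it; the function name `sum_order_amounts` is invented for this example.

```python
from decimal import Decimal

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORDER_CLASS = "http://ont.tr.com/order"
ORDER_AMOUNT = "http://ont.tr.com/orders/order_amount"

# A few rows from the slide, as (subject, predicate, object) tuples.
triples = [
    ("http://tr.com/orders/1", ORDER_AMOUNT, "56.84"),
    ("http://tr.com/orders/1", RDF_TYPE, ORDER_CLASS),
    ("http://tr.com/orders/2", ORDER_AMOUNT, "42.36"),
    ("http://tr.com/orders/2", RDF_TYPE, ORDER_CLASS),
    ("http://tr.com/customers/1", "http://ont.tr.com/customers/name", "Barack Obama"),
]


def sum_order_amounts(triples):
    """Hypothetical helper mirroring the query's two patterns joined on ?order."""
    # Pattern 2: ?order rdf:type <http://ont.tr.com/order>
    orders = {s for s, p, o in triples if p == RDF_TYPE and o == ORDER_CLASS}
    # Pattern 1: ?order order_amount ?amount, restricted to the subjects above
    return sum(Decimal(o) for s, p, o in triples if p == ORDER_AMOUNT and s in orders)


total = sum_order_amounts(triples)
```

The shared variable `?order` in the query plays the role of the `s in orders` check here: both patterns must match the same subject.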
15. WHY RDF FOR A KNOWLEDGE GRAPH?
                                   RDF DB                     Property Graph DB
Open                               Yes                        Maybe
Incremental Load                   Via named graph or SPARQL  Maybe
Federated Data                     Yes                        No
Modelling tools                    Yes                        Unlikely
Types/Classes/higher abstractions  Yes                        No
17. PHYSICAL
[Figure: physical data flow, left to right]
• Sources (Relational, Proprietary): RDBMS, web services, filesystem, full text
• Access: pull, push, batch over HTTPS, FTP/HTTP, JDBC
• ETL (Snaplogic or Hadoop) and full text mining, both emitting RDF
• Warehouse: CM-Well RDF store
• REST based publishing and remote read replicas
• Mart/Products: Neptune, Elastic, RDBMS, filesystem
18. LOGICAL
As captured
• Mechanistic conversion
• Minimal validation
• Named graph for W3C provenance & update
Target Model
• “Canonical Graph”
• Curated ontologies
• Normalized representation
Selective Product Models
• Slice & dice
• Store/retrieve using whatever works
• Not necessarily graph
[Diagram arrows between the stages: SPARQL; SPARQL Triggers]
19. OUR GRAPH WAREHOUSE: CM-WELL
• NOT a triple store! Focus is on data movement
• No master node
• Linear scaling
• Stateless
• JVM isolation
• Query based subscription
• Logical replication
• Available on GitHub
[Figure: HA Proxy in front of a roaming grid of identical CM-Well nodes, reached over REST/HTTP. Each CM-Well node contains Cassandra, Elastic, Kafka, a web server, background processing and user workload, under a health control layer.]
21. RELATIONAL
• Map primary keys into own namespace (or assign surrogate keys)
• Map dimensions to existing entities if possible
• Concentrate on the relations and attributes that matter
• Can always return to the source for details
<https://permid.org/1-4297089638>
a tr-org:Organization ;
tr-common:hasPermId "4297089638"^^xsd:string ;
tr-org:isIncorporatedIn <http://sws.geonames.org/6252001/> ;
fibo-be-le-cb:isDomiciledIn <http://sws.geonames.org/6252001/> ;
vcard:hasURL <https://www.tesla.com/> ;
vcard:organization-name "Tesla Inc"^^xsd:string .
<https://permid.org/1-34421840245>
a tr-person:Person ;
vcard:family-name "Musk"^^xsd:string ;
vcard:given-name "Elon"^^xsd:string .
<https://permid.org/2-497b8953cd00ec12589126c0f1116e2ca8fb484b80722
person:hasPositionType o:1-10010134 ;
person:hasReportedTitle "Chairman of the Board" ;
person:isPositionIn o:1-4297089638 .
[Callouts: Surrogate key · Relationship with properties · Existing ontologies · Existing dimension]
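One way to read "map primary keys into own namespace (or assign surrogate keys)" is sketched below. The namespace `https://example.org/id/` and both function names are assumptions for illustration, and the hashing scheme is not a description of how PermID identifiers are actually produced.

```python
import hashlib

# Hypothetical namespace; a real deployment would use its own.
NAMESPACE = "https://example.org/id/"


def natural_key_uri(entity_type: str, primary_key: str) -> str:
    """Map an existing primary key straight into our own namespace."""
    return f"{NAMESPACE}{entity_type}/{primary_key}"


def surrogate_key_uri(entity_type: str, *parts: str) -> str:
    """Assign a surrogate key to something with no key of its own,
    such as a relationship that carries properties.

    The key is a SHA-256 hash of the identifying parts, so the same
    inputs give the same URI on every run.
    """
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    return f"{NAMESPACE}{entity_type}/{digest}"


org_uri = natural_key_uri("org", "4297089638")
position_uri = surrogate_key_uri("position", "34421840245", "4297089638", "Chairman of the Board")
```

A hash keeps re-runs of the load stable, which matters for the idempotence point made later in the deck; a random GUID would need to be stored and looked up instead.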
22. FULL TEXT
• Link to source (Retain confidence)
• Provenance in quad for updates
<https://data.tr.com/sc/4297089638_4295869694>
a tr-sc:SupplyChainAgreement ;
tr-sc:aggregateConfidence "0.9999976445274502"^^xs
tr-sc:supplier <https://permid.org/1-4297089638>;
tr-sc:customer <https://permid.org/1-4295869694>.
<https://data.tr.com/sc/snippet/4297089638_429586969
a tr-sc:Snippet ;
tr-sc:snippetText "~~~Tesla~~~ is supplying electr
tr-sc:confidence "0.999"^^xsd:float;
tr-sc:source "nL1N0IL11N-2013-10-31"^^xsd:string.
Article primary key
23. PROVENANCE IS INVALUABLE
• W3C Provenance applied by named graph:
• Can also be used to model bi-temporality if needed
• Example
Source A states <S>, <P>, “O”
Source B states <S>, <P>, <O>
Append unique named graph on load:
<S>, <P>, “O”, <G1>
<S>, <P>, <O>, <G2>
<G1>, <prov:wasGeneratedBy>, “Snaplogic”
<G1>, <prov:wasDerivedFrom>, “Database source”
etc.
Graph URI could be hash of S,P,O or GUID, etc.
Consider idempotence and determinism
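A small sketch of the hashing option follows, assuming a made-up graph namespace `http://example.org/graph/` and hypothetical function names. It derives the named graph URI from S, P and O, so loading the same statement twice produces exactly the same quads.

```python
import hashlib

# Hypothetical namespace for named graphs.
GRAPH_NS = "http://example.org/graph/"
PROV = "http://www.w3.org/ns/prov#"


def graph_uri(s: str, p: str, o: str) -> str:
    """Deterministic named graph URI: a hash of S, P and O."""
    digest = hashlib.sha256("\x1f".join((s, p, o)).encode("utf-8")).hexdigest()
    return GRAPH_NS + digest


def load(statement, generated_by: str, derived_from: str):
    """Return the data quad plus provenance triples about its named graph."""
    s, p, o = statement
    g = graph_uri(s, p, o)
    quad = (s, p, o, g)
    provenance = {
        (g, PROV + "wasGeneratedBy", generated_by),
        (g, PROV + "wasDerivedFrom", derived_from),
    }
    return quad, provenance


statement = ("http://tr.com/orders/1", "http://ont.tr.com/orders/order_amount", "56.84")
first_run = load(statement, "Snaplogic", "Database source")
second_run = load(statement, "Snaplogic", "Database source")
```

With a GUID per load, `first_run` and `second_run` would differ and the store would accumulate a new graph on every run. Note that hashing only S, P and O gives two sources that state the identical triple the same graph; including the source in the hash is a possible variation.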
24. MODELLING BI-TEMPORALITY
• Not inherently supported in RDF
• Possible solutions
• Ignore!
• Model for particular values (potentially using blank nodes)
• Model on named graph
• Reification
• Use RDF* & SPARQL* (Reification Done Right - only in BlazeGraph…)
[Figure: specific model approach. Organization “Apple” Has Name “Apple Computer” (From: 1977-03-01, To: 2007-09-01) and Has Name “Apple Inc” (From: 2007-09-01).]
org:2-xyz {
org:1-4295905573 org:hasName "Apple Computer" .
}
org:2-xyz
bt:effectiveFrom "1977-03-01"^^xsd:date ;
bt:effectiveTo "2007-09-01"^^xsd:date .
Named Graph Approach
Temporality on Named Graph
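The named graph approach can be illustrated with an "as of" lookup in plain Python. This is a sketch only: the second graph name `org:2-abc` is invented to hold the later name, and treating the interval as half-open (from inclusive, to exclusive) is an assumption.

```python
from datetime import date

# Validity per named graph: (effectiveFrom, effectiveTo); None means open ended.
graph_validity = {
    "org:2-xyz": (date(1977, 3, 1), date(2007, 9, 1)),
    "org:2-abc": (date(2007, 9, 1), None),  # hypothetical graph for the later name
}

# Quads: (subject, predicate, object, named graph).
quads = [
    ("org:1-4295905573", "org:hasName", "Apple Computer", "org:2-xyz"),
    ("org:1-4295905573", "org:hasName", "Apple Inc", "org:2-abc"),
]


def as_of(quads, validity, when: date):
    """Return the triples whose named graph was effective on the given date."""
    result = []
    for s, p, o, g in quads:
        start, end = validity[g]
        if start <= when and (end is None or when < end):
            result.append((s, p, o))
    return result


name_in_2000 = as_of(quads, graph_validity, date(2000, 1, 1))
name_in_2010 = as_of(quads, graph_validity, date(2010, 1, 1))
```

The data triples themselves stay untouched; all temporal information hangs off the graph name, which is why the same mechanism can carry provenance as well.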
26. RDF IS DIFFERENT, IA IS KING
• Early education is key
• Strong information architecture really helps
• Modeling tools
• OWL invaluable, consider SHACL
[Figure: closed world on top of open world]
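The idea of layering a closed-world check over open-world data can be shown without any RDF tooling. The sketch below is not SHACL syntax and not a SHACL engine; it only mimics a "required property" constraint, and the function name and sample subjects under `https://example.org/` are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Open world: subjects may carry any predicates, and some may be missing.
triples = [
    ("https://example.org/org/1", RDF_TYPE, "tr-org:Organization"),
    ("https://example.org/org/1", "vcard:organization-name", "Tesla Inc"),
    ("https://example.org/org/1", "vcard:hasURL", "https://www.tesla.com/"),
    ("https://example.org/org/2", RDF_TYPE, "tr-org:Organization"),
    ("https://example.org/org/2", "vcard:hasURL", "https://example.org/"),
]


def missing_required(triples, target_class, required):
    """Closed-world check: report required predicates that each
    instance of target_class does not have."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == target_class}
    present = {}
    for s, p, o in triples:
        present.setdefault(s, set()).add(p)
    report = {}
    for s in instances:
        gaps = set(required) - present.get(s, set())
        if gaps:
            report[s] = sorted(gaps)
    return report


violations = missing_required(triples, "tr-org:Organization", {"vcard:organization-name"})
```

The store accepts both organizations; the check is applied afterwards and flags only the one that does not meet the product's expectations.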
27. MAPPING TO AUTHORITIES
Mapping approaches:
• Simple match
• Fuzzy match (Soundex, Levenshtein)
• Full text search
• Normalize then search/match
• Concordance (TAMR etc)
• Ensemble of the above
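A toy ensemble of three of these approaches (normalize, simple match, then Levenshtein) might look like the following. The suffix list, the distance threshold and the authority entry `urn:example:apple` are assumptions for illustration; a production matcher would be considerably more careful.

```python
import re
import unicodedata

# Hypothetical list of legal suffixes to drop during normalization.
SUFFIXES = {"inc", "corp", "ltd", "plc", "co"}


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, drop legal suffixes."""
    text = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii").lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(t for t in text.split() if t not in SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_match(name: str, authority: dict, max_distance: int = 2):
    """Try a simple match on normalized names, then fall back to fuzzy match."""
    key = normalize(name)
    for uri, label in authority.items():
        if normalize(label) == key:
            return uri
    scored = sorted((levenshtein(key, normalize(label)), uri) for uri, label in authority.items())
    if scored and scored[0][0] <= max_distance:
        return scored[0][1]
    return None


authority = {
    "https://permid.org/1-4297089638": "Tesla Inc",
    "urn:example:apple": "Apple Inc",  # hypothetical identifier
}
```

For example, `best_match("TESLA, Inc.", authority)` resolves through the normalized simple match, while a misspelling such as "Tesls Inc" only resolves through the fuzzy fallback.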
28. STILL BLEEDING EDGE
• …but now being used in real world solutions
• Have clear goals
• Be prepared to change direction & solutions
• Getting easier as vendor solutions increase and mature
29. DON’T OVERTHINK ETL
• Doesn’t have to be within Hadoop
• Does have to be repeatable
• Repurpose existing ETL to treat RDF as a 3 column table
• An RDF REST API can be sufficient
• But
• Has to fit with overarching IA
• Need to accommodate idempotence & determinism (can’t be a different named graph on each run)