To support ubiquitous AI, a knowledge graph system will have to fuse and integrate data not just in representation, but in context (ontologies, metadata, domain knowledge, terminology systems) and in time (temporal relationships between components of data). This rich functional and contextual integration of multi-modal data, predictive modeling, and artificial intelligence is what distinguishes AllegroGraph 7 as a modern, scalable enterprise analytics platform.
AllegroGraph 7 is the first large temporal knowledge graph technology to encapsulate a novel entity-event model, natively integrated with domain ontologies and metadata, and dynamic ways of setting the analytic lens on all entities in the system (patients, persons, devices, transactions, events, and operations) as prime objects that can be the focus of an analytic (AI, ML, DL) process.
Scalable Knowledge Graphs using the New Distributed AllegroGraph 7
1. Scalable Knowledge Graphs
with the new Distributed AllegroGraph 7.0
And Gruff 8.0 in the Browser
Dr. Jans Aasman
(allegrograph.com)
2. Topics for today:
[1] Gruff 8.0 in the Browser.
[2] AllegroGraph Free and Enterprise Edition on AWS
[3] Distributed AllegroGraph
• Our customers and partners believe in
• Data Centric Computing
• Entity Event based Knowledge Graphs
• Distributed AllegroGraph supports DCC and EEKG
• Some Benchmarking
3. Gruff 8.0 (aka: Gruff in the Browser)
• A common customer request over the last few years:
please port Gruff to the browser.
• We ported the graphics layer of Gruff to HTML5;
the same code base runs the standalone and GruffJS
versions.
• In many ways faster than the thick client
(all the data copying happens locally on the server)
7. Or work with it locally
• Allegrograph.com/downloads
8. Distributed AllegroGraph 7.0
• Inspired by a telecom CRM use case with Amdocs
• Heavily tested in production at Montefiore hospital
• Now there are also use cases in call centers, finance, and marketing.
• Most useful if you believe in
• Data Centric Computing
• And Entity/Event based Knowledge Graphs
• Semantic Technology
9. Data Centric Computing
Data centric refers to an architecture where data is the primary and permanent asset, and applications come and go. In the data centric architecture, the data model precedes the implementation of any given application and will be around and valid long after it is gone.
Many think this is what happens now, but it rarely happens this way. Businesses want functionality, and they purchase or build application systems. Each application system has its own data model, and its code is inextricably tied to this data model. It is extremely difficult to change the data model of an implemented application system.
Of course, this application is only one of hundreds or thousands of such …. Each application on its own has hundreds to thousands of tables and tens of thousands of attributes. These applications are very partially and very unstably “interfaced” to one another through some middleware …
The data centric approach turns all this on its head. There is a semantic data model, and each bit of application functionality reads and writes through the shared model (a sketch follows below).
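To make this concrete, here is a minimal SPARQL sketch of two applications working through one shared semantic model. All URIs below are hypothetical placeholders, not a vocabulary from this deck:

    PREFIX ex: <http://example.org/model#>

    # Application A (billing) writes through the shared model:
    INSERT DATA {
      ex:invoice9 a ex:Invoice ;
          ex:customer ex:customer3 ;
          ex:amount   120.50 .
    }

    # Application B (analytics), as a separate request, reads the very
    # same model; no application-private schema sits in between:
    SELECT ?customer (SUM(?amount) AS ?total)
    WHERE {
      ?inv a ex:Invoice ;
           ex:customer ?customer ;
           ex:amount   ?amount .
    }
    GROUP BY ?customer

Either application can be retired and replaced without touching the data model.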
13. Events directly connect to Entities
The KG provides radical simplification of the EDW schema:
we turn everything into an EVENT
• Healthcare: everything that can happen to a patient is a time based (sub) event:
Check In, Check Out, Test, Diagnosis, Procedure, Medication administration,
Medication order, Sensor reading for vital signs, Invoice, Bill payment, Non-bill
payment, all insurance interactions.
• Telecom: everything that happens with a telco user is a time based (sub) event:
telephone call, sms, whatsapp, web site visit, location record, crm call, bill pay,
non-bill pay …
• Even your demographic features are events (name, address, etc.)
• From thousands of tables to one event table (well, a graph); see the sketch below
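As a minimal sketch, a patient entity with two time-based events could look like this in SPARQL. Every URI here is a hypothetical placeholder; the actual EEKG vocabulary is not shown in this deck:

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    INSERT DATA {
      # The entity: a patient.
      ex:patient17 a ex:Patient .

      # A diagnosis event, connected directly to the patient entity.
      ex:event421 a ex:DiagnosisEvent ;
          ex:patient   ex:patient17 ;
          ex:startTime "2011-03-02T09:30:00"^^xsd:dateTime ;
          ex:code      ex:icd9_4280 .   # points into the terminology KB

      # A medication administration as a time-based sub-event.
      ex:event422 a ex:MedicationAdministrationEvent ;
          ex:patient   ex:patient17 ;
          ex:partOf    ex:event421 ;
          ex:startTime "2011-03-02T11:00:00"^^xsd:dateTime .
    }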
14. Entity Event Knowledge Graphs implement DCC
Core features of the EEKG
• Entities and Events are represented as Hierarchical Trees
• Entity-Event trees are logically and physically grouped
• The tree terminates in a knowledge base for enterprise
taxonomies, thesauri, and domain knowledge
15. Structured patient data combined with complex integrated terminology
[Diagram: patient data linked into the integrated terminology systems OMOP, MTH, SNOMED CT, and ICD9/ICD10.]
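Assuming the event model sketched above and a terminology graph with SKOS labels (an assumption; the real terminology systems have richer structure), a query can walk from patient events straight into the integrated terminology:

    PREFIX ex:   <http://example.org/eekg#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # Every diagnosis event for one patient, resolving the raw code
    # to its preferred label in the terminology knowledge base.
    SELECT ?event ?time ?label
    WHERE {
      ?event a ex:DiagnosisEvent ;
             ex:patient   ex:patient17 ;
             ex:startTime ?time ;
             ex:code      ?code .
      ?code skos:prefLabel ?label .
    }
    ORDER BY ?time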
21. Benefit: One-time mapping and ETL
• But no big bang is required:
• The EEKG can be built incrementally
• Start with an ontology of your core entities and events
• Start with a knowledge graph that contains your digital assets
• Then map input streams or tables/columns to a small target ontology of
entities and events (sketched below)
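A minimal sketch of such a mapping as a SPARQL update, assuming a hypothetical staging vocabulary (src:) for a loaded input stream and the hypothetical target ontology (ex:) used above:

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX src: <http://example.org/staging#>

    # Map staged encounter rows onto the small target ontology
    # of entities and events.
    INSERT {
      ?event a ex:CheckInEvent ;
             ex:patient   ?patient ;
             ex:startTime ?ts .
    }
    WHERE {
      ?row a src:EncounterRow ;
           src:patientId        ?pid ;
           src:checkinTimestamp ?ts .
      BIND(IRI(CONCAT(STR(ex:patient), STR(?pid))) AS ?patient)
      BIND(IRI(CONCAT(STR(ex:event), STR(?pid), "-", STR(?ts))) AS ?event)
    }

Each new input stream only needs its own small mapping of this shape; the target ontology stays put.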
24. Triple Attributes
• Every triple can have an arbitrary number of key/value pairs.
• Meets the requirements of government security models & HIPAA
25. And one incredible benefit
• Enables horizontal scaling through sharding
• The 4th element or a triple attribute becomes the sharding key (example below)
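As a hedged illustration only: AllegroGraph can attach attributes to triples at load time, in the style of its extended N-Quads format. The exact syntax below is an assumption, so consult the product documentation; the point is that a triple carries a JSON-like set of key/value attributes:

    # Hypothetical extended N-Quads line: a triple followed by its attributes.
    <http://ex.org/patient17> <http://ex.org/hasEvent> <http://ex.org/event421> {"patientId": ["17"], "securityLevel": ["low"]} .

With patientId as the shard key, every triple of a patient's entity-event tree lands in the same shard, and attributes like securityLevel can drive HIPAA-style access filters.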
26. Distributed AllegroGraph with FedShard™
• Entity-Event trees go into shards
with entity-id as shard key
• Shards get federated with
non-shardable knowledge bases.
27. First something about AllegroGraph Federation
• Open multiple repositories at the same time
• Use virtual cursors that issue ‘get-triple :s :p :o :g’ queries to each repo.
• SPARQL doesn’t even need to know that it is talking to multiple sources
[Diagram: a SPARQL query running against a federated store that combines dbpedia, geonames, and Census 2000 repositories.]
28. Example of AllegroGraph Federation
• What is the median income of the area where Barack
Obama was born?
• What does it do:
• Find in dbpedia Obama's birthplace and its geonamesID
• Then find in geonames the geonamesIDs within ten miles of that place
• Then find in census2000 the mediumIncomeAbove16YearsAndOlder for
each of those IDs (a query sketch follows below)
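A sketch of what that query could look like against the federated store. dbo:birthPlace is a real DBpedia property; the gn: and c2k: predicates below are hypothetical stand-ins, and the ten-mile radius would in practice use AllegroGraph's geospatial support:

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX gn:  <http://example.org/geonames#>    # hypothetical
    PREFIX c2k: <http://example.org/census2000#>  # hypothetical

    SELECT ?area ?income
    WHERE {
      dbr:Barack_Obama dbo:birthPlace ?place .    # answered by dbpedia
      ?place gn:geonamesID ?gid .                 # bridge into geonames
      ?area  gn:withinTenMiles ?gid .             # stand-in for a geospatial filter
      ?area  c2k:mediumIncomeAbove16YearsAndOlder ?income .   # answered by census2000
    }

The federated store answers each pattern from whichever repository holds the matching triples; the SPARQL text itself never names the sources.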
36. Scalability experiment with hospital data
Cluster:
• The cluster consists of 8 nodes. Each node has 256 GB of RAM and 10 TB of
spinning disk.
• Each node runs one AllegroGraph server, and per server we have 4 patient
repositories or 'shards'.
• Each shard contains about 550 M triples = 17.7 billion triples total.
Supermicro:
• 500 GB of RAM, running entirely on SSDs. We tested max IO and got to
120,000 IOs per second.
• This is a single repository containing all patients plus the terminology system.
HOWEVER: I deleted all the provenance triples because keeping those in the
database made queries slower. About 4 billion triples.
37. Scalability Examples
• Attrition: answer how many patients we lost (for any reason) from
2010 to 2011. This is a medium-complexity query, as it
contains some interesting unions, joins, and filters (an illustrative
sketch follows below).
• Demographics: look up demographics for all patients; low complexity.
• Diagnostics: look up every diagnostic in the database for
inpatient/outpatient encounters to feed a diagnostics pipeline; low complexity.
• CDRN: an example of a medium- to high-complexity ETL query used for
the NYC CDRN project.
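Purely to illustrate the shape of the attrition query (the deck does not show the actual benchmark queries; vocabulary as in the earlier sketches): count patients with encounter events in 2010 but none in 2011.

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT (COUNT(DISTINCT ?patient) AS ?lost)
    WHERE {
      # Patients with at least one event in 2010...
      ?e2010 ex:patient ?patient ;
             ex:startTime ?t1 .
      FILTER (?t1 >= "2010-01-01T00:00:00"^^xsd:dateTime &&
              ?t1 <  "2011-01-01T00:00:00"^^xsd:dateTime)
      # ...and no event at all in 2011.
      FILTER NOT EXISTS {
        ?e2011 ex:patient ?patient ;
               ex:startTime ?t2 .
        FILTER (?t2 >= "2011-01-01T00:00:00"^^xsd:dateTime &&
                ?t2 <  "2012-01-01T00:00:00"^^xsd:dateTime)
      }
    }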
39. Query performance on the 38 PC queries that feed the
machine learning pipeline
• Yes: everything gets a lot faster on the cluster
• We need to optimize based on the number of partitions and the number of CPUs
40. So: can I check how busy the cluster is?
[Screenshot: cluster activity during 6 seconds of running the 38 proofcheck queries.]
42. Concluding [2]
• If you believe in
• The Data Centric Computing Approach
• Entity Event Knowledge Graphs
• Semantics
• Then try it out for yourself
• Call us and we'll get you on your way
• And: try our demo server at: gruff.allegrograph.com:10035