To support ubiquitous AI, a knowledge graph system will have to fuse and integrate data not just in representation, but in context (ontologies, metadata, domain knowledge, terminology systems) and in time (temporal relationships between components of data). This rich functional and contextual integration of multi-modal data, predictive modeling, and artificial intelligence is what distinguishes AllegroGraph 7 as a modern, scalable enterprise analytics platform.
AllegroGraph 7 is the first large temporal knowledge graph technology to encapsulate a novel entity-event model, natively integrated with domain ontologies and metadata, and dynamic ways of setting the analytic lens on all entities in the system (patients, persons, devices, transactions, events, and operations) as prime objects that can be the focus of an analytic (AI, ML, DL) process.
Scalable Knowledge Graphs using the New Distributed AllegroGraph 7
1. Scalable Knowledge Graphs
with the new Distributed AllegroGraph 7.0
And Gruff 8.0 in the Browser
Dr. Jans Aasman
(allegrograph.com)
2. Topics for today:
[1] Gruff 8.0 in the Browser.
[2] AllegroGraph Free and Enterprise Edition on AWS
[3] Distributed AllegroGraph
• Our customers and partners believe in
• Data Centric Computing
• Entity Event based Knowledge Graphs
• Distributed AllegroGraph supports DCC and EEKG
• Some Benchmarking
3. Gruff 8.0 (aka: Gruff in the Browser)
• A common customer request over the last few years:
please port Gruff to the browser.
• We ported the graphics layer of Gruff to HTML5;
the same code base runs the standalone and GruffJS
versions.
• In many ways faster than the thick client
(all the data copying happens locally on the server)
7. Or work with it locally
• Allegrograph.com/downloads
8. Distributed AllegroGraph 7.0
• Inspired by a telecom CRM use case with Amdocs
• Heavily tested in production at Montefiore hospital
• Now there are also use cases in call centers, finance, and marketing.
• Most useful if you believe in
• Data Centric Computing
• And Entity/Event based Knowledge Graphs
• Semantic Technology
9. Data Centric Computing
Data centric refers to an architecture where data is the primary and permanent asset, and applications come and go. In the data centric architecture, the data model precedes the implementation of any given application and will be around and valid long after it is gone.
Many think this is what happens now, but it rarely happens this way. Businesses want functionality, and they purchase or build application systems. Each application system has its own data model, and its code is inextricably tied to this data model. It is extremely difficult to change the data model of an implemented application system.
Of course, this application is only one of hundreds or thousands of such …. Each application on its own has hundreds to thousands of tables and tens of thousands of attributes. These applications are very partially and very unstably “interfaced” to one another through some middleware …
The data centric approach turns all this on its head. There is a semantic data model, and each bit of application functionality reads and writes through the shared model (a sketch follows below).
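To make this concrete, here is a minimal SPARQL sketch of two applications working through one shared semantic model. All URIs below are hypothetical placeholders, not a vocabulary from this deck:

    PREFIX ex: <http://example.org/model#>

    # Application A (billing) writes through the shared model:
    INSERT DATA {
      ex:invoice9 a ex:Invoice ;
          ex:customer ex:customer3 ;
          ex:amount   120.50 .
    }

    # Application B (analytics), as a separate request, reads the very
    # same model; no application-private schema sits in between:
    SELECT ?customer (SUM(?amount) AS ?total)
    WHERE {
      ?inv a ex:Invoice ;
           ex:customer ?customer ;
           ex:amount   ?amount .
    }
    GROUP BY ?customer

Either application can be retired and replaced without touching the data model.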
13. Events directly connect to Entities
The KG provides radical simplification of the EDW schema:
we turn everything into an EVENT
• Healthcare: everything that can happen to a patient is a time based (sub) event:
Check In, Check Out, Test, Diagnosis, Procedure, Medication administration,
Medication order, Sensor reading for vital signs, Invoice, Bill payment, Non-bill
payment, all insurance interactions.
• Telecom: everything that happens with a telco user is a time based (sub) event:
telephone call, sms, whatsapp, web site visit, location record, crm call, bill pay,
non-bill pay …
• Even your demographic features are events (name, address, etc.)
• From thousands of tables to one event table (well, a graph); see the sketch below
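As a minimal sketch, a patient entity with two time-based events could look like this in SPARQL. Every URI here is a hypothetical placeholder; the actual EEKG vocabulary is not shown in this deck:

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    INSERT DATA {
      # The entity: a patient.
      ex:patient17 a ex:Patient .

      # A diagnosis event, connected directly to the patient entity.
      ex:event421 a ex:DiagnosisEvent ;
          ex:patient   ex:patient17 ;
          ex:startTime "2011-03-02T09:30:00"^^xsd:dateTime ;
          ex:code      ex:icd9_4280 .   # points into the terminology KB

      # A medication administration as a time-based sub-event.
      ex:event422 a ex:MedicationAdministrationEvent ;
          ex:patient   ex:patient17 ;
          ex:partOf    ex:event421 ;
          ex:startTime "2011-03-02T11:00:00"^^xsd:dateTime .
    }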
14. Entity Event Knowledge Graphs implement DCC
Core features of the EEKG
• Entities and Events are represented as Hierarchical Trees
• Entity-Event trees are logically and physically grouped
• The tree terminates in a knowledge base for enterprise
taxonomies, thesauri, and domain knowledge
15. Structured patient data combined with complex integrated terminology
[Diagram: patient data linked into the integrated terminology systems OMOP, MTH, SNOMED CT, and ICD9/ICD10.]
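Assuming the event model sketched above and a terminology graph with SKOS labels (an assumption; the real terminology systems have richer structure), a query can walk from patient events straight into the integrated terminology:

    PREFIX ex:   <http://example.org/eekg#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # Every diagnosis event for one patient, resolving the raw code
    # to its preferred label in the terminology knowledge base.
    SELECT ?event ?time ?label
    WHERE {
      ?event a ex:DiagnosisEvent ;
             ex:patient   ex:patient17 ;
             ex:startTime ?time ;
             ex:code      ?code .
      ?code skos:prefLabel ?label .
    }
    ORDER BY ?time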
21. Benefit: One-time mapping and ETL
• But no big bang is required:
• The EEKG can be built incrementally
• Start with an ontology of your core entities and events
• Start with a knowledge graph that contains your digital assets
• Then map input streams or tables/columns to a small target ontology of
entities and events (sketched below)
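A minimal sketch of such a mapping as a SPARQL update, assuming a hypothetical staging vocabulary (src:) for a loaded input stream and the hypothetical target ontology (ex:) used above:

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX src: <http://example.org/staging#>

    # Map staged encounter rows onto the small target ontology
    # of entities and events.
    INSERT {
      ?event a ex:CheckInEvent ;
             ex:patient   ?patient ;
             ex:startTime ?ts .
    }
    WHERE {
      ?row a src:EncounterRow ;
           src:patientId        ?pid ;
           src:checkinTimestamp ?ts .
      BIND(IRI(CONCAT(STR(ex:patient), STR(?pid))) AS ?patient)
      BIND(IRI(CONCAT(STR(ex:event), STR(?pid), "-", STR(?ts))) AS ?event)
    }

Each new input stream only needs its own small mapping of this shape; the target ontology stays put.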
24. Triple Attributes
• Every triple can have an arbitrary number of key/value pairs.
• Meets the requirements of government security models & HIPAA
25. And one incredible benefit
• Enables horizontal scaling through sharding
• The 4th element or a triple attribute becomes the sharding key (example below)
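As a hedged illustration only: AllegroGraph can attach attributes to triples at load time, in the style of its extended N-Quads format. The exact syntax below is an assumption, so consult the product documentation; the point is that a triple carries a JSON-like set of key/value attributes:

    # Hypothetical extended N-Quads line: a triple followed by its attributes.
    <http://ex.org/patient17> <http://ex.org/hasEvent> <http://ex.org/event421> {"patientId": ["17"], "securityLevel": ["low"]} .

With patientId as the shard key, every triple of a patient's entity-event tree lands in the same shard, and attributes like securityLevel can drive HIPAA-style access filters.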
26. Distributed AllegroGraph with FedShard™
• Entity-Event trees go into shards
with entity-id as shard key
• Shards get federated with
non-shardable knowledge bases.
27. First something about AllegroGraph Federation
• Open multiple repositories at the same time
• Use virtual cursors that issue ‘get-triple :s :p :o :g’ queries to each repo.
• SPARQL doesn’t even need to know that it is talking to multiple sources
[Diagram: a SPARQL query running against a federated store that combines dbpedia, geonames, and Census 2000 repositories.]
28. Example of AllegroGraph Federation
• What is the median income of the area where Barack
Obama was born?
• What does it do:
• Find in dbpedia Obama's birthplace and its geonamesID
• Then find in geonames the geonamesIDs within ten miles of that place
• Then find in census2000 the mediumIncomeAbove16YearsAndOlder for
each of those IDs (a query sketch follows below)
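A sketch of what that query could look like against the federated store. dbo:birthPlace is a real DBpedia property; the gn: and c2k: predicates below are hypothetical stand-ins, and the ten-mile radius would in practice use AllegroGraph's geospatial support:

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX gn:  <http://example.org/geonames#>    # hypothetical
    PREFIX c2k: <http://example.org/census2000#>  # hypothetical

    SELECT ?area ?income
    WHERE {
      dbr:Barack_Obama dbo:birthPlace ?place .    # answered by dbpedia
      ?place gn:geonamesID ?gid .                 # bridge into geonames
      ?area  gn:withinTenMiles ?gid .             # stand-in for a geospatial filter
      ?area  c2k:mediumIncomeAbove16YearsAndOlder ?income .   # answered by census2000
    }

The federated store answers each pattern from whichever repository holds the matching triples; the SPARQL text itself never names the sources.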
36. Scalability experiment with hospital data
Cluster:
• The cluster consists of 8 nodes. Each node has 256 GB of RAM and 10 TB of
spinning disk.
• Each node runs one AllegroGraph server, and per server we have 4 patient
repositories or 'shards'.
• Each shard contains about 550 M triples = 17.7 billion triples total.
Supermicro:
• 500 GB of RAM, running entirely on SSDs. We tested max IO and got to
120,000 IOs per second.
• This is a single repository containing all patients plus the terminology system.
HOWEVER: I deleted all the provenance triples because keeping those in the
database made queries slower. About 4 billion triples.
37. Scalability Examples
• Attrition: answer how many patients we lost (for any reason) from
2010 to 2011. This is a medium-complexity query, as it
contains some interesting unions, joins, and filters (an illustrative
sketch follows below).
• Demographics: look up demographics for all patients; low complexity.
• Diagnostics: look up every diagnostic in the database for
inpatient/outpatient encounters to feed a diagnostics pipeline; low complexity.
• CDRN: an example of a medium- to high-complexity ETL query used for
the NYC CDRN project.
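Purely to illustrate the shape of the attrition query (the deck does not show the actual benchmark queries; vocabulary as in the earlier sketches): count patients with encounter events in 2010 but none in 2011.

    PREFIX ex:  <http://example.org/eekg#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT (COUNT(DISTINCT ?patient) AS ?lost)
    WHERE {
      # Patients with at least one event in 2010...
      ?e2010 ex:patient ?patient ;
             ex:startTime ?t1 .
      FILTER (?t1 >= "2010-01-01T00:00:00"^^xsd:dateTime &&
              ?t1 <  "2011-01-01T00:00:00"^^xsd:dateTime)
      # ...and no event at all in 2011.
      FILTER NOT EXISTS {
        ?e2011 ex:patient ?patient ;
               ex:startTime ?t2 .
        FILTER (?t2 >= "2011-01-01T00:00:00"^^xsd:dateTime &&
                ?t2 <  "2012-01-01T00:00:00"^^xsd:dateTime)
      }
    }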
39. Query performance on the 38 PC queries that feed the
machine learning pipeline
• Yes: everything gets a lot faster on the cluster
• We need to optimize based on the number of partitions and the number of CPUs
40. So: can I check how busy the cluster is?
[Screenshot: cluster activity during 6 seconds of running the 38 proofcheck queries.]
42. Concluding [2]
• If you believe in
• The Data Centric Computing Approach
• Entity Event Knowledge Graphs
• Semantics
• Then try it out for yourself
• Call us and we'll get you on your way
• And: try our demo server at: gruff.allegrograph.com:10035