Scalable Knowledge Graphs using the New Distributed AllegroGraph 7

To support ubiquitous AI, a knowledge graph system will have to fuse and integrate data, not just in representation, but in context (ontologies, metadata, domain knowledge, terminology systems), and time (temporal relationships between components of data). The rich functional and contextual integration of multi-modal, predictive modeling, and artificial intelligence is what distinguishes AllegroGraph 7 as a modern, scalable, enterprise analytic platform.

AllegroGraph 7 is the first big temporal knowledge graph technology that encapsulates a novel entity-event model natively integrated with domain ontologies and metadata, and dynamic ways of setting the analytics lens on all entities in the system (patient, person, devices, transactions, events, and operations) as prime objects that can be the focus of an analytic (AI, ML, DL) process.

Published in: Technology

  1. 1. Scalable Knowledge Graphs with the new Distributed AllegroGraph 7.0 and Gruff 8.0 in the Browser • Dr. Jans Aasman (allegrograph.com)
  2. 2. Topics for today: [1] Gruff 8.0 in the Browser. [2] AllegroGraph Free and Enterprise Edition on AWS. [3] Distributed AllegroGraph • Our customers and partners believe in • Data Centric Computing • Entity Event based Knowledge Graphs • Distributed AllegroGraph supports DCC and EEKG • Some Benchmarking
  3. 3. Gruff 8.0 (aka Gruff in the Browser) • A common customer request over the last few years: please port Gruff to the browser. • We ported the Gruff graphics layer to HTML5; the same code base runs both the standalone and the GruffJS version • In many ways faster than the thick client [all the data copying happens locally on the server]
  4. 4. If you feel the urge to try it out:
  5. 5. Or work with it locally • Allegrograph.com/downloads
  6. 6. Distributed AllegroGraph 7.0 • Inspired by a telecom CRM use case with Amdocs • Heavily tested in production at Montefiore hospital • Now also use cases in a call center, finance, and marketing. • Most useful if you believe in • Data Centric Computing • And Entity/Event based Knowledge Graphs • Semantic Technology
  7. 7. Data Centric Computing: Data centric refers to an architecture where data is the primary and permanent asset, and applications come and go. In the data centric architecture, the data model precedes the implementation of any given application and will be around and valid long after it is gone. Many think this is what happens now but it rarely happens this way. Businesses want functionality, and they purchase or build application systems. Each application system has its own data model, and its code is inextricably tied with this data model. It is extremely difficult to change the data model of an implemented application system. Of course, this application is only one of hundreds or thousands of such …. Each application on its own has hundreds to thousands of tables and tens of thousands of attributes. These applications are very partially and very unstably “interfaced” to one another through some middleware … The data centric approach turns all this on its head. There is a semantic data model and each bit of application functionality reads and writes through the shared model.
  8. 8. Data Centric Computing and an event based knowledge graph in healthcare
  9. 9. Knowledge Graph Roadmap in Montefiore • Developed using grant funding (NHLBI, PCORI, and ICTR) and Intel/Franz collaboration • Fully integrated with Epic, and hosted by MIT Data Center. • Go-Live January 2017 with Respiratory Failure and Mortality Prediction → Prevention • C: Sepsis, HF, Spinal Cord Compression (etc) all with associated ROI • Wide and multi-disciplinary spectrum of applications confirms a platform approach • Timeline figure (2017 to 2021): Respiratory Failure; Mortality; Sepsis; Heart Failure; Spinal Cord Compression; Fractures, Osteoporosis; Outpatient Appointments; Malpractice; ED utilization; Discharge, Bed Reassignment; Net Revenue Per Encounter; Claim Rejection Prevention • Proprietary and Confidential, Copyright 2018 © Montefiore Health System, All Rights Reserved
  10. 10. Events directly connect to Entity The KG provides radical simplification of the EDW schema: we turn everything into an EVENT • Healthcare: everything that can happen to a patient is a time based (sub) event: Check In, Check Out, Test, Diagnosis, Procedure, Medication administration, Medication order, Sensor reading for vital signs, Invoice, Bill payment, Non-bill payment, all insurance interactions. • Telecom: everything that happens with a telco user is a time based (sub) event: telephone call, sms, whatsapp, web site visit, location record, crm call, bill pay, non-bill pay … • Even your demographic features are events (names, address, etc) • From thousands of tables to one event table (well, graph)
  11. 11. Entity Event Knowledge Graphs implement DCC Core features of the EEKG • Entities and Events are represented as Hierarchical Trees • Entity-Event trees are logically and physically grouped • The tree terminates in a knowledge base for enterprise taxonomies, thesauri, and domain knowledge
  12. 12. Structured patient data combined with complex integrated terminology (diagram labels: Patient Data, OMOP, MTH, Snomed CT, ICD9/ICD10)
  13. 13. The entity-event tree in Healthcare
  14. 14. The entity-event tree in a Telco
  15. 15. Benefit: Easy to understand the model, easy to do complex queries
  16. 16. Benefit: provide 360 overview in milliseconds and in a one-liner
  17. 17. Benefit: One time mapping and ETL • But no big bang required: • EEKG can be built incrementally • Start with an ontology of your core entities and events • Start with a knowledge graph that contains your digital assets • Then map input-streams or table/columns to a small target ontology of entities and events.
  18. 18. Benefit: Provenance at every level
  19. 19. Security for every triple (edge)
  20. 20. Triple Attributes • Every triple can have an arbitrary number of key/value pairs. • Meets requirements of government security model & HIPAA
  21. 21. And one incredible benefit • Enables horizontal scaling through sharding • 4th element or triple attribute becomes sharding key
  22. 22. Distributed AllegroGraph with Fedshard™ • Entity-Event trees go into shards with entity-id as shard key • Shards get federated with non-shardable knowledge bases.
  23. 23. First something about AllegroGraph Federation • Open multiple repositories at the same time • Use virtual cursors that issue ‘get-triple :s :p :o :g’ queries to each repo. • SPARQL doesn’t even need to know that it is talking to multiple sources • Diagram: the dbpedia, geonames, and Census 2000 repositories sit behind one federated store that is queried with SPARQL
  24. 24. Example of AllegroGraph Federation • What is the median income of the area where Barack Obama was born? • What does it do: • Find in dbpedia Obama’s birth place and its geonameID • Then find in geonames the geonamesIDs within ten miles of the place above • Then find in census2000 the mediumIncomeAbove16YearsAndOlder for each of those IDs
  25. 25. A quick demo
  26. 26. Open up multiple repos at the same time
  27. 27. After opening it is about 1.6 B statements
  28. 28. We use this federation in AG-7.0
  29. 29. Diagram labels • Shardable data: Shard #1 through Shard #n • Non Shardable Knowledge: Knowledge Base #1, Knowledge Base #2, Knowledge Base #3 • Federation: Federation #1 through Federation #n
  30. 30. AllegroGraph Distributed Architecture (diagram: Server 1, Server 2, Server 3, and more)
  31. 31. Scalability experiment with hospital data • Cluster: • The cluster consists of 8 nodes. Each node has 256 GB of RAM and 10 TB of spinning disk • Each node runs one AllegroGraph server, and per server we have 4 patient repositories or 'shards' • Each shard contains about 550 M triples = 17.7 billion triples total • Supermicro: • 500 GB of RAM and runs entirely on SSDs. We tested max I/O and reached 120,000 I/O operations per second • This is a single repository containing all patients plus the terminology system. However, I deleted all the provenance triples because keeping those in the database made queries slower. About 4 billion triples.
  32. 32. Scalability Examples • Attrition: answer how many patients we lost (for any reason) from the year 2010 to the year 2011. This is a medium complexity query as it contains some interesting unions, joins, and filters. • Demographics: look up demographics for all patients; low complexity. • Diagnostics: look up every diagnostic in the database for inpatient/outpatient encounters to feed diagnostics; low complexity. • CDRN: an example of a medium to high complexity ETL query used for the NYC CDRN project.
  33. 33. Scalability Numbers
  34. 34. Query performance on the 38 PC queries to feed machine learning pipeline • Yes: everything gets a lot faster on the cluster • We need to optimize based on # of partitions and # of cpus
  35. 35. So: can I check how busy the cluster is? 6 seconds of 38 proofcheck queries
  36. 36. Concluding [1]
  37. 37. Concluding [2] • If you believe in • The Data Centric Computing Approach • Entity Event Knowledge Graphs • Semantics • Then try it out for yourself • Call us and we will get you on your way • And: try our demo server at: gruff.allegrograph.com:10035
  38. 38. AllegroGraph Architecture • Documents: JSON, JSON-LD • Graphs: RDF, Quads, Properties • Storage: Triple Attributes, Security Filters, Compression, Indexing, Full-text • Transactions: “Real” ACID, 2 Phase Commit • Management: Security, Multi-Master Replication, Backup/Restore, Warm Failover • Stored Procs: JavaScript Lisp Prolog SPARQL Magic Predicates • Reasoning: RDFS++ OWL2-RL Prolog Probabilistic • NLP: Taxonomies Entity Extract Text Classify Sentiment Machine Learning Speech Recognition • ETL: RDBMS CSV TEXT NoSQL • Events: Geospatial Temporal Social REST • GUI: GRUFF/AGWebView Java Python Lisp Built-In Integrations • Cloud: Amazon AWS Microsoft Azure • Data Science: Anaconda R Studio • Knowledge: Linked Open Data • Editors: Ontology, Taxonomy • NoSQL: Cloudera, MongoDB, Solr • Containers: Docker, VMWare • Massively Parallel - Federation and Sharding • OSS Clients SPARQL Prolog
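
Slide 10 describes replacing many separate tables with time-based events attached directly to an entity. The following is a minimal Python sketch of that idea, not AllegroGraph code: the two source tables, their column names, and the helper functions `to_event` and `build_event_graph` are all hypothetical and exist only for illustration.

```python
from datetime import datetime

# Hypothetical rows from two separate source tables (illustrative only).
lab_tests = [{"patient": "p1", "test": "HbA1c", "time": "2017-01-05T09:30:00"}]
invoices = [{"patient": "p1", "amount": 120.0, "time": "2017-01-06T14:00:00"}]


def to_event(entity_id, event_type, time, **details):
    """Turn one table row into a uniform, time-stamped event on an entity."""
    return {
        "entity": entity_id,
        "type": event_type,
        "time": datetime.fromisoformat(time),
        "details": details,
    }


def build_event_graph(lab_rows, invoice_rows):
    """Group events from every source under the entity they belong to."""
    events = [
        to_event(r["patient"], "Test", r["time"], test=r["test"])
        for r in lab_rows
    ]
    events += [
        to_event(r["patient"], "Invoice", r["time"], amount=r["amount"])
        for r in invoice_rows
    ]
    graph = {}
    for event in sorted(events, key=lambda e: e["time"]):
        graph.setdefault(event["entity"], []).append(event)
    return graph


graph = build_event_graph(lab_tests, invoices)
# A patient's whole history is now a single lookup on the entity id.
timeline = [event["type"] for event in graph["p1"]]
```

In this sketch every source row becomes the same kind of record, so adding a new source means adding one more mapping rather than a new table shape, which is the incremental approach slide 17 argues for.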

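Slides 21 and 22 state that the entity id serves as the shard key, so that a whole entity-event tree lands in one shard. A rough Python sketch of that routing follows. The slides do not say how AllegroGraph assigns entities to shards; the CRC32-modulo scheme, the shard count, and the names `shard_for` and `distribute` are assumptions made for this example.

```python
import zlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(entity_id, num_shards=NUM_SHARDS):
    """Map an entity id to a shard index with a stable hash (assumed scheme)."""
    return zlib.crc32(entity_id.encode("utf-8")) % num_shards


def distribute(quads, num_shards=NUM_SHARDS):
    """Place each (s, p, o, entity_id) quad in the shard chosen by entity_id."""
    shards = [[] for _ in range(num_shards)]
    for s, p, o, entity_id in quads:
        shards[shard_for(entity_id, num_shards)].append((s, p, o))
    return shards


# Hypothetical event triples, each tagged with the patient they belong to.
quads = [
    ("event1", "type", "CheckIn", "patient-1"),
    ("event1", "time", "2017-01-05", "patient-1"),
    ("event2", "type", "Test", "patient-2"),
    ("event3", "type", "Invoice", "patient-1"),
]
shards = distribute(quads)
home = shards[shard_for("patient-1")]  # every patient-1 triple lives here
```

Because all triples for one entity share a shard, a question about a single patient only has to touch that shard; shared knowledge such as terminologies is not split this way and is federated with each shard instead, as slide 22 describes.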
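
Slide 23 describes federation as opening several repositories at once and sending the same triple pattern to each of them. The sketch below imitates that behaviour in plain Python. `Repository` and `FederatedStore` are hypothetical classes, not the AllegroGraph client API; the example data is invented, and the graph position (`:g`) of the pattern is left out for brevity.

```python
from itertools import chain


class Repository:
    """A toy in-memory triple store (hypothetical, for illustration)."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = list(triples)

    def get_triples(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self.triples:
            pattern = zip((s, p, o), triple)
            if all(want is None or want == got for want, got in pattern):
                yield triple


class FederatedStore:
    """Sends one pattern to every member repository and merges the results."""

    def __init__(self, repositories):
        self.repositories = list(repositories)

    def get_triples(self, s=None, p=None, o=None):
        return chain.from_iterable(
            repo.get_triples(s, p, o) for repo in self.repositories
        )


# Invented data spread over two repositories.
people = Repository("people", [("ex:alice", "ex:bornIn", "ex:town1")])
places = Repository("places", [("ex:town1", "ex:population", 1000)])
store = FederatedStore([people, places])

# A two-step lookup that crosses repositories without naming either one.
birthplace = next(store.get_triples(s="ex:alice", p="ex:bornIn"))[2]
populations = [o for _, _, o in store.get_triples(s=birthplace, p="ex:population")]
```

The caller only ever talks to `store`, which mirrors the point on the slide that the query layer does not need to know it is reading from multiple sources.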