Semantic Graph Databases: The Evolution of Relational Databases

  1. Graph - Why, What, How Barry Zane Vice President, Engineering
  2. ©2017 Cambridge Semantics Inc. All rights reserved. The Journey • Why do relational guys become graph guys? – Relational is GREAT – “We shape our tools, and are shaped by our tools” • When all you have is a hammer… – Why Graph is the next evolution • Do more, easier, faster, cheaper
  3. Real Relational Data Warehouse, Really • Relational Databases are predefined “rectangular” tables and rows with columns. – Very natural for subjects (aka rows) with a number of known attributes common to all/most of the subjects. – Allows columns to be links (aka keys) to other tables’ subjects. • Challenged by: – Sparsity – One-to-many needs a separate “join table” – You need to understand the data in advance • Graphs are real relational, really. Just a little different from the points above!
  4. Nodes/Subjects, Edges/Attributes, Values/Objects • Pretty picture, but what does it mean? • What is the data model?
  5. RDF Triples - Like Key-value Pairs (heterogeneous, unique, atomic, simple) JoeSmith LivesIn SanDiego; JoeSmith BirthDate 9/17/1975; JoeSmith IsSpouse MaryJones; JoeSmith HasChild BillSmith; JoeSmith HasChild JaneSmith; JoeSmith Attended EDW2016; JoeSmith Hobby “Hiking”; JoeSmith Bought Pants962; MaryJones LivesIn SanDiego; MaryJones BirthDate 7/10/1975; MaryJones IsSpouse JoeSmith; MaryJones HasChild BillSmith; MaryJones HasChild JaneSmith; MaryJones Attended Commicon16; MaryJones Bought Sweater48; MaryJones NickName “MJ”; ... Pants962 SKU 1934758967; Pants962 Color Brown; Pants962 Inseam 32; Pants962 Size 36; Pants962 BoughtBy JoeSmith; Pants962 BoughtBy MikeDoe; Sweater48 SKU 1963095898; Sweater48 Color Red; Sweater48 Size 6; Sweater48 BoughtBy MaryJones; SanDiego Pop 2456824; SanDiego Team Chargers; SanDiego Team Padres; SanDiego Climate “Perfect”; ... (RDF stands for Resource Description Framework… Triples!)
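The triple model on this slide can be sketched in a few lines of Python. This is only an illustration, not how any RDF store is implemented: the `TRIPLES` list copies a handful of the slide's rows, and the `describe` helper is a hypothetical name introduced here. It shows that different subjects can carry different attributes and that a multi-valued attribute such as `HasChild` needs no separate table.

```python
from collections import defaultdict

# A few of the slide's triples as (subject, predicate, object) tuples.
TRIPLES = [
    ("JoeSmith", "LivesIn", "SanDiego"),
    ("JoeSmith", "IsSpouse", "MaryJones"),
    ("JoeSmith", "HasChild", "BillSmith"),
    ("JoeSmith", "HasChild", "JaneSmith"),
    ("JoeSmith", "Bought", "Pants962"),
    ("MaryJones", "LivesIn", "SanDiego"),
    ("MaryJones", "NickName", "MJ"),
    ("Pants962", "Color", "Brown"),
    ("Pants962", "Inseam", 32),
    ("SanDiego", "Team", "Chargers"),
    ("SanDiego", "Team", "Padres"),
]


def describe(triples, subject):
    """Collect every predicate and its object values recorded for one subject."""
    attrs = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            attrs[p].append(o)
    return dict(attrs)


if __name__ == "__main__":
    # Each subject has its own set of attributes; nothing is declared in advance.
    for name in ("JoeSmith", "MaryJones", "Pants962", "SanDiego"):
        print(name, describe(TRIPLES, name))
```

Sparse attributes (only MaryJones has a `NickName`) simply have no triple for the other subjects, rather than an empty column.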
  6. SPARQL… Like SQL, but... • No explicit schema. The Ontology (fancy word for schema) is explicit in the data. • Further ontology information may also be called out in the data, such as inference rules. • Standard SQL aggregates, joins, etc., but simple and powerful relationship capabilities. • “How is Joe related to Mary” – In SQL Relational • Are they spouses? • Are they siblings? • Are they friends? • Do they have the same hobby? • … enumerate the choices, EXPLODES with degrees of separation – In SPARQL Graph • How is Joe related to Mary? • … you can directly specify degrees of separation • Pretty exciting, essentially all the power of SQL, but you can do more, with more diverse data, where the data tells you about itself, rather than you knowing in advance.
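The "degrees of separation" idea can be illustrated with a bounded breadth-first search over triples. This is a minimal sketch under stated assumptions, not the way a SPARQL engine evaluates such a query: the `connections` function and its `max_hops` parameter are hypothetical names, edges are followed in both directions, and the sample data is a small subset of the earlier slide.

```python
from collections import deque

TRIPLES = [
    ("JoeSmith", "IsSpouse", "MaryJones"),
    ("JoeSmith", "LivesIn", "SanDiego"),
    ("MaryJones", "LivesIn", "SanDiego"),
    ("JoeSmith", "HasChild", "BillSmith"),
    ("MaryJones", "HasChild", "BillSmith"),
]


def connections(triples, start, goal, max_hops):
    """Return every simple path from start to goal using at most max_hops edges.

    Edges are followed in either direction; each step in a path is the
    original (subject, predicate, object) triple that was traversed.
    """
    neighbors = {}
    for s, p, o in triples:
        neighbors.setdefault(s, []).append((o, (s, p, o)))
        neighbors.setdefault(o, []).append((s, (s, p, o)))

    found = []
    queue = deque([(start, [], {start})])
    while queue:
        node, path, seen = queue.popleft()
        if node == goal and path:
            found.append(path)
            continue
        if len(path) == max_hops:
            continue
        for nxt, triple in neighbors.get(node, []):
            if nxt not in seen:
                queue.append((nxt, path + [triple], seen | {nxt}))
    return found


if __name__ == "__main__":
    for hops in (1, 2):
        print(hops, connections(TRIPLES, "JoeSmith", "MaryJones", hops))
```

Note that the question never names a relationship type; the hop limit is the only thing the caller has to specify.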
  7. There Will Be a Quiz (Not Really) “How is Joe related to Mary?” SELECT * WHERE JoeSmith $connection MaryJones Result: JoeSmith IsSpouse MaryJones “What do Joe and Mary have in common, to the first degree?” SELECT $connection COUNT(*) WHERE JoeSmith $connection $thing MaryJones $connection $thing GROUP BY $connection Result: FriendsWith 45; Attended 342; LivesIn 1
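The two queries on this slide are written in simplified, SPARQL-like pseudocode. The sketch below mimics their pattern-matching behavior in plain Python so the idea can be run directly. It is an illustration only: `match` and `is_var` are hypothetical helpers, the join is a naive nested loop, and the sample data is a small subset of the earlier slide, so the counts differ from the slide's results.

```python
from collections import Counter

TRIPLES = [
    ("JoeSmith", "IsSpouse", "MaryJones"),
    ("MaryJones", "IsSpouse", "JoeSmith"),
    ("JoeSmith", "LivesIn", "SanDiego"),
    ("MaryJones", "LivesIn", "SanDiego"),
    ("JoeSmith", "HasChild", "BillSmith"),
    ("MaryJones", "HasChild", "BillSmith"),
    ("JoeSmith", "HasChild", "JaneSmith"),
    ("MaryJones", "HasChild", "JaneSmith"),
    ("JoeSmith", "Attended", "EDW2016"),
    ("MaryJones", "Attended", "Commicon16"),
]


def is_var(term):
    """Terms that start with '$' are variables, as in the slide's notation."""
    return isinstance(term, str) and term.startswith("$")


def match(triples, patterns, binding=None):
    """Yield variable bindings that satisfy every pattern (naive nested-loop join)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(triples, rest, candidate)


if __name__ == "__main__":
    # "How is Joe related to Mary?"
    print(list(match(TRIPLES, [("JoeSmith", "$connection", "MaryJones")])))

    # "What do Joe and Mary have in common, to the first degree?"
    shared = match(TRIPLES, [("JoeSmith", "$connection", "$thing"),
                             ("MaryJones", "$connection", "$thing")])
    print(Counter(b["$connection"] for b in shared))
```

Because `$connection` is a variable in the predicate position, the query discovers the relationship type instead of requiring it to be enumerated in advance.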
  8. And Yes, Standard SQL Analytics “What is the population and average personal income of each city?” SELECT $city count($person) avg($income) WHERE $person LivesIn $city $person Earns $income GROUP BY $city ORDER BY $city Result: Atlanta 647,465 34,459; Boston 856,123 42,654; Chicago 1,456,589 39,475
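The aggregate query above joins two triple patterns on `$person` and then groups by `$city`. A minimal Python sketch of the same logic follows. The people and incomes are made-up sample data, `city_stats` is a hypothetical name, and the sketch assumes each person has one `LivesIn` city and at most one `Earns` value.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data for illustration; not from the slides.
TRIPLES = [
    ("Ann", "LivesIn", "Boston"), ("Ann", "Earns", 40000),
    ("Bob", "LivesIn", "Boston"), ("Bob", "Earns", 50000),
    ("Cy", "LivesIn", "Atlanta"), ("Cy", "Earns", 30000),
    ("Dee", "LivesIn", "Atlanta"),  # no income recorded, so the join drops Dee
]


def city_stats(triples):
    """Return (city, person_count, average_income) rows, ordered by city."""
    # Assumption: one city per person, so a plain dict is enough here.
    lives_in = {s: o for s, p, o in triples if p == "LivesIn"}
    incomes = defaultdict(list)
    for s, p, o in triples:
        if p == "Earns" and s in lives_in:
            incomes[lives_in[s]].append(o)
    return [(city, len(vals), mean(vals)) for city, vals in sorted(incomes.items())]


if __name__ == "__main__":
    for row in city_stats(TRIPLES):
        print(row)
```

As with an inner join in SQL, a person who lacks either triple does not contribute to the count or the average.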
  9. RDF Opens a World of Data and Relationships • Created by the World Wide Web Consortium (W3C), the folks that bring us HTTP, HTML, XML, etc. Geared to the vast quantity and richness of the Internet. • Businesses and other organizations have much richer and varied data than they have been able to work with. • The trend has been toward the Data Swamp - bring it together and hope something can be gleaned from it. • RDF Triples are a simple way to describe and query nearly anything, even unstructured material. • Shameless Plugs: – Anzo Smart Data Lake - overlay layer on the data swamp to get meaning. – Anzo Smart Data Integration - ETL into SDL to make the swampy mess useful, without losing details. Applies semantic (aka schema) annotation & structure. – Anzo Graph Engine - Analytics at scale on the SDL at interactive speeds. – Anzo On the Web - Query & Visualize the results, without knowing SPARQL
  10. If This Is So Great, What Took So Long? • We’ve understood graph for a while, but graph had: – Terrible performance at scale. – No application building/visualization tools for non-programmers. – No ETL support. – Too much hubris. – Too much “NoSQL” noise in the channel. “If you want to teach people a new way of thinking, don’t bother trying to teach them. Instead, give them a tool, the use of which will lead to new ways of thinking.” R. Buckminster Fuller … but graph is not a new way of thinking, it maps to how you already think!
  11. Single Database Instance Across Many Nodes • Behaves just like a single-node database, but faster • More speed and more data by clustering • Massively Parallel Processing - each CPU ‘owns’ a slice of the data that it primarily operates on. Data is re-sliced as intermediate results during the query. • Not a new concept, has been evolving since the 1980s… Teradata, Netezza, Redshift, Hadoop...
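The slicing and re-slicing described on this slide can be sketched as hash partitioning. This is a generic illustration of the MPP idea and not a description of any particular product's internals: `owner` and `slice_by` are hypothetical names, and `zlib.crc32` is used only because it gives a hash that is stable across runs.

```python
import zlib

TRIPLES = [
    ("JoeSmith", "LivesIn", "SanDiego"),
    ("JoeSmith", "HasChild", "BillSmith"),
    ("MaryJones", "LivesIn", "SanDiego"),
    ("MaryJones", "HasChild", "BillSmith"),
    ("MikeDoe", "LivesIn", "Boston"),
]


def owner(key, n_workers):
    """Map a key to the worker that 'owns' it, using a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n_workers


def slice_by(triples, position, n_workers):
    """Distribute triples by the term at `position` (0=subject, 1=predicate, 2=object)."""
    slices = [[] for _ in range(n_workers)]
    for triple in triples:
        slices[owner(triple[position], n_workers)].append(triple)
    return slices


if __name__ == "__main__":
    # Initial layout: every triple about a given subject lives on one worker.
    by_subject = slice_by(TRIPLES, 0, n_workers=4)
    # Re-slice by object so rows that share an object (for example, the same
    # city) end up on the same worker and can be joined there locally.
    by_object = slice_by(TRIPLES, 2, n_workers=4)
    print(by_subject)
    print(by_object)
```

Re-slicing moves data between workers, which is one reason the later "Speed at Scale" slide stresses engineering for the interconnect.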
  12. Data Lake Subsets • The lake is the “database” • Multiple Graph Query Engine instances, usually on subsets • Short term instances - load, query, toss
  13. It Is All About Speed at Scale • Your Time can never be recovered. Your customers will find a better vendor. Your patients will not thrive. The bad guys will win. • How to get speed: – Leverage well-understood MPP concepts – Lessons from Netezza and ParAccel, technology is an evolution • Understand and engineer for the interconnect. • Memory is far faster than disk, so compress and be in-memory. • Memory is slow, run ‘close to the silicon’ by using dynamically generated code… try to do everything in machine registers. – Design specifically for Graph • Similar to relational, but different. • Engineer for Dynamic, Heterogeneous data typing. – But, keep it simple to deploy and use. • Done right, can be hundreds of times faster than other implementations or thousands of times faster than other big data approaches
  14. Illustrative Pharma Company Use Case