
An Overview of the Emerging Graph Landscape (Oct 2013)



Recent years have seen an explosion of technologies for managing, processing and analyzing graphs, ranging from community projects like Apache Giraph, to vendor-led products such as Neo4j, to spin-outs from established companies like Twitter’s FlockDB. The sheer number of technologies makes it difficult to keep track of these tools and what sets them apart, even for those of us who are active in the space!

But not all graph technologies are created equal. This session will provide a high-level framework for making sense of the emerging graph landscape. It will describe the three dominant graph data models today, define top-level categories like graph compute engines (GraphLab, Giraph, Pegasus, YarcData, etc.) and graph databases (Neo4j, FlockDB, OrientDB, etc.), and discuss common characteristics and important properties of each category.

Published in: Technology
  • What Emil has done would most likely fit well with the ONTONIX patented complexity measurement of structured systems and of how far from the critical chaos limit the system actually is.
  • What Neo4j has achieved in explaining the power and possibilities of graph databases is remarkable. Business-wise, it seems to me that they have also managed more than the entire Semantic Web altogether! They really have put the graph database, as a concept and as an available technology, on the Big Data map.
    Again, congratulations, and all the best as you carry on.

    However, the generalization about RDF triple stores in this overview is not quite accurate.
    They use URIs to identify nodes, but otherwise their building blocks are the same as Neo4j's: nodes, directed typed relationships and key-value pairs on nodes. It is correct that there is no native support for handling relationships as first-class citizens, but this can be achieved through graph design or by using named graphs. I will not go into the additional features that RDF stores have because they can be enhanced with OWL.
    The differences at the meta-meta level between property graphs and RDF stores are minor. SPARQL and Cypher are of the same kind as well. The real difference is that RDF stores are standards-based, which is of considerable value from a business investment perspective. I think the overview should mention that RDF graph databases are the only ones based on standards. All the others are proprietary.
    I would also question the choice of RDF store vendor examples. AllegroGraph and Sesame? Why leave out the biggest players and leaders: Virtuoso, OWLIM, Garlik, IBM DB2, Oracle? Also, RDF stores come in a variety of implementations, from SQL mappings to native graph stores.


  1. Neo Technology, Inc Confidential. An Overview Of The Emerging Graph Landscape. DataWeek, Oct 2, 2013. Emil Eifrem, @emileifrem, #neo4j. Wednesday, October 2, 13
  2. Agenda
      1. Why Graphs, Why Now?
      2. What Is A Graph, Anyway?
      3. Graphs In The Real World
      4. The Graph Landscape
          i) Popular Graph Models
          ii) Graph Databases
          iii) Graph Compute Engines
  3. Why Graphs, Why Now?
  4. Graph Buzz
  5. Graph Buzz: “Graph analysis is the true killer app for Big Data.” - Forrester Research, Dec 2011
  6. Graph Buzz: “[I]t is arguable that graph databases will have a bigger impact on the database landscape than Hadoop or its competitors.” - Bloor Research, May 2012
  7. Graph Buzz. Ref: Copy of Gartner slide:
  8. Graph Buzz
  9. Evolution of Web Search: Survival of the Fittest
      • Pre-1999: WWW Indexing (Discrete Data)
      • 1999 - 2012: Google Invents PageRank (Connected Data, Simple)
      • 2012-?: Google Knowledge Graph, Facebook Graph Search (Connected Data, Rich)
  10. Evolution of Online Recruiting: Survival of the Fittest
      • 1999: Keyword Search (Discrete Data)
      • 2011-12: Social Discovery (Connected Data)
  11. What Is A Graph, Anyway?
  12. [no text on slide]
  13. Cypher query language example
      // Assumes Person nodes carry a name property
      MATCH (philip:Person)-[:IS_FRIEND_OF]->(friend),
            (friend)-[:LIKES]->(restaurant),
            (restaurant)-[:LOCATED_IN]->(newyork:Location),
            (restaurant)-[:SERVES]->(sushi:Cuisine)
      WHERE philip.name = 'Philip'
        AND newyork.location = 'New York'
        AND sushi.cuisine = 'Sushi'
      RETURN *
  14. [no text on slide]
  15. What drugs will bind to protein X and not interact with drug Y? Of course.. a graph is a graph is a graph
  16. Network Management Example
  17. Practical Cypher: Network Management - Create
      CREATE
        (crm {name:"CRM"}),
        (dbvm {name:"Database VM"}),
        (www {name:"Public Website"}),
        (wwwvm {name:"Webserver VM"}),
        (srv1 {name:"Server 1"}),
        (san {name:"SAN"}),
        (srv2 {name:"Server 2"}),
        (crm)-[:DEPENDS_ON]->(dbvm),
        (dbvm)-[:DEPENDS_ON]->(srv2),
        (srv2)-[:DEPENDS_ON]->(san),
        (www)-[:DEPENDS_ON]->(dbvm),
        (www)-[:DEPENDS_ON]->(wwwvm),
        (wwwvm)-[:DEPENDS_ON]->(srv1),
        (srv1)-[:DEPENDS_ON]->(san)
  18. Practical Cypher: Network Management - Impact Analysis
      // Server 1 Outage
      MATCH (n)<-[:DEPENDS_ON*]-(upstream)
      WHERE n.name = "Server 1"
      RETURN upstream
      Result (upstream): {name:"Webserver VM"}, {name:"Public Website"}
  19. Practical Cypher: Network Management - Dependency Analysis
      // Public website dependencies
      MATCH (n)-[:DEPENDS_ON*]->(downstream)
      WHERE n.name = "Public Website"
      RETURN downstream
      Result (downstream): {name:"Database VM"}, {name:"Server 2"}, {name:"SAN"}, {name:"Webserver VM"}, {name:"Server 1"}
  20. Practical Cypher: Network Management - Statistics
      // Most depended on component
      MATCH (n)<-[:DEPENDS_ON*]-(dependent)
      RETURN n, count(DISTINCT dependent) AS dependents
      ORDER BY dependents DESC
      LIMIT 1
      Result: n = {name:"SAN"}, dependents = 6
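
The three network management queries above boil down to transitive traversals over the DEPENDS_ON relationships. As a rough illustration of what the variable-length pattern `[:DEPENDS_ON*]` asks for, here is a minimal Python sketch. It is not how Neo4j executes Cypher; the helper names `build_adjacency` and `reachable` are made up for this example, and only the edge list is taken from the CREATE slide.

```python
from collections import defaultdict, deque

# Edges copied from the CREATE slide: (component, thing it depends on).
DEPENDS_ON = [
    ("CRM", "Database VM"),
    ("Database VM", "Server 2"),
    ("Server 2", "SAN"),
    ("Public Website", "Database VM"),
    ("Public Website", "Webserver VM"),
    ("Webserver VM", "Server 1"),
    ("Server 1", "SAN"),
]


def build_adjacency(edges):
    """Return (outgoing, incoming) adjacency maps for the dependency edges."""
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for source, target in edges:
        outgoing[source].add(target)
        incoming[target].add(source)
    return outgoing, incoming


def reachable(start, adjacency):
    """Breadth-first search: every node one or more hops away from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


outgoing, incoming = build_adjacency(DEPENDS_ON)

# Impact analysis: which components are hit by a Server 1 outage?
impacted = reachable("Server 1", incoming)

# Dependency analysis: what does the public website rely on?
dependencies = reachable("Public Website", outgoing)

# Statistics: component with the most distinct transitive dependents.
nodes = set(outgoing) | set(incoming)
most_depended_on = max(nodes, key=lambda name: len(reachable(name, incoming)))
```

With the slide's data, `impacted`, `dependencies` and `most_depended_on` line up with the result tables shown on slides 18 to 20.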
  21. Graphs In The Real World
  22. Early Adopter Segments (What we expected to happen - view from several years ago)
      Core Industries & Use Cases:
      • Industries: Web / ISV; Finance & Insurance; Telecommunications
      • Use cases: Network & Data Center Management; MDM; Social; Geo
  23. Neo4j Adoption Snapshot: Select Commercial Customers* Across Anticipated Segments
      Core Industries & Use Cases (anticipated):
      • Industries: Web / ISV; Finance & Insurance; Telecommunications
      • Use cases: Network & Data Center Management; MDM; Social; Geo
      Core Industries & Use Cases (snapshot):
      • Industries: Software; Financial Services; Telecommunications; Health Care & Life Sciences; Web Social, HR & Recruiting; Media & Publishing; Energy, Services, Automotive, Gov’t, Logistics, Education, Gaming, Other
      • Use cases: Network & Data Center Management; MDM / System of Record; Social; Geo; Recommendations; Identity & Access Mgmt; Content Management; BI, CRM, Impact Analysis, Fraud Detection, Resource Optimization, etc.
      Other labels on slide: Accenture, Finance, Energy, Aerospace
  24. 5 Graphs of Telco
      • Network Graph (e.g. Network Dependency Analysis, Network Inventory, etc.)
      • Social Graph (mobile apps, social recommendations, collaboration)
      • Call Graph (creating inferred social graph, churn reduction, etc.)
      • Master Data Graph (org & product hierarchy, data governance, IAM)
      • Help Desk Graph (enterprise collaboration)
  25. 5 Graphs of Finance
      • Payment Graph (e.g. Fraud Detection, Credit Risk Analysis, Chargebacks...)
      • Customer Graph (org drillthru, product recommendations, mobile payments, etc.)
      • Entitlement Graph (identity & access management, authorization)
      • Portfolio Graph (portfolio analytics, risk analysis, trading, compliance)
      • Master Data Graph (enterprise collaboration, corporate hierarchy, data governance)
  26. 5 Graphs of Health Care
      • Provider Graph (e.g. referrals, patient management, research)
      • Patient Graph (support communities, doctor recommendations, clinical trials)
      • Bioinformatic Graph (drug research, genetic screening, plant engineering, etc.)
      • Master Data Graph (biological master data, evolutionary taxonomy, etc.)
      • Treatment Graph (collaborative medicine, clinical trials, etc.)
  27. Accenture. Industry: Logistics. Use case: Parcel Routing
      Background
      • One of the world’s largest logistics carriers
      • Projected to outgrow capacity of old system
      • New parcel routing system
      • Single source of truth for entire network
      • B2C & B2B parcel tracking
      • Real-time routing: up to 5M parcels per day
      Business problem
      • 24x7 availability, year round
      • Peak loads of 2500+ parcels per second
      • Complex and diverse software stack
      • Need predictable performance & linear scalability
      • Daily changes to logistics network: route from any point, to any point
      Solution & Benefits
      • Neo4j provides the ideal domain fit: a logistics network is a graph
      • Extreme availability & performance with Neo4j clustering
      • Hugely simplified queries, vs. relational for complex routing
      • Flexible data model can reflect real-world data variance much better than relational
      • “Whiteboard friendly” model easy to understand
  28. Industry: Online Job Search. Use case: Social / Recommendations. Sausalito, CA
      Background
      • Online jobs and career community, providing anonymized inside information to job seekers
      Business problem
      • Wanted to leverage known fact that most jobs are found through personal & professional connections
      • Needed to rely on an existing source of social network data. Facebook was the ideal choice.
      • End users needed to get instant gratification
      • Aiming to have the best job search service, in a very competitive market
      Solution & Benefits
      • First-to-market with a product that let users find jobs through their network of Facebook friends
      • Job recommendations served real-time from Neo4j
      • Individual Facebook graphs imported real-time into Neo4j
      • Glassdoor now stores > 50% of the entire Facebook social graph
      • Neo4j cluster has grown seamlessly, with new instances being brought online as graph size and load have increased
      [Diagram: Person and Company nodes connected by KNOWS and WORKS_AT relationships]
  29. The Graph Landscape
  30. Overview of Popular Graph Data Models
      • Property Graph
          • Description: A “directed, labeled, attributed, multi-graph” [1] which exposes three building blocks: nodes, typed relationships and key-value properties on both nodes and relationships
          • Vendors: Neo4j, OrientDB, InfiniteGraph, Dex
      • RDF Triples
          • Description: URI-centered subject-predicate-object triples as pioneered by the semantic web movement [2]
          • Vendors: AllegroGraph, Sesame
      • HyperGraph
          • Description: A generalized graph where a relationship can connect an arbitrary number of nodes (compared to the more common binary graph models) [3]
          • Vendors: HyperGraphDB, TrinityDB
      [1] Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” 2010
      [2] W3C, “The Resource Description Framework (RDF),” 2004
      [3] Wikipedia
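
To make the three data models on slide 30 a little more concrete, here is a minimal Python sketch of how each might be represented in memory. Everything in it is illustrative: the class names, the `http://example.org/` URIs, the restaurant name and the property values are assumptions made for this example, and none of it reflects how any of the listed vendors store data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Property graph node: a bag of key-value properties."""
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """Property graph relationship: typed, directed, with its own properties."""
    rel_type: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


@dataclass
class HyperEdge:
    """Hypergraph relationship: connects any number of nodes at once."""
    rel_type: str
    members: tuple


# Property graph: two relationships between the same pair of nodes are allowed
# (multi-graph), and each relationship carries its own properties.
philip = Node({"name": "Philip"})
restaurant = Node({"name": "Example Sushi"})  # hypothetical restaurant
likes = Relationship("LIKES", philip, restaurant, {"since": 2013})
reviewed = Relationship("REVIEWED", philip, restaurant, {"stars": 4})
relationships = [likes, reviewed]

# RDF: everything is a (subject, predicate, object) triple keyed by URIs.
# A plain triple has only three slots, so edge attributes such as "since"
# need extra modelling and are left out here.
EX = "http://example.org/"  # placeholder namespace
triples = [
    (EX + "philip", EX + "name", "Philip"),
    (EX + "philip", EX + "likes", EX + "example-sushi"),
]

# Hypergraph: one relationship joining three nodes.
anna = Node({"name": "Anna"})
dinner = HyperEdge("DINED_TOGETHER", (philip, anna, restaurant))
```

The property graph keeps attributes on the relationship itself, which is the point a commenter above raises when comparing it with RDF stores.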
  31. Graph Ecosystem @ 10k Feet
      1. Graph Databases
      2. Graph Compute Engines
  32. 1. What is a Graph Database
      A graph database is an online (“real-time”) database management system with CRUD methods that expose a graph data model [1]
      • Two important properties:
          • Native graph processing, including index-free adjacency to facilitate traversals
          • Native graph storage engine, i.e. written from the ground up to be optimized for managing graph data
      [1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013. p. 5. ISBN-10: 1449356265
  33. Graph Local Queries: Sweet Spot for Graph Databases
      e.g. Recommendations, Friend-of-Friend, Shortest Path
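
Index-free adjacency and graph local queries can be pictured with a small Python sketch. It assumes the simplest possible reading of the term: each node holds direct references to its neighbours, so a hop follows a pointer instead of consulting a global index. The `Person` class and the sample names are hypothetical, and real graph databases implement this at the storage level rather than with Python objects.

```python
class Person:
    """Node that keeps direct references to its neighbours (no lookup index)."""

    def __init__(self, name):
        self.name = name
        self.friends = []  # direct object references to adjacent nodes

    def befriend(self, other):
        """Create an undirected friendship by linking both nodes to each other."""
        self.friends.append(other)
        other.friends.append(self)


def friends_of_friends(person):
    """Names exactly two hops away that are not already direct friends."""
    direct = {id(friend) for friend in person.friends}
    found = set()
    for friend in person.friends:          # first hop: follow references
        for candidate in friend.friends:   # second hop: follow references again
            if candidate is not person and id(candidate) not in direct:
                found.add(candidate.name)
    return sorted(found)


alice, bob, carol, dave = (Person(n) for n in ("Alice", "Bob", "Carol", "Dave"))
alice.befriend(bob)
bob.befriend(carol)
carol.befriend(dave)

fof_alice = friends_of_friends(alice)  # starts at Alice, touches only her neighbourhood
fof_bob = friends_of_friends(bob)
```

The query only visits nodes near its starting point, so in this sketch the work depends on the size of the neighbourhood rather than on the size of the whole graph.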
  34. The Emerging Graph Database Space
      [Chart: products plotted by Graph Storage and Graph Processing, each ranging from Non-Native to Native; visible labels include FlockDB and AllegroGraph]
  35. 2. What is a Graph Compute Engine
      Processing platforms that enable graph global computational algorithms to be run against large data sets
      [Diagram labels: System(s) of Record; Data extraction, transformation, and load; Graph Compute Engine (Working Storage, In-Memory Processing)]
  36. Graph Global Queries: Sweet Spot for Graph Compute Engines
      e.g. How many restaurants, on average, has each person liked?
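
In contrast to a graph local query, the question on slide 36 has no starting node: answering it means touching every person and every LIKES relationship. A minimal Python sketch, using made-up people and restaurants and a hypothetical `average_likes` helper, might look like this:

```python
from statistics import mean

# Hypothetical sample data: every person in the graph and every LIKES edge.
PEOPLE = ["Philip", "Anna", "Omar"]
LIKES = [
    ("Philip", "Sushi Place"),
    ("Philip", "Noodle Bar"),
    ("Anna", "Sushi Place"),
]


def average_likes(people, likes):
    """Scan the whole edge list and average distinct liked restaurants per person."""
    counts = {person: 0 for person in people}  # people with no likes still count
    for person, _restaurant in set(likes):     # de-duplicate repeated edges
        counts[person] = counts.get(person, 0) + 1
    return mean(counts.values())


average = average_likes(PEOPLE, LIKES)
```

Because the whole data set is scanned, this kind of computation is the workload the deck assigns to graph compute engines rather than to an online graph database.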
  37. Graph Compute Engines
      Largely fall into one of two patterns:
      • In-Memory / Single Machine
      • Distributed - most common of which is the “Bulk Synchronous Parallel Model” (aka Pregel clone)
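
The Bulk Synchronous Parallel pattern mentioned on slide 37 can be sketched in a few lines of Python. This is a single-process toy, not the API of Giraph or any other engine: the function name `bsp_min_label` and the sample graph are assumptions. It shows the general shape of the model, where each vertex reads the messages sent to it in the previous superstep, updates its own value, and sends messages that are only delivered after a global barrier.

```python
def bsp_min_label(vertices, edges):
    """Toy BSP run: label every vertex with the smallest id in its component."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    value = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own id to all of its neighbours.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbours[v]:
            inbox[n].append(value[v])

    # Later supersteps run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v in vertices:  # compute phase: conceptually parallel per vertex
            if not inbox[v]:
                continue  # no messages: the vertex stays idle this superstep
            smallest = min(inbox[v])
            if smallest < value[v]:
                value[v] = smallest
                for n in neighbours[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: messages become visible in the next superstep
    return value


labels = bsp_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

In a distributed engine the vertices would be partitioned across machines and the barrier would be a cluster-wide synchronization point; here both are simulated with plain dictionaries.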
  38. Graph Compute Engine: Distributed Computing Architecture - Examples
      • Apache project based on Hadoop; Bulk Synchronous Processing Model (Pregel Clone); Released in 2012
      • OSS Project developed out of CMU; Based on Hadoop & Map/Reduce; Includes key algos for graph global pattern matching & visualization
      • OSS Project; Distributes relationships vs. nodes; Developed at CMU with funding from DARPA, Intel, et al. & VC
  39. Graph Compute Engine: In-Memory Single-Machine Examples
      • Cassovary
          • OSS Project led by Twitter
          • Used by Twitter for large-scale graph mining (uses daily export from FlockDB system of record)
          • “Not designed for persistence or database functionality”.
      • YarcData uRiKA
          • Graph compute appliance launched by Cray in Feb 2012
          • Built to discover unforeseen relationships in the graph
      • GraphChi
          • GraphLab Spinoff
          • Similar order-of-magnitude performance as GraphLab on a Mac Mini
  40. Example Graph Database Deployment
      [Diagram labels: Application; Other Databases; ETL; Graph Database Cluster (Data Storage & Business Rules Execution); Reporting (Graph Dashboards & Ad-hoc Analysis); Graph Visualization; End User (Ad-hoc visual navigation & discovery); Bulk Analytic Infrastructure, e.g. Graph Compute Engine (Graph Mining & Aggregation); ETL; Data Scientist (Ad-Hoc Analysis)]
  41. DEAR DATA SCIENTIST: TAKE THE RED PILL. JOIN THE GRAPH. WE ARE HIRING.
  42. teh end (sic). stay connected