Recent years have seen an explosion of technologies for managing, processing and analyzing graphs, ranging from community projects like Apache Giraph, to vendor-led products such as Neo4j, to spin-outs from established companies like Twitter’s FlockDB. The sheer number of technologies makes it difficult to keep track of these tools and what sets them apart, even for those of us who are active in the space!
But not all graph technologies are created equal. This session will provide a high-level framework for making sense of the emerging graph landscape. It will describe the three dominant graph data models today, define top-level categories like graph compute engines (GraphLab, Giraph, Pegasus, YarcData, etc.) and graph databases (Neo4j, FlockDB, OrientDB, etc.), and discuss common characteristics and important properties of each category.
1. Neo Technology, Inc Confidential
An Overview Of The
Emerging Graph Landscape
DataWeek Oct 2, 2013
Emil Eifrem
emil@neotechnology.com
@emileifrem
#neo4j
1Wednesday, October 2, 13
2. Neo Technology, Inc Confidential
Agenda
1. Why Graphs, Why Now?
2. What Is A Graph, Anyway?
3. Graphs In The Real World
4. The Graph Landscape
i) Popular Graph Models
ii) Graph Databases
iii) Graph Compute Engines
5. Neo Technology, Inc Confidential
“Graph analysis is the true killer app for Big Data.”
- Forrester Research, Dec 2011
http://blogs.forrester.com/james_kobielus/11-12-19-the_year_ahead_in_big_data_big_cool_new_stuff_looms_large
Graph Buzz
6. Neo Technology, Inc Confidential
“[I]t is arguable that graph databases will have a
bigger impact on the database landscape than
Hadoop or its competitors.”
- Bloor Research, May 2012
http://www.bloorresearch.com/blog/IM-Blog/2012/5/graph-databases-nosql.html
Graph Buzz
7. Neo Technology, Inc Confidential
Graph Buzz
Ref: http://www.gartner.com/id=2081316
Copy of Gartner slide:
9. Evolution of Web Search
Survival of the Fittest
Pre-1999
WWW Indexing
Discrete Data
1999 - 2012
Google Invents
PageRank
Connected Data
(Simple)
2012-?
Google Knowledge Graph,
Facebook Graph Search
Connected Data
(Rich)
10. Evolution of Online Recruiting
1999
Keyword Search
Discrete Data
Survival of the Fittest
2011-12
Social Discovery
Connected Data
11. Neo Technology, Inc Confidential
What Is A Graph,
Anyway?
13. Neo Technology, Inc Confidential
MATCH (philip:Person)-[:IS_FRIEND_OF]->(friend),
(friend)-[:LIKES]->(restaurant),
(restaurant)-[:LOCATED_IN]->(newyork:Location),
(restaurant)-[:SERVES]->(sushi:Cuisine)
WHERE philip.name = 'Philip' AND newyork.location='New York' AND
sushi.cuisine='Sushi'
RETURN restaurant.name
* Cypher query language example: http://maxdemarzi.com/?s=facebook
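The Cypher pattern above can be read as a chain of hops: from Philip to his friends, from each friend to the restaurants they like, then a check on each restaurant's location and cuisine. A minimal Python sketch of that reading follows. The node ids, restaurant names and relationships in the toy data are hypothetical and exist only to make the sketch runnable; only the labels, relationship types and property keys come from the query.

```python
# Toy in-memory property graph. All data below is hypothetical illustration
# data; only the relationship types and property keys mirror the Cypher query.
nodes = {
    "philip": {"label": "Person", "name": "Philip"},
    "anna": {"label": "Person", "name": "Anna"},
    "ben": {"label": "Person", "name": "Ben"},
    "r1": {"label": "Restaurant", "name": "Sushi Corner"},
    "r2": {"label": "Restaurant", "name": "Pasta House"},
    "r3": {"label": "Restaurant", "name": "Tokyo Roll"},
    "nyc": {"label": "Location", "location": "New York"},
    "sf": {"label": "Location", "location": "San Francisco"},
    "sushi": {"label": "Cuisine", "cuisine": "Sushi"},
    "italian": {"label": "Cuisine", "cuisine": "Italian"},
}

# Directed, typed relationships as (start, type, end) tuples.
rels = [
    ("philip", "IS_FRIEND_OF", "anna"),
    ("philip", "IS_FRIEND_OF", "ben"),
    ("anna", "LIKES", "r1"),
    ("anna", "LIKES", "r2"),
    ("ben", "LIKES", "r3"),
    ("r1", "LOCATED_IN", "nyc"), ("r1", "SERVES", "sushi"),
    ("r2", "LOCATED_IN", "nyc"), ("r2", "SERVES", "italian"),
    ("r3", "LOCATED_IN", "sf"), ("r3", "SERVES", "sushi"),
]


def out(node_id, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [end for start, rtype, end in rels
            if start == node_id and rtype == rel_type]


def restaurants_liked_by_friends(person_name, city, cuisine):
    """Mirror the MATCH / WHERE / RETURN structure of the query, hop by hop."""
    results = []
    for node_id, props in nodes.items():
        if props["label"] != "Person" or props.get("name") != person_name:
            continue
        for friend in out(node_id, "IS_FRIEND_OF"):
            for restaurant in out(friend, "LIKES"):
                in_city = any(nodes[loc].get("location") == city
                              for loc in out(restaurant, "LOCATED_IN"))
                serves = any(nodes[c].get("cuisine") == cuisine
                             for c in out(restaurant, "SERVES"))
                if in_city and serves:
                    results.append(nodes[restaurant]["name"])
    return results
```

Calling `restaurants_liked_by_friends("Philip", "New York", "Sushi")` on this toy data plays the role of the `RETURN restaurant.name` clause.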
15. Neo Technology, Inc Confidential
What drugs will bind to protein X and not interact with drug Y?
Of course... a graph is a graph is a graph
18. Network Management - Impact Analysis
// Server 1 Outage
MATCH (n)<-[:DEPENDS_ON*]-(upstream)
WHERE n.name = "Server 1"
RETURN upstream
Practical Cypher
upstream
{name:"Webserver VM"}
{name:"Public Website"}
19. Network Management - Dependency Analysis
// Public website dependencies
MATCH (n)-[:DEPENDS_ON*]->(downstream)
WHERE n.name = "Public Website"
RETURN downstream
Practical Cypher
downstream
{name:"Database VM"}
{name:"Server 2"}
{name:"SAN"}
{name:"Webserver VM"}
{name:"Server 1"}
20. Network Management - Statistics
// Most depended on component
MATCH (n)<-[:DEPENDS_ON*]-(dependent)
RETURN n, count(DISTINCT dependent) AS dependents
ORDER BY dependents DESC
LIMIT 1
Practical Cypher
n dependents
{name:"SAN"} 6
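The three network management queries above all rely on the variable-length pattern `[:DEPENDS_ON*]`, which follows the relationship transitively. The Python sketch below shows the same idea as a breadth-first traversal. The slides show only the query results, not the underlying graph, so the `DEPENDS_ON` topology here (including the "CRM" component) is an assumption chosen to be consistent with those results.

```python
from collections import deque

# Hypothetical topology: component -> components it directly depends on.
# The talk does not show these edges; they are assumed for illustration.
DEPENDS_ON = {
    "CRM": ["Database VM"],
    "Public Website": ["Database VM", "Webserver VM"],
    "Database VM": ["Server 2"],
    "Webserver VM": ["Server 1"],
    "Server 1": ["SAN"],
    "Server 2": ["SAN"],
    "SAN": [],
}


def reachable(start, neighbours):
    """Breadth-first traversal: every node reachable in one or more hops."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(name):
    """Dependency analysis: everything `name` depends on, transitively."""
    return reachable(name, lambda n: DEPENDS_ON.get(n, []))


def upstream(name):
    """Impact analysis: everything that depends on `name`, transitively."""
    return reachable(
        name, lambda n: [c for c, deps in DEPENDS_ON.items() if n in deps])


def most_depended_on():
    """Statistics: the component with the most distinct dependents."""
    return max(((n, len(upstream(n))) for n in DEPENDS_ON),
               key=lambda pair: pair[1])
```

Note that the slides use "upstream" for the components affected by an outage and "downstream" for the components something relies on; the function names follow that usage.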
21. Neo Technology, Inc Confidential
Graphs In The
Real World
22. Neo Technology, Inc Confidential
Core Industries
& Use Cases:
Web / ISV
Finance &
Insurance
Telecommunications
Network & Data
Center Management
MDM
Social
Geo
Early Adopter Segments
(What we expected to happen - view from several years ago)
23. Neo Technology, Inc Confidential
Core Industries
& Use Cases:
Web / ISV
Finance &
Insurance
Telecommunications
Network & Data
Center Management
MDM
Social
Geo
Select Commercial Customers* Across Anticipated Segments
Neo4j Adoption Snapshot
Core Industries
& Use Cases:
Software
Financial
Services
Telecommunications
Health Care &
Life Sciences
Web Social,
HR & Recruiting
Media &
Publishing
Energy, Services,
Automotive, Gov’t,
Logistics, Education,
Gaming, Other
Network & Data
Center
Management
MDM / System of
Record
Social
Geo
Recommendations
Identity &
Access Mgmt
Content
Management
BI, CRM, Impact
Analysis, Fraud
Detection, Resource
Optimization, etc.
Accenture
Finance
Energy Aerospace
24. Neo Technology, Inc Confidential
• Network Graph
(e.g. Network Dependency Analysis, Network Inventory, etc.)
• Social Graph
(mobile apps, social recommendations, collaboration)
• Call Graph
(creating inferred social graph, churn reduction, etc.)
• Master Data Graph
(org & product hierarchy, data governance, IAM)
• Help Desk Graph
(enterprise collaboration)
5 Graphs of Telco
26. Neo Technology, Inc Confidential
• Provider Graph
(e.g. referrals, patient management, research)
• Patient Graph
(support communities, doctor recommendations, clinical trials)
• Bioinformatic Graph
(drug research, genetic screening, plant engineering, etc.)
• Master Data Graph
(biological master data, evolutionary taxonomy, etc.)
• Treatment Graph
(collaborative medicine, clinical trials, etc.)
5 Graphs of Health Care
27. Accenture
Background
•One of the world’s largest logistics carriers
•Projected to outgrow capacity of old system
•New parcel routing system
•Single source of truth for entire network
•B2C & B2B parcel tracking
•Real-time routing: up to 5M parcels per day
Business problem
•24x7 availability, year round
•Peak loads of 2500+ parcels per second
•Complex and diverse software stack
•Need predictable performance & linear
scalability
•Daily changes to logistics network: route from
any point, to any point
Solution & Benefits
•Neo4j provides the ideal domain fit:
•a logistics network is a graph
•Extreme availability & performance with Neo4j
clustering
•Hugely simplified queries, vs. relational for
complex routing
•Flexible data model can reflect real-world data
variance much better than relational
•“Whiteboard friendly” model easy to understand
Industry: Logistics
Use case: Parcel Routing
28. Neo Technology, Inc Confidential
Industry: Online Job Search
Use case: Social / Recommendations
• Online jobs and career community, providing
anonymized inside information to job seekers
Business problem
• Wanted to leverage known fact that most jobs are
found through personal & professional connections
• Needed to rely on an existing source of social
network data. Facebook was the ideal choice.
• End users needed to get instant gratification
• Aiming to have the best job search service, in a very
competitive market
Solution & Benefits
• First-to-market with a product that let users find jobs
through their network of Facebook friends
• Job recommendations served real-time from Neo4j
• Individual Facebook graphs imported real-time into Neo4j
• Glassdoor now stores > 50% of the entire Facebook
social graph
• Neo4j cluster has grown seamlessly, with new instances
being brought online as graph size and load have increased
[Diagram: Person nodes connected to each other by KNOWS relationships, and Person nodes connected to Company nodes by WORKS_AT relationships]
Background
Sausalito, CA
29. Neo Technology, Inc Confidential
The Graph Landscape
30. Neo Technology, Inc Confidential
Overview of Popular
Graph Data Models
• Property Graph
• Description: A “directed, labeled, attributed, multi-
graph”1 which exposes three building blocks: nodes, typed
relationships and key-value properties on both nodes and
relationships
• Vendors: Neo4j, OrientDB, InfiniteGraph, Dex
• RDF Triples
• Description: URI-centered subject-predicate-object
triples as pioneered by the semantic web movement2
• Vendors: AllegroGraph, Sesame
• HyperGraph
• Description: A generalized graph where a
relationship can connect an arbitrary number of nodes
(compared to the more common binary graph models)3
• Vendors: HyperGraphDB, TrinityDB
1] Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” 2010, http://arxiv.org/abs/1006.2361
2] W3C, “The Resource Description Framework (RDF),” 2004, http://www.w3.org/RDF/
3] Wikipedia, http://en.wikipedia.org/wiki/Hypergraph
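To make the three data models above concrete, here is a minimal Python sketch of how the same kind of information might be held in each. It is an illustration under stated assumptions, not how any of the listed vendors store data: the class names, the `example.org` URIs and the sample people are all hypothetical.

```python
from dataclasses import dataclass, field


# Property graph: nodes, typed relationships, and key-value properties on both.
@dataclass
class Node:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int          # id of the start node (relationships are directed)
    type: str           # relationship type, e.g. "KNOWS"
    end: int            # id of the end node
    properties: dict = field(default_factory=dict)


alice = Node(1, {"name": "Alice"})
bob = Node(2, {"name": "Bob"})
knows = Relationship(alice.id, "KNOWS", bob.id, {"since": 2010})

# RDF triples: every fact is a (subject, predicate, object) statement,
# with URIs identifying resources. The URIs below are made up.
triples = {
    ("http://example.org/alice", "http://example.org/knows",
     "http://example.org/bob"),
    ("http://example.org/alice", "http://example.org/name", "Alice"),
}


# Hypergraph: a single hyperedge may connect any number of nodes.
@dataclass(frozen=True)
class HyperEdge:
    type: str
    members: frozenset


meeting = HyperEdge("ATTENDED_MEETING", frozenset({"alice", "bob", "carol"}))
```

The contrast to notice is where the detail lives: on the relationship itself in the property graph, in additional statements in the triple model, and in the membership set of a hyperedge in the hypergraph.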
32. Neo Technology, Inc Confidential
1. What is a
Graph Database
A graph database is an online (“real-time”)
database management system with CRUD
methods that expose a graph data model1
• Two important properties:
• Native graph processing, including
index-free adjacency to facilitate traversals
• Native graph storage engine, i.e.
written from the ground up to be
optimized for managing graph data
1] Robinson, Webber, Eifrem. Graph Databases. O’Reilly, 2013. p. 5. ISBN-10: 1449356265
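Index-free adjacency, mentioned above, means that each node holds direct references to its neighbours, so a hop in a traversal does not go through a global lookup structure. The Python sketch below contrasts the two approaches in a deliberately simplified way. The class and function names are hypothetical, and the edge table is scanned linearly here only for brevity; a real non-native store would typically use an index lookup instead.

```python
class Node:
    """A node that stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []  # references to neighbouring Node objects

    def connect(self, other):
        self.out.append(other)


def neighbours_native(node):
    """Index-free adjacency: follow the references held by the node itself."""
    return [n.name for n in node.out]


def neighbours_via_edge_table(name, edge_table):
    """Non-native alternative: consult a global structure for every hop."""
    return [dst for src, dst in edge_table if src == name]


# The same three-node graph in both representations (illustrative data).
a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
a.connect(c)
b.connect(c)

EDGE_TABLE = [("a", "b"), ("a", "c"), ("b", "c")]
```

Both functions return the same neighbours; the difference is that the cost of the first depends only on the node being visited, while the second depends on a structure that grows with the whole graph.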
33. Neo Technology, Inc Confidential
Graph Local Queries
e.g. Recommendations, Friend-of-Friend, Shortest Path
Sweet Spot for Graph Databases
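Graph local queries start from one or a few known nodes and explore only their neighbourhood. The Python sketch below shows two of the examples named on this slide, friend-of-friend and shortest path, over a small undirected friendship graph. The people and friendships are hypothetical, and breadth-first search is one common way to find a shortest path in an unweighted graph, not necessarily what any particular graph database does internally.

```python
from collections import deque

# Hypothetical undirected friendship graph, stored as adjacency lists.
FRIENDS = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dan"],
    "Cat": ["Ann"],
    "Dan": ["Bob", "Eve"],
    "Eve": ["Dan"],
    "Fay": [],
}


def friends_of_friends(person):
    """People exactly two hops away who are not already direct friends."""
    direct = set(FRIENDS.get(person, ()))
    two_hops = set()
    for friend in direct:
        two_hops.update(FRIENDS.get(friend, ()))
    return two_hops - direct - {person}


def shortest_path(start, goal):
    """Breadth-first search; returns a list of names, or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in FRIENDS.get(current, ()):
            if nxt in previous:
                continue
            previous[nxt] = current
            if nxt == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

In both functions the work done depends on the neighbourhood that is explored rather than on the total size of the graph, which is what makes these queries "local".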
34. Neo Technology, Inc Confidential
The Emerging
Graph Database Space
Graph Storage
Graph Processing
Non-Native
Native
Native
FlockDB
AllegroGraph
The Graph
Database Space
35. Neo Technology, Inc Confidential
Processing platforms that enable graph global
computational algorithms to be run against
large data sets
Graph
Compute
Engine
(Working Storage)
In-Memory Processing
System(s)
of Record
Graph Compute
Engine
Data extraction,
transformation,
and load
2. What is a Graph
Compute Engine
36. Neo Technology, Inc Confidential
How many restaurants, on average, has each person liked?
Graph Global Queries
Sweet Spot for Graph Compute Engines
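A graph global query such as the one above has no starting node: it has to touch every person and every LIKES relationship. A minimal Python sketch of that aggregation follows, assuming a hypothetical list of people and (person, restaurant) pairs held in memory.

```python
# Hypothetical data: every person, and every (person, restaurant) LIKES pair.
PEOPLE = ["Ann", "Bob", "Cat"]
LIKES = {
    ("Ann", "Sushi Corner"),
    ("Ann", "Pasta House"),
    ("Bob", "Sushi Corner"),
}


def average_likes_per_person(people, likes):
    """Scan the whole graph: count liked restaurants per person, then average."""
    counts = {person: 0 for person in people}
    for person, _restaurant in likes:
        if person in counts:
            counts[person] += 1
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)
```

People with no likes still count toward the average, which is why the sketch starts from the full list of people rather than from the LIKES pairs alone.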
37. Neo Technology, Inc Confidential
Graph Compute Engines
• In-Memory / Single Machine
• Distributed - most common of which is the
“Bulk Synchronous Parallel Model” (aka
Pregel clone)
Largely fall into one of two patterns:
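In the Bulk Synchronous Parallel model, computation proceeds in supersteps: in each one, every active vertex reads the messages sent to it in the previous superstep, updates its own state, and sends messages to its neighbours; a barrier separates one superstep from the next. The single-process Python sketch below imitates that loop with the maximum-value propagation example commonly used to explain Pregel. It is an illustration of the model only, not the API of Giraph or any other engine, and the function name and sample graph are hypothetical.

```python
def bsp_max_value(edges, initial_values):
    """Propagate the largest vertex value through the graph, BSP style.

    edges: vertex -> list of out-neighbours
    initial_values: vertex -> starting numeric value
    """
    values = dict(initial_values)

    # Superstep 0: every vertex sends its own value to its out-neighbours.
    inbox = {v: [] for v in values}
    for vertex, value in values.items():
        for neighbour in edges.get(vertex, ()):
            inbox[neighbour].append(value)

    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in edges.get(vertex, ()):
                    outbox[neighbour].append(best)
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox
    return values


# Illustrative graph: a chain a <-> b <-> c <-> d.
EDGES = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
VALUES = {"a": 3, "b": 6, "c": 2, "d": 1}
```

The loop ends when a superstep produces no messages, which corresponds to every vertex having voted to halt. In a distributed engine the vertices would be partitioned across machines and the barrier would be a real synchronization point.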
38. Neo Technology, Inc Confidential
Distributed Computing Architecture - Examples
Graph Compute Engine
• Apache project based on
Hadoop
• Bulk Synchronous
Parallel Model
(Pregel Clone)
• Released in 2012
• OSS Project developed out of CMU
• Based on Hadoop & Map/Reduce
• Includes key algos for graph global
pattern matching & visualization
• OSS Project
• Distributes relationships vs. nodes
• Developed at CMU with funding
from DARPA, Intel, et al. & VC
39. Neo Technology, Inc Confidential
Cassovary
• OSS Project led by Twitter
• Used by Twitter for large-
scale graph mining (uses
daily export from FlockDB
system of record)
• “Not designed for
persistence or database
functionality”.
YarcData uRiKA
• Graph compute appliance
launched by Cray in Feb 2012
• Built to discover unforeseen
relationships in the graph
In-Memory Single-Machine Examples
Graph Compute Engine
GraphChi
• GraphLab Spinoff
• Similar order-of-magnitude
performance as GraphLab on
a Mac Mini
40. Neo Technology, Inc Confidential
Example Graph Database Deployment
Application
Other
Databases
ETL
Graph
Database
Cluster
Data Storage &
Business Rules Execution
Reporting
Graph-
Dashboards
&
Ad-hoc
Analysis
Graph
Visualization
End User Ad-hoc visual navigation &
discovery
Bulk Analytic
Infrastructure
(e.g. Graph Compute
Engine)
ETL
Graph Mining &
Aggregation
Data Scientist
Ad-Hoc
Analysis
41. Neo Technology, Inc Confidential
DEAR DATA SCIENTIST: TAKE THE RED PILL
JOIN THE GRAPH. WE ARE HIRING.
42. Neo Technology, Inc Confidential
teh end (sic)
stay connected