OVERVIEW
Why it makes sense to know about graph databases
"Graph databases will come into vogue. One key gap in the Hadoop
ecosystem is for graph databases, which support rich mining and visualization of
relationships, influence, and behavioral propensities. The market for graph
databases will boom in 2012 as companies everywhere adopt them for social
media analytics, marketing campaign optimization, and customer experience
fine-tuning. We will see VCs put big money behind graph database and analytics
startups. Many big data platform and tool vendors will acquire the startups to
supplement their expanding Hadoop, NoSQL, and enterprise data warehousing
(EDW) portfolios. Social graph analysis, although not a brand-new field, will
become one of the most prestigious specialties in the data science arena,
focusing on high-powered drilldown into polystructured behavioral data sets."
Source: http://blogs.forrester.com/james_kobielus/11-12-19-the_year_ahead_in_big_data_big_cool_new_stuff_looms_large
OVERVIEW
Example of a real-world graph - Facebook
Source: http://www.facebook.com/press/info.php?statistics
OVERVIEW
Example of a real-world graph - NYT "Cascade"
Source: http://nytlabs.com/projects/cascade.html
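OVERVIEW
Differentiation from RDBMS - property graph
RDBMS:
Person
Id  Name
0   Henning Rauch
1   René Peinl
2   Foo Bar
3   Bruce Schneier
4   Linus Torvalds
Knows_rel
Id_1  Id_2
1     0
1     2
1     3
1     4
0     1
0     2
0     3
0     4
3     4
4     3
Tag
Id  Name
0   .NET
1   Java
2   PKI
3   NoSQL
Tags_rel
Tag_Id  Person_Id  Significance
0       0          5
1       1          5
2       1          6
2       3          10
3       0          7
3       1          7
GraphDB:
[Figure: the same data drawn as a property graph. The persons 0-4 and the tags .NET, Java, PKI and NoSQL are vertices; the Knows_rel rows become edges between persons, and the Tags_rel rows become edges between tags and persons that carry the significance (5, 5, 6, 10, 7, 7) as a property.]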
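The step from relation tables to a property graph can be sketched in a few lines of Python. This is only an illustrative model, not the data structure of any particular product: the class names `Vertex` and `Edge`, the helper `connect`, the edge labels "knows" and "tags", and the direction of the tag edges (tag to person, following the column order of Tags_rel) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)
class Vertex:
    """A vertex carries its properties and references to its own edges."""
    properties: Dict[str, Any]
    out_edges: List["Edge"] = field(default_factory=list)
    in_edges: List["Edge"] = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    """An edge has a label, two end points and properties of its own."""
    label: str
    source: Vertex
    target: Vertex
    properties: Dict[str, Any] = field(default_factory=dict)


def connect(source, target, label, **properties):
    """Create an edge and register it on both of its vertices."""
    edge = Edge(label, source, target, properties)
    source.out_edges.append(edge)
    target.in_edges.append(edge)
    return edge


# Person and Tag tables become vertices (list index = Id column).
persons = [Vertex({"name": name}) for name in (
    "Henning Rauch", "René Peinl", "Foo Bar", "Bruce Schneier", "Linus Torvalds")]
tags = [Vertex({"name": name}) for name in (".NET", "Java", "PKI", "NoSQL")]

# Knows_rel rows become "knows" edges (assumed direction: Id_1 knows Id_2).
for id_1, id_2 in [(1, 0), (1, 2), (1, 3), (1, 4), (0, 1),
                   (0, 2), (0, 3), (0, 4), (3, 4), (4, 3)]:
    connect(persons[id_1], persons[id_2], "knows")

# Tags_rel rows become edges whose significance is an edge property.
for tag_id, person_id, significance in [(0, 0, 5), (1, 1, 5), (2, 1, 6),
                                        (2, 3, 10), (3, 0, 7), (3, 1, 7)]:
    connect(tags[tag_id], persons[person_id], "tags", significance=significance)


def known_by(person):
    """Names of the persons reachable over outgoing 'knows' edges."""
    return [e.target.properties["name"]
            for e in person.out_edges if e.label == "knows"]


def tags_of(person):
    """Tag name -> significance for all tags attached to the person."""
    return {e.source.properties["name"]: e.properties["significance"]
            for e in person.in_edges if e.label == "tags"}
```

With this model, `tags_of(persons[1])` reads only the edges stored on that one vertex; no join over Tags_rel is needed.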
OVERVIEW
Differentiation from RDBMS - scalability
• Relation tables act as a global index over the linked data
• The bigger the relation table, the longer it takes to get the interesting information (e.g. the local neighbourhood of a piece of data)
• Solution of graph databases: information on relationships (aka edges) is stored locally on the vertex
[Figure: the Knows_rel (Id_1, Id_2) and Tags_rel (Tag_Id, Person_Id, Significance) relation tables of the property-graph example]
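The effect of storing edges on the vertex can be made visible by counting how much data is touched when a two-hop neighbourhood is collected. The following sketch uses the Knows_rel rows of the example; the function names and the counting scheme are assumptions for illustration, and the table variant deliberately uses no index.

```python
from collections import defaultdict

KNOWS_REL = [(1, 0), (1, 2), (1, 3), (1, 4), (0, 1),
             (0, 2), (0, 3), (0, 4), (3, 4), (4, 3)]


def two_hop_via_table(table, start):
    """Relational style without index: every hop scans the whole table."""
    rows_read = 0

    def scan(vertex_id):
        nonlocal rows_read
        found = []
        for id_1, id_2 in table:
            rows_read += 1
            if id_1 == vertex_id:
                found.append(id_2)
        return found

    first_hop = scan(start)
    reached = set(first_hop)
    for neighbour in first_hop:
        reached.update(scan(neighbour))
    reached.discard(start)
    return reached, rows_read


def build_vertex_local(table):
    """Graph style: each vertex keeps the list of its own outgoing edges."""
    edges_on_vertex = defaultdict(list)
    for id_1, id_2 in table:
        edges_on_vertex[id_1].append(id_2)
    return edges_on_vertex


def two_hop_via_vertices(edges_on_vertex, start):
    """Only the edges of the visited vertices are read."""
    first_hop = edges_on_vertex.get(start, [])
    edges_read = len(first_hop)
    reached = set(first_hop)
    for neighbour in first_hop:
        second_hop = edges_on_vertex.get(neighbour, [])
        edges_read += len(second_hop)
        reached.update(second_hop)
    reached.discard(start)
    return reached, edges_read


LOCAL = build_vertex_local(KNOWS_REL)
print(two_hop_via_table(KNOWS_REL, 3))    # ({4}, 20)
print(two_hop_via_vertices(LOCAL, 3))     # ({4}, 2)
```

The table variant reads a multiple of the table size regardless of how small the neighbourhood is, whereas the vertex-local variant reads only as many edges as the neighbourhood actually contains.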
OVERVIEW
Differentiation from RDBMS - example of complexity
• Task: find the persons that person 0 knows.
• Linear table scan: O(n)
• Index scan: O(log n)
• Because both costs depend on the table size n, RDBMS do not perform well on recursive search algorithms
• Graph databases solve this task in O(1) with respect to n: the edges are stored on vertex 0 itself, so only its own edges are read
[Figure: the Knows_rel table (Id_1, Id_2) of the property-graph example]
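The three cost classes named above can be illustrated with plain Python data structures. This is a sketch under stated assumptions: a sorted list with binary search stands in for a database index, a dictionary stands in for vertex-local edge storage, and Id_1 is read as "the person who knows".

```python
import bisect
from collections import defaultdict

KNOWS_REL = [(1, 0), (1, 2), (1, 3), (1, 4), (0, 1),
             (0, 2), (0, 3), (0, 4), (3, 4), (4, 3)]


def linear_scan(table, person_id):
    """O(n): every row is compared against person_id."""
    return [id_2 for id_1, id_2 in table if id_1 == person_id]


def index_scan(sorted_table, person_id):
    """O(log n) to find the first matching row, then one step per match."""
    i = bisect.bisect_left(sorted_table, (person_id,))
    result = []
    while i < len(sorted_table) and sorted_table[i][0] == person_id:
        result.append(sorted_table[i][1])
        i += 1
    return result


def vertex_lookup(edges_on_vertex, person_id):
    """O(1) on average to reach the edge list, independent of n."""
    return edges_on_vertex.get(person_id, [])


SORTED_KNOWS = sorted(KNOWS_REL)          # stand-in for an index on Id_1
EDGES_ON_VERTEX = defaultdict(list)       # stand-in for vertex-local storage
for id_1, id_2 in KNOWS_REL:
    EDGES_ON_VERTEX[id_1].append(id_2)

print(linear_scan(KNOWS_REL, 0))          # [1, 2, 3, 4]
print(index_scan(SORTED_KNOWS, 0))        # [1, 2, 3, 4]
print(vertex_lookup(EDGES_ON_VERTEX, 0))  # [1, 2, 3, 4]
```

All three return the same answer; they differ only in how the cost of locating the rows grows with the size of Knows_rel. In a recursive search this cost is paid again at every hop, which is where the dependency on n hurts most.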
OVERVIEW
Differentiation from other NoSQL products
[Figure: NoSQL product categories positioned by data size (one axis) and data complexity (other axis): key/value stores, Bigtable clones, document databases, graph databases and in-memory graph databases. Annotation: "> 90% of use cases".]
Source: http://www.slideshare.net/jexp/neo4j-graph-database-presentation-german
NEO4J
Overview
• Graph database + Lucene index
• ACID (isolation level: read committed)
• High availability in the Enterprise edition
• Capacity: 32 billion vertices, 32 billion edges, 64 billion properties
• Embedded or via REST API
• Support for the Blueprints project
NEO4J
Architecture
[Figure: layered architecture, top to bottom]
Cypher/Gremlin | Java/Ruby/.../C# API
REST API
Core API (Java)
Caches (files and objects) | HA
Record files | Transaction log
Disk(s)
Source: http://www.slideshare.net/rheehot/eo4j-12713065
NEO4J
Example of the on-disk layout
[Figure: three vertices (Name: Alice, Age: 23; Name: Bob, Age: 42; Name: Carol, Age: 22) connected by "knows" edges. The properties Name and Age are drawn as separate records attached to their vertex. Each edge record carries four pointers, drawn as either existent or nonexistent:]
SP  Source Previous
SN  Source Next
EP  End Previous
EN  End Next
Source: https://github.com/thobe/presentations
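How the four pointers chain the edges of a vertex together can be sketched with simple records. This is an illustrative model only, not Neo4j's actual file format: the class and method names, the `first_rel` pointer per vertex, the insertion at the head of the chain and the direction of the three "knows" edges are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class RelRecord:
    source: int
    end: int
    rel_type: str
    sp: Optional[int] = None  # Source Previous
    sn: Optional[int] = None  # Source Next
    ep: Optional[int] = None  # End Previous
    en: Optional[int] = None  # End Next


class RecordStore:
    def __init__(self) -> None:
        self.first_rel: Dict[int, Optional[int]] = {}  # vertex id -> chain head
        self.rels: List[RelRecord] = []                # record id = list index

    def add_node(self, node_id: int) -> None:
        self.first_rel[node_id] = None

    def add_rel(self, source: int, end: int, rel_type: str) -> int:
        assert source != end, "self-loops are left out of this sketch"
        rel_id = len(self.rels)
        self.rels.append(RelRecord(source, end, rel_type))
        for node_id in (source, end):
            self._push_front(node_id, rel_id)
        return rel_id

    def _push_front(self, node_id: int, rel_id: int) -> None:
        """Link the new record in as head of the vertex's doubly linked chain."""
        old_head = self.first_rel[node_id]
        new = self.rels[rel_id]
        if new.source == node_id:
            new.sn = old_head
        else:
            new.en = old_head
        if old_head is not None:
            old = self.rels[old_head]
            if old.source == node_id:
                old.sp = rel_id
            else:
                old.ep = rel_id
        self.first_rel[node_id] = rel_id

    def rels_of(self, node_id: int) -> Iterator[int]:
        """Follow SN or EN, depending on which end of the edge the vertex is."""
        rel_id = self.first_rel[node_id]
        while rel_id is not None:
            yield rel_id
            rec = self.rels[rel_id]
            rel_id = rec.sn if rec.source == node_id else rec.en


ALICE, BOB, CAROL = 0, 1, 2
store = RecordStore()
for node in (ALICE, BOB, CAROL):
    store.add_node(node)
store.add_rel(ALICE, BOB, "knows")    # record 0 (direction assumed)
store.add_rel(ALICE, CAROL, "knows")  # record 1 (direction assumed)
store.add_rel(BOB, CAROL, "knows")    # record 2 (direction assumed)
print(list(store.rels_of(BOB)))       # [2, 0]
```

Every edge record sits in two chains at once, one per end vertex, which is why it needs a previous/next pair for the source side and another pair for the end side.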
NEO4J
In-memory layout (cache)
• Transformation of the doubly linked list (on-disk) to objects
• Increases the traversal speed
[Figure: cached Vertex object: ID; relationship ID refs grouped by type (e.g. type = "knows"), with "in: R1 R2 ... Rn" and "out: R1 R2 ... Rn" for each of Type 1 ... Type n; properties as Key 1 ... Key n with Val 1 ... Val n. Cached Edge object: ID, start, end, type; properties as Key 1 ... Key n with Val 1 ... Val n.]
Source: https://github.com/thobe/presentations
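The cached objects described on this slide can be approximated as follows. This is a minimal sketch, not Neo4j's cache implementation: the class names, the nested dictionary layout and the sample edges are assumptions chosen to mirror the figure (relationship ID references grouped by type and direction).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CachedEdge:
    id: int
    start: int
    end: int
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CachedVertex:
    id: int
    properties: Dict[str, Any] = field(default_factory=dict)
    # type -> {"in": [edge ids], "out": [edge ids]}
    rels: Dict[str, Dict[str, List[int]]] = field(default_factory=dict)

    def add_ref(self, edge_type: str, direction: str, edge_id: int) -> None:
        by_direction = self.rels.setdefault(edge_type, {"in": [], "out": []})
        by_direction[direction].append(edge_id)


def load_into_cache(vertex_properties, edges):
    """Turn flat records into vertex objects with grouped edge references."""
    vertices = {vid: CachedVertex(vid, dict(props))
                for vid, props in vertex_properties.items()}
    for edge in edges:
        vertices[edge.start].add_ref(edge.type, "out", edge.id)
        vertices[edge.end].add_ref(edge.type, "in", edge.id)
    return vertices


# Hypothetical sample data (edge directions are assumed).
VERTEX_PROPERTIES = {
    0: {"Name": "Alice", "Age": 23},
    1: {"Name": "Bob", "Age": 42},
    2: {"Name": "Carol", "Age": 22},
}
EDGES = [CachedEdge(0, 0, 1, "knows"),
         CachedEdge(1, 0, 2, "knows"),
         CachedEdge(2, 1, 2, "knows")]

cache = load_into_cache(VERTEX_PROPERTIES, EDGES)
print(cache[1].rels)  # {'knows': {'in': [0], 'out': [2]}}
```

Grouping by type and direction means a traversal that asks for "outgoing knows edges" gets its list directly, instead of walking the whole chain and filtering.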
NEO4J
Traversal
• Relationship expander (delivers the edges of a vertex)
• Evaluators (decide whether a vertex is traversed further and whether it is added to the result set)
• Projection of the result set (e.g. "take the last vertex of the path")
• Uniqueness level (defines, in several gradations, whether a node may be visited more than once)
Source: https://github.com/thobe/presentations
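The interplay of the four building blocks can be sketched as a small breadth-first traversal. This is not the Neo4j traversal API: the function `traverse`, its parameters and the simple on/off uniqueness switch are hypothetical simplifications, and the graph reuses the "knows" edges of the earlier example.

```python
from collections import deque

KNOWS = {0: [1, 2, 3, 4], 1: [0, 2, 3, 4], 2: [], 3: [4], 4: [3]}


def traverse(start, expander, evaluator, projection, unique_nodes=True):
    """Breadth-first traversal driven by pluggable building blocks.

    expander(vertex)  -> neighbouring vertices (relationship expander)
    evaluator(path)   -> (include_in_result, continue_traversal)
    projection(path)  -> the value that goes into the result set
    unique_nodes      -> if True, no vertex is visited twice
    """
    results = []
    visited = {start}
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        include, keep_going = evaluator(path)
        if include:
            results.append(projection(path))
        if not keep_going:
            continue
        for neighbour in expander(path[-1]):
            if unique_nodes:
                if neighbour in visited:
                    continue
                visited.add(neighbour)
            queue.append(path + (neighbour,))
    return results


def expander(vertex):
    return KNOWS[vertex]


def up_to_depth_two(path):
    depth = len(path) - 1
    # Include everything except the start vertex; stop expanding at depth 2.
    return depth >= 1, depth < 2


def last_vertex(path):
    return path[-1]


print(traverse(3, expander, up_to_depth_two, last_vertex))         # [4]
print(traverse(3, expander, up_to_depth_two, last_vertex, False))  # [4, 3]
```

With uniqueness switched off, the walk 3 -> 4 -> 3 is allowed and the start vertex shows up in the result; the evaluator's depth limit is then the only thing that keeps the traversal of a cyclic graph finite.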
NEO4J
Cypher & Gremlin
Feature      Gremlin                   Cypher
Paradigm     Imperative programming    Declarative programming
Description (Gremlin):
• Developed by Marko Rodriguez (TinkerPop)
• Based on XPath to describe the traversal
• Developed using Groovy
• 30-50% faster on "simple" traversals
Description (Cypher):
• In-house development
• Cypher provides greater opportunities for optimization
• Good for traversals that need backtracking
• Output is a table
Example (Gremlin):
    outE[label=HAS_CART].inV
    .outE[label=CONTAINS_ITEM].inV
    .inE[label=PURCHASED].outV
    .outE[label=PURCHASED].inV
Examples (Cypher):
    START
      me=node:people(name={myname})
    MATCH
      me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item,
      item<-[:PURCHASED]-user-[:PURCHASED]->recommendation
    RETURN recommendation

    START
      d=node(1), e=node(2)
    MATCH
      p = shortestPath( d-[*..15]->e )
    RETURN p
Source: https://github.com/thobe/presentations
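The imperative reading of the Gremlin chain, one hop per step, can be mimicked in Python over a toy graph. This is only a sketch of the idea and not Gremlin itself: the helper names `out` and `in_`, the vertex names and the edge list are hypothetical.

```python
# Edges as (source, label, target); all names are made up for illustration.
EDGES = [
    ("me", "HAS_CART", "cart"),
    ("cart", "CONTAINS_ITEM", "book"),
    ("alice", "PURCHASED", "book"),
    ("alice", "PURCHASED", "lamp"),
]


def out(edges, vertices, label):
    """outE[label=...].inV : follow outgoing edges with the given label."""
    return [t for v in vertices for (s, l, t) in edges if s == v and l == label]


def in_(edges, vertices, label):
    """inE[label=...].outV : follow incoming edges with the given label."""
    return [s for v in vertices for (s, l, t) in edges if t == v and l == label]


def recommend(edges, start):
    carts = out(edges, [start], "HAS_CART")
    items = out(edges, carts, "CONTAINS_ITEM")
    buyers = in_(edges, items, "PURCHASED")
    return out(edges, buyers, "PURCHASED")


print(recommend(EDGES, "me"))  # ['book', 'lamp']
```

Each step takes the current set of vertices and produces the next one, which is the imperative style the table refers to. Note that this plain step chain also returns the item that is already in the cart; filtering it out would be an additional explicit step.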
NEO4J
Pricing
Edition       License               Description                                               Price (annual)
"Community"   Open Source (GPLv3)   Complete database including a basic management frontend   0 €
"Advanced"    Commercial and AGPL   + Monitoring, better management frontend and support      6,000 €
"Enterprise"  Commercial and AGPL   + Enterprise frontend, HA and premium support             24,000 €
INFINITEGRAPH
Overview
• Distributed graph database
• Implemented in C++ (APIs in Java, C#, Python, etc.)
• Based on Objectivity/DB (distributed object database)
• Established 1988 in Sunnyvale, California
• Enterprise customers and the US government
• Support for Blueprints
INFINITEGRAPH
Pricing
Edition                   License     Description                                                      Price (annual)
"InfiniteGraph FREE"      Free        Complete database, but limited to 1 million vertices or edges    0 €
"Pay as you go"           Commercial  No limitation                                                    starts at approx. 5000 $ (depends on count of vertices and edges)
"Unit or site licensing"  Commercial  Focus on "bigger" environments                                   >..... € (No price available)
Source: http://objectivity.com/products/infinitegraph/overview
FALLEN-8
Overview
• In-memory graph database
• Implemented in C# (platform-independent thanks to Mono)
• Up to 4 billion vertices or edges; each element can have approx. 65,000 properties
• Indexes on vertices and/or edges
• Core is open source (MIT license), plugins can have any license
FALLEN-8
Persistence
• Persistence in the form of "save-points" (all vertices and edges are serialized en bloc)
• On commodity hardware, approx. 2 million vertices or edges can be (de)serialized per second
• While a save-point is written, only write operations are blocked
• Performance + reliability
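The save-point idea can be sketched with Python's standard library. This is not Fallen-8's implementation (which is written in C#): the class `InMemoryGraph`, its methods and the use of `pickle` as the serialization format are assumptions made for illustration, and the sketch ignores the finer points of concurrent reads.

```python
import pickle
import threading


class InMemoryGraph:
    """Toy in-memory graph whose whole state is saved en bloc."""

    def __init__(self):
        self._write_lock = threading.Lock()
        self.vertices = {}   # vertex id -> properties
        self.edges = []      # (source id, target id, label)

    def add_vertex(self, vertex_id, **properties):
        with self._write_lock:
            self.vertices[vertex_id] = properties

    def add_edge(self, source, target, label):
        with self._write_lock:
            self.edges.append((source, target, label))

    def neighbours(self, vertex_id):
        # Reads do not take the lock, so they are not blocked by a save.
        return [t for s, t, _ in self.edges if s == vertex_id]

    def save(self, path):
        """Write a save-point; writers wait until the dump is complete."""
        with self._write_lock:
            with open(path, "wb") as handle:
                pickle.dump((self.vertices, self.edges), handle)

    @classmethod
    def load(cls, path):
        """Restore a graph from a save-point file."""
        graph = cls()
        with open(path, "rb") as handle:
            graph.vertices, graph.edges = pickle.load(handle)
        return graph
```

A typical use would be `graph.save("graph.savepoint")` at regular intervals and `InMemoryGraph.load("graph.savepoint")` after a restart; the file name is arbitrary. Everything written after the last save-point is lost on a crash, which is the trade-off of this persistence model.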
FALLEN-8
Architecture
[Figure: layered architecture, top to bottom]
Services
Traversal-framework | Index-framework
Core API
Vertices and edges
RAM
FALLEN-8
Architecture and some plugins
[Figure: layered architecture with plugins, top to bottom]
HA + ACID transactions
REST API (via JSON) + Management/query frontend
Traversal-framework (incl. path analysis) | Index-framework (incl. R* tree index)
Core API
Vertices and edges
RAM