Graph Databases

Graph Databases & Neo4j
Girish Khanzode

Graph Databases
• Graph Based NoSQL Database
• Property Graph Model
• Neo4j
• Neo4j Architecture
• Data Storage
• Programmatic Data Access
• Core API
• Lucene
• Auto Index lifecycle
• Traversers API
• Cypher
• Graph Algorithms
• Neo4j HA
• Cache Sharding
• References

Graphs
• A collection of nodes (things) and edges (relationships) that
connect pairs of nodes
– Suitable for any data that is related
• Each relationship connects two nodes
• Both nodes and relationships can hold an arbitrary number of
properties (key-value pairs)

Graphs
• Well-understood patterns and algorithms
– Studied since Leonhard Euler's Seven Bridges of Königsberg (1736)
– Codd's Relational Model (1970)
• Knowledge graph - beyond links, search is smarter when considering how things
are related
• Facebook graph search – people interested in finding things in their part of the
world
• Bing + Britannica: referencing and cross-referencing
• People - relationships to people, to organizations, to places, to things - personal
graph

A Graph Database
• Relationships are first-class citizens
• NoSQL database optimized for connected data
– Social networking, logistics networks, recommendation engines
– Relationships are as important as the records
– Can be orders of magnitude faster than an RDBMS for connected data
(about 1000 times in the performance experiment later in this deck)
• Uses graph structures with nodes, edges and properties to store data
• Graph database products - Neo4j, InfiniteGraph, InfoGrid, OrientDB
• Very fast querying across records

A Graph Database
• Transactional with the usual operations
• RDBMS - can tell sales in last year
• Graph database – can tell customer which book to buy next
• Index-free adjacency
– Every node holds direct pointers to its adjacent elements, so no index lookup is needed to follow a relationship
• Edges hold most of the important information and relations
– nodes to other nodes
– nodes to properties

Graph Based NoSQL Database
• No rigid format of SQL or the tables and columns representation
• Uses a flexible graphical representation - addresses scalability concerns
• Data can be easily transformed from one model to the other using a
graph based NoSQL database
• Nodes are organised by some relationships with one another represented
by edges between the nodes
• Both nodes and the relationships have some defined properties

Graph Based NoSQL Database
• Labelled, directed, attributed multi-graph - the graph contains nodes that
are labelled and carry properties, and the relationships between nodes are
represented by directed edges
• While relational database models can replicate graph models, each
edge traversal would require a join, which is costly

Advantages
• Easier Relationships Analysis
• Very fast for associative data sets
– Like social networks
• Map more directly to object oriented applications
– Object classification and Parent->Child relationships

Disadvantages
• If data is just tabular with not much relationship between the
data, graph databases do not fare well
• OLAP support for graph databases not mature

Performance Experiment
• Compute whether a path exists between two people in a social network
• 1000 persons
• Average 50 friends per person
• pathExists(a, b) limited to depth 4
Database             # persons   Query time
Relational database  1,000       2,000 ms
Neo4j                1,000       2 ms
Neo4j                1,000,000   2 ms
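
The slides do not show how pathExists(a, b) was implemented. As a rough illustration only, a depth-limited breadth-first search over an in-memory friendship graph could look like the sketch below. The class PathExistsSketch and its methods are hypothetical names, not part of the Neo4j API, and friendships are assumed to be undirected.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical in-memory friendship graph; not the Neo4j API.
class PathExistsSketch {
    private final Map<Integer, Set<Integer>> friends = new HashMap<>();

    // Friendship is assumed to be undirected, so both directions are stored.
    void addFriendship(int a, int b) {
        friends.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search that stops once paths would exceed maxDepth hops.
    boolean pathExists(int a, int b, int maxDepth) {
        if (a == b) return true;
        Set<Integer> visited = new HashSet<>();
        visited.add(a);
        Queue<Integer> frontier = new ArrayDeque<>();
        frontier.add(a);
        for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Queue<Integer> next = new ArrayDeque<>();
            for (int person : frontier) {
                Set<Integer> adjacent = friends.getOrDefault(person, Collections.<Integer>emptySet());
                for (int friend : adjacent) {
                    if (friend == b) return true;
                    if (visited.add(friend)) next.add(friend); // enqueue each person once
                }
            }
            frontier = next;
        }
        return false;
    }

    public static void main(String[] args) {
        PathExistsSketch g = new PathExistsSketch();
        // Chain 1 - 2 - 3 - 4 - 5 - 6
        for (int i = 1; i < 6; i++) g.addFriendship(i, i + 1);
        System.out.println(g.pathExists(1, 5, 4)); // 4 hops away: within the limit
        System.out.println(g.pathExists(1, 6, 4)); // 5 hops away: beyond the limit
    }
}
```

Each hop only follows the adjacency sets of the current frontier, so the work depends on the neighbourhood explored rather than on the total number of persons stored.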

Property Graph Model
[Figure: example property graph; the node properties shown are listed below]
name: the Doctor
age: 907
species: Time Lord
first name: Rose
last name: Tyler
vehicle: Skoda
model: Type 40
Graphs - Whiteboard-friendly
• No decomposition, ER design, normalization / denormalization
as needed with RDBMS

Neo4j
• A Graph Database
• A Property Graph containing Nodes, Relationships with Properties on
both
• Manage complex, highly connected data
• Scalable - High-performance with High-Availability
– Traverse 1,000,000+ relationships / second on commodity hardware
• Server with REST API, or Embeddable on the JVM

Neo4j
• Full ACID transactions
• Schema free, bottom-up data model design
• Stable
• Easier than RDBMS since no need for normalization
• Implemented in Java
• Open Source

Neo4j
• Schema free – Data does not have to adhere to any convention
• Support for a wide variety of languages - Java, Python, Perl, Scala - plus
the Cypher query language
• A graph database can be thought of as a key-value store with full support
for relationships
• Graph databases do not remove the need for design; good design still
requires effort

Why Neo4J?
• The internet is a network of pages connected to each other.
What is a better way to model that than in graphs?
• No time lost fighting with less expressive data-stores
• Easy to implement experimental features
• A single instance of Neo4j can house at most 34 billion nodes,
34 billion relationships and 68 billion properties

Neo4j Architecture
[Figure: layered architecture diagram with these labels]
• JVM Language Bindings - Java, Ruby, Clojure…
• REST API
• Graph Matching
• Traversal Framework
• Core API
• Caches
• Memory-Mapped (N)IO
• Filesystem

Data Storage
• Neo4j stores graph data in a number of different store files
• Each store file contains the data for a specific part of the
graph
– neostore.nodestore.db
– neostore.relationshipstore.db
– neostore.propertystore.db
– neostore.propertystore.db.index
– neostore.propertystore.db.strings
– neostore.propertystore.db.arrays

Node Store
• Size: 9 bytes
– 1st byte - in-use flag
– Next 4 bytes - ID of first relationship
– Last 4 bytes - ID of first property of node
• Fixed size records enable fast lookups
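
A small sketch of why fixed-size records make lookups fast: with 9-byte records, the byte offset of a node is just its ID multiplied by the record size, so no search is needed. The code below is a toy encoding that follows the byte layout listed above; it is not Neo4j's actual store code, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Toy encoding of the 9-byte node record described above; not Neo4j's store code.
class NodeRecordSketch {
    // 1 byte in-use flag + 4 bytes first relationship ID + 4 bytes first property ID
    static final int RECORD_SIZE = 9;

    // With fixed-size records the offset is a single multiplication.
    static long offsetOf(long nodeId) {
        return nodeId * RECORD_SIZE;
    }

    static void write(ByteBuffer store, int nodeId, boolean inUse, int firstRelId, int firstPropId) {
        int pos = (int) offsetOf(nodeId);
        store.put(pos, (byte) (inUse ? 1 : 0));
        store.putInt(pos + 1, firstRelId);
        store.putInt(pos + 5, firstPropId);
    }

    static boolean inUse(ByteBuffer store, int nodeId) {
        return store.get((int) offsetOf(nodeId)) == 1;
    }

    static int firstRelationshipId(ByteBuffer store, int nodeId) {
        return store.getInt((int) offsetOf(nodeId) + 1);
    }

    static int firstPropertyId(ByteBuffer store, int nodeId) {
        return store.getInt((int) offsetOf(nodeId) + 5);
    }

    public static void main(String[] args) {
        ByteBuffer store = ByteBuffer.allocate(10 * RECORD_SIZE); // room for 10 node records
        write(store, 3, true, 42, 7);
        System.out.println("offset of node 3 = " + offsetOf(3));
        System.out.println("first relationship = " + firstRelationshipId(store, 3));
        System.out.println("first property = " + firstPropertyId(store, 3));
    }
}
```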

Relationship store
• neostore.relationshipstore.db
• Size: 33 bytes
– 1st byte - in-use flag
– Next 8 bytes - IDs of the nodes at the start and end of the relationship
– 4 bytes - pointer to the relationship type
– 16 bytes - pointers to the next and previous relationship records for each of the start and end nodes (relationship chain)
– 4 bytes - next property ID (property chain)
Data Size
nodes               2^35 (∼ 34 billion)
relationships       2^35 (∼ 34 billion)
properties          2^36 to 2^38 depending on property types (maximum ∼ 274 billion, always at least ∼ 68 billion)
relationship types  2^15 (∼ 32,000)
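
The limits above are powers of two (for example, 2 to the power 35 is about 34 billion). The arithmetic can be checked with a few lines of Java; the class name DataSizeSketch is only illustrative.

```java
// Checks the power-of-two limits quoted on the Data Size slide.
class DataSizeSketch {
    // 2^bits, computed on a long so that values above 2^31 do not overflow.
    static long capacity(int bits) {
        return 1L << bits;
    }

    public static void main(String[] args) {
        System.out.println("2^35 = " + capacity(35)); // nodes, relationships
        System.out.println("2^36 = " + capacity(36)); // properties, lower bound
        System.out.println("2^38 = " + capacity(38)); // properties, upper bound
        System.out.println("2^15 = " + capacity(15)); // relationship types
    }
}
```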

Programmatic Data Access
• Java APIs - JVM languages bind to the same APIs
• JRuby, Jython, Clojure, Scala…
• Manage nodes and relationships
• Indexing – find data without traversal
• Traversing
• Path finding
• Pattern matching

Core API
• Deals with graphs in terms of their fundamentals
• Nodes - properties
– KV Pairs
• Relationships
– Start node
– End node
– Properties
• KV Pairs

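Create Node
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

// Open (or create) an embedded database at the given path
GraphDatabaseService db = new EmbeddedGraphDatabase("/tmp/neo");
Transaction tx = db.beginTx();
try {
    Node theDoctor = db.createNode();
    theDoctor.setProperty("character", "the Doctor");
    tx.success();   // mark the transaction as successful
} finally {
    tx.finish();    // commit if marked successful, otherwise roll back
}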

Create Relationships
// All mutating operations must run inside a transaction
Transaction tx = db.beginTx();
try {
    Node theDoctor = db.createNode();
    theDoctor.setProperty("character", "The Doctor");
    Node susan = db.createNode();
    susan.setProperty("firstname", "Susan");
    susan.setProperty("lastname", "Campbell");
    // Directed relationship: susan -[COMPANION_OF]-> theDoctor
    susan.createRelationshipTo(theDoctor, DynamicRelationshipType.withName("COMPANION_OF"));
    tx.success();
} finally {
    tx.finish();
}

Index a Graph
• Graphs themselves are indexes
• Can create short-cuts to well-known nodes
• In program, keep a reference to any interesting node
• Indexes offer flexibility in what constitutes an “interesting
node”

Lucene
• The default index implementation for Neo4j
– Default implementation for IndexManager
• Supports many indexes per database
• Each index supports nodes or relationships
• Supports exact and regex-based matching
• Supports scoring
– Number of hits in the index for a given item
– Great for recommendations

Create a Node Index
GraphDatabaseService db = …
// Index<Node> is the index type; forNodes("planets") creates the index
// named "planets" or retrieves it if it already exists
Index<Node> planets = db.index().forNodes("planets");

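Create a Relationship Index
// Index<Relationship> is the index type; forRelationships("enemies") creates
// the index named "enemies" or retrieves it if it already exists
Index<Relationship> enemies = db.index().forRelationships("enemies");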

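Exact Matches
Index<Node> actors = doctorWhoDatabase.index().forNodes("actors");
// get(key, value): "actor" is the key, "Roger Delgado" the value to match;
// getSingle() returns the single matching node (null if there is none)
Node rogerDelgado = actors.get("actor", "Roger Delgado").getSingle();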

Query Matches
Index<Node> species = doctorWhoDatabase.index().forNodes("species");
// query(key, query): "species" is the key, "S*n" the wildcard query
IndexHits<Node> speciesHits = species.query("species", "S*n");

Transactions to Mutate Indexes
• Mutating access is still protected by transactions, which cover both the index and the graph
Transaction tx = db.beginTx();
try {
    Node nixon = db.createNode();
    nixon.setProperty("character", "Richard Nixon");
    // Add the node to the "characters" index under the same key and value
    db.index().forNodes("characters").add(nixon,
        "character", nixon.getProperty("character"));
    tx.success();
} finally {
    tx.finish();
}

Auto Index lifecycle
• Auto Index - stays consistent with the graph data
• Specify which property names to index when the auto index is configured
• If a node, relationship or property is removed from the graph, it is also
removed from the index
• If the database is started with auto indexing enabled but with different
auto-indexed properties than in the last run, already auto-indexed entities
are removed from the index as they are worked upon
• Re-indexing is a manual process
– Existing properties are not indexed unless touched
Auto Index lifecycle
AutoIndexer<Node> nodeAutoIndex = graphDb.index().getNodeAutoIndexer();
nodeAutoIndex.startAutoIndexingProperty("species");
nodeAutoIndex.setEnabled( true );
ReadableIndex<Node> autoNodeIndex = graphDb.index().getNodeAutoIndexer().getAutoIndex();
• Both node and relationship auto indexes are supported

Core API
• Basic (nodes, relationships)
• Fast
• Imperative
• Flexible - Easily intermix mutating operations

Traversers API
• Mechanisms for querying the graph by navigating from a starting node to
related nodes, according to an algorithm, to find answers
• Expressive
• Fast
• Declarative (mostly)
• Opinionated

Cypher - A Graph Query Language
• Query Language for Neo4j
• A declarative graph pattern matching language
– SQL for graphs
– Tabular results
• aggregation, ordering and limits
• Mutating operations
• CRUD
• Easy to formulate queries based on relationships
• Many features stem from improving pain points of SQL like join tables

Query
• Query:
MATCH (n:Crew)-[r:KNOWS*]-(m)
WHERE n.name = 'Neo'
RETURN n AS Neo, r, m

Operations
• Aggregation - COUNT, SUM, AVG, MAX, MIN, COLLECT
• Where clause
start doctor=node:characters(name = 'Doctor')
match (doctor)<-[:PLAYED]-(actor)-[:APPEARED_IN]->(episode)
where actor.actor = 'Tom Baker' and episode.title =~ /.*Dalek.*/
return episode.title
• Ordering
– order by <property>
– order by <property> desc

Graph Algorithms
• Neo4j has built-in algorithms
• Callable through JVM and REST APIs
• Higher level of abstraction
• Graph Matching
– Look for patterns in a data set - retail analytics
– Higher-level abstraction than raw traversers
• REST API
– Access the server
• Binary protocol
– JSON as default format

Neo4j HA - High Availability Cluster
• A scalability package known as high availability or HA that
uses a master-slave cluster architecture
– Full data redundancy
– Service fault tolerance
– Linear read scalability
– Master-slave replication
• Single data-centre or global zones
– tolerance for high-latency

Neo4j HA
• Redundancy - improved uptime
– automatic failover
• In a Neo4j HA cluster the full graph is replicated to every instance
(server) in the cluster
• Read operations can be done locally on each slave
• Read capacity of the HA cluster increases linearly with the number of
servers

HA Cluster Architecture
• Cluster performs automatic master election
• Supports master-slave replication for clustering and DR
across sites

Write to a Master
• All write operations are co-ordinated by the master
• Writes to the master are fast
• Slaves eventually catch up

Write to a Slave
• Writes to a slave cause a synchronous transaction
with the master
• Other slaves eventually catch up

Server Overload Problem
• Unlike other classes of NoSQL database, a graph does not
have predictable lookup since it is a highly mutable structure
• We want to co-locate related nodes for traversal
performance, but we don’t want to place so many connected
nodes on the same database instance that it becomes heavily loaded
• The black-hole problem - popular nodes get lumped together
on a single instance; the point cut is low, but that instance
becomes heavily loaded

Thinly Spread Network
• The opposite is also true: we don’t want connected nodes spread too widely
across different database instances, since traversals that cross the
(relatively high-latency) network incur a substantial performance penalty
at runtime
• Load-leveling alone can lead to many relationships crossing instances
• These are very expensive to traverse, because networks are many orders of
magnitude slower than in-memory traversals

Minimal Point Cut
• The best approach is to balance a graph across database instances by
creating a minimum point cut for a graph, where graph nodes are placed
such that there are few relationships that span shards
• Good strategy is to take a local view of the graph (no global locks) and
work incrementally (short bursts)
• Take into account use patterns
• Unlike other NoSQL stores, graphs are not predictable, so we cannot use
techniques like consistent hashing for scale out
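
One way to compare candidate placements is to count how many relationships span shards; a lower count means fewer expensive cross-instance traversals. The sketch below only evaluates a given placement and does not compute a minimum cut. The class CutSizeSketch, the edge-list representation and the example graph are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Evaluates a node-to-shard placement; it does not search for the best one.
class CutSizeSketch {
    // Counts relationships whose start and end nodes are placed on different shards.
    static int crossShardRelationships(int[][] relationships, Map<Integer, Integer> shardOf) {
        int cut = 0;
        for (int[] rel : relationships) {
            int startShard = shardOf.get(rel[0]);
            int endShard = shardOf.get(rel[1]);
            if (startShard != endShard) cut++;
        }
        return cut;
    }

    public static void main(String[] args) {
        // Two triangles (1,2,3) and (4,5,6) joined by the single relationship 3-4.
        int[][] rels = { {1, 2}, {2, 3}, {1, 3}, {4, 5}, {5, 6}, {4, 6}, {3, 4} };

        Map<Integer, Integer> clustered = new HashMap<>();   // keeps each triangle together
        Map<Integer, Integer> alternating = new HashMap<>(); // spreads nodes by parity
        for (int n = 1; n <= 6; n++) {
            clustered.put(n, n <= 3 ? 0 : 1);
            alternating.put(n, n % 2);
        }
        System.out.println("clustered cut   = " + crossShardRelationships(rels, clustered));
        System.out.println("alternating cut = " + crossShardRelationships(rels, alternating));
    }
}
```

In this example the clustered placement is cut by one relationship while the parity-based placement is cut by five, which is the "thinly spread" situation described earlier.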

Cache Sharding
• A strategy for large data sets of terabyte scale
• Mandates consistent request routing
• For instance, requests for user A are always sent to server 1,
while requests for user B are always sent to server 2 and so on
• The key assumption is that requests for user A typically touch
parts of the graph around user A, such as his or her friends,
preferences, likes and so on

Cache Sharding
• This means that the neighbourhood of the graph around user
A will be cached on server 1, while the neighbourhood around
user B will be cached on server 2
• By employing consistent routing of requests, the caches of all
servers in the HA cluster can be utilized maximally
• Strategy is highly effective for managing a large graph that
does not fit in RAM

Consistent Routing
• Always try to route related requests to the same server to hopefully
benefit from warm caches
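
The slides do not say how requests are mapped to servers. One simple possibility, shown here purely as an assumption, is to hash the user ID and take it modulo the number of servers, so the same user always lands on the same server. Note that this routes requests only; every server in the HA cluster still holds the full graph. The class and method names are hypothetical.

```java
// Hypothetical request router: the same user ID always maps to the same server index.
class ConsistentRoutingSketch {
    static int serverFor(String userId, int serverCount) {
        // floorMod keeps the result in [0, serverCount) even for negative hash codes.
        return Math.floorMod(userId.hashCode(), serverCount);
    }

    public static void main(String[] args) {
        int servers = 3;
        for (String user : new String[] { "userA", "userB", "userA" }) {
            System.out.println(user + " -> server " + serverFor(user, servers));
        }
    }
}
```

A drawback of plain modulo routing is that changing the number of servers remaps most users and cools their caches, so a real deployment might prefer a sticky load balancer or a more stable mapping.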

Domain Specific Sharding
• Graphs are not as easy to shard as documents or KV stores
• High performance graph databases are limited in the data set size that
can be handled by a single machine
• Replicas speed up reads and improve availability, but the data set size
is still limited to a single machine’s disk/memory
• No perfect algorithm exists, but an expert's domain insight helps

Domain Specific Sharding
• Some domains can shard easily (geo, most web apps) using the consistent
routing approach and cache sharding
– Geo - connections between cities are few compared with the
connections within cities, so cities or countries can be placed on
different servers
• Eventually, at petabyte scale, data cannot practically be replicated
• Need to shard data across machines

References
1. http://www.neo4j.org
2. http://www.neo4j.org/learn/cypher
3. Bachman, Michal (2013). GraphAware - Towards Online Analytical Processing in Graph Databases.
http://graphaware.com/assets/bachman-msc-thesis.pdf
4. Hunger, Michael (2012). Cypher and Neo4j. http://vimeo.com/83797381
5. Mistry, Deep. Neo4j: A Developer’s Perspective.
http://osintegrators.com/opensoftwareintegrators%7Cneo4jadevelopersperspective
6. MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
7. Parallel Breadth First Search on GPU Clusters
8. DB-Engines Ranking of Graph DBMS

Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
