Analysis on the performance of graph query languages: Comparative study of Cypher, Gremlin and native access in Neo4j
1. 1
Analysis on the
performance of graph query
languages: Comparative
study of Cypher, Gremlin
and native access in Neo4j
Athiq Ahamed, ITIS, TU-Braunschweig
Supervised by: Dr. Lena Wiese
Georg-August-University Göttingen
Prof. Dr. René Peinl, Florian Holzschuher
Performance of graph query languages:
Comparison of Cypher,
Gremlin and native access in Neo4j
2. 2
Agenda
• RDBMS
• Reason for NoSQL
• Categories of NoSQL databases
• Comparison of popular NoSQL databases
• Motivation
• Neo4j and Query Languages
• Comparison of Neo4j to other databases
• Testing (importance of benchmarking, different suites)
• Results
• Limitations and Future work
3. 3
Introduction 1
RDBMS
• For decades, relational databases have been the dominant choice
• Structured Query Language (SQL) retrieves data with ease
• Currently, outsized volumes of dynamic data are being generated
• Relational databases impose strict schemas and need joins across
several tables to answer queries
• Not a good fit for the current situation
• So we require dynamic schemas, high scalability, high
performance and so on
4. 4
Introduction 2
• NoSQL databases are now often the first choice and solve most of
these problems
• Graph databases are best suited for storing networks of data
(e.g. social networks)
Features
– Each NoSQL database has its own query language
– NoSQL databases trade either availability or consistency in favor of
partition tolerance (CAP)
– Neo4j, Cassandra, MongoDB and BigTable, to name a few
• They are an ideal choice for Web 2.0
5. 5
NoSQL databases
• Four important categories of NoSQL databases
– Key-value stores: simplest and easiest to implement; a hash table in
which a unique key points to a value. Examples: Redis, Oracle BDB,
Voldemort
– Column family stores: widely used for distributed data; keys point to
multiple columns. Example: Google's BigTable model
– Document stores: used for semi-structured data, stored in a JSON
format; similar to key-value stores. Example: MongoDB
– Graph databases: used for storing graph-like data, e.g. social
networks. Example: Neo4j
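To make the four categories more concrete, the following sketch shapes one and the same made-up "person" record for each category using plain Python structures. It only illustrates the data shapes. None of the keys, names or values come from the paper, and real products differ in many details.

```python
# Illustrative only: all keys, names and values below are made up.

# Key-value store: a unique key points to an opaque value.
key_value = {"person:1": b"Alice;Berlin"}

# Column family store: a row key points to multiple named columns.
column_family = {"person:1": {"name": "Alice", "city": "Berlin"}}

# Document store: a key points to a nested, JSON-like document.
document = {
    "person:1": {
        "name": "Alice",
        "address": {"city": "Berlin"},
        "groups": ["hiking", "chess"],
    }
}

# Graph database: nodes plus explicit, typed relationships between them.
graph = {
    "nodes": {1: {"name": "Alice"}, 2: {"name": "Bob"}},
    "relationships": [(1, "FRIEND_OF", 2)],
}
```

The graph shape is the only one in which the connection between two records is stored as data in its own right, which is why it suits social networks.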
6. 6
Comparison Between Popular NoSQL Databases
• MongoDB (document-oriented), rank no. 1
– Replication and failover for high availability
– Consistent by default, auto-sharding to ease scalability,
replication, full index support
• Cassandra (wide column), rank no. 2
– Trades consistency for high availability
– Incremental scalability, high availability, eventual consistency
• Neo4j (graph), rank no. 5
– Very similar to MongoDB, with blocking replication and a cluster
setup for high availability
– Scalable clustering support, runtime failover, live backup support
7. 7
Different types of DBs and Languages
Database type: Languages
• Relational databases: SQL
• XML databases: XPath, XQuery
• RDF: RQL, SPARQL
• Object-oriented: OQL
• Multidimensional: MDX
• Graph: Cypher, Gremlin
8. 8
Motivation
• To measure the performance of different graph query
languages and native access in Neo4j
• Compare ease of understanding, code readability and
maintainability of the languages
• Test the performance and correctness of these graph
databases
• Apache Shindig, for hosting OpenSocial applications
• Compare performance of different back-ends on Neo4j
9. 9
Neo4j and Query Languages
• Neo4j is an open-source NoSQL graph database
• It implements the property graph data model
• Neo4j has a native Java API with a traversal framework
• Features
– Supports ACID properties
– Runtime failover
– High performance
– Scalability
– Very good documentation
– Very good query language, Cypher
• Cypher: a declarative query language similar to SQL
• Gremlin: a Groovy-based query language
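In the property graph data model, both nodes and relationships carry key-value properties, and relationships are directed and typed. The following is a minimal in-memory sketch of that idea. It is not Neo4j's API: the class and method names (`PropertyGraph`, `add_node`, `relate`, `outgoing`) and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int          # id of the source node
    rel_type: str       # e.g. "FRIEND_OF"
    end: int            # id of the target node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy model of a property graph, not a real database."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        return self.nodes[node_id]

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, rel_type, end, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node_id, rel_type):
        """Follow outgoing relationships of one type from a node."""
        return [self.nodes[r.end] for r in self.relationships
                if r.start == node_id and r.rel_type == rel_type]


g = PropertyGraph()
g.add_node(1, displayName="Alice")
g.add_node(2, displayName="Bob")
g.relate(1, "FRIEND_OF", 2, since=2012)
names = [n.properties["displayName"] for n in g.outgoing(1, "FRIEND_OF")]
```

A declarative language such as Cypher describes the pattern to match, while native access and Gremlin spell out steps comparable to the `outgoing` call.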
10. 10
Comparison of Neo4j to other DBs
Existing Work
• Neo4j and MySQL
– Neo4j retrieved results faster than relational databases
– More flexible than MySQL
– Query times are 2-5 times lower than MySQL for their 500-object
data set
– Neo4j performed better than SQL at structural queries
– Neo4j was slower than MySQL with integer data
– So Neo4j is used for queries such as friendships and movie favorites,
and for more complicated commercial queries
• Neo4j and other graph databases
– Data used for testing performance: 1k, 32k and 1M nodes, ranging from
9k to 8.4 million relationships
– Jena and HypergraphDB were not able to load the database in the
specified time
– DEX and Neo4j were able to load the largest benchmark sizes
– Jena could load the graph with 1M nodes faster than Neo4j, but it
could not scale
– Neo4j is faster than DEX for the large dataset, and the reverse holds
for the small dataset
– DEX is able to scale better, whereas Neo4j obtained a good throughput
11. 11
Setup
• Apache Shindig 2.5, for hosting OpenSocial applications
• Neo4j has a native Java API with retrieval and traversal methods
• The API is directly accessible when Neo4j runs in embedded mode
• A RESTful (REST stands for Representational State Transfer)
web service interface
• Several wrappers for various programming languages such as
Python and Java
• Cypher is used for all CRUD (create, read, update and delete)
operations
• Gremlin supports both imperative and declarative querying
12. 12
Data Used for testing
• 2011 people
• 26,982 messages
• 24,365 activities
• 2,000 addresses
• 200 groups
• 100 organizations
• A larger dataset with 10,003 people was also tested
• In the smaller dataset, each person had at least 1 and at most 667
friends, out of 25,0000 friendship relationships
• In the larger dataset of 10,003 people, there were 137,000
friendships in total, with a maximum of 1,448 friends for one
person
13. 13
Suites used for testing
• Neo4j embedded
• Neo4j REST
• Neo4j Cypher embedded
• Neo4j Cypher REST
• Neo4j Gremlin REST
• MySQL JPA
• These suites retrieve profiles, friends, group recommendations
and other social networking features
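A suite of this kind runs the same logical operation against several back-ends and compares timings. The sketch below shows one common way to do that: warm-up runs, repeated measurements, then median and standard deviation. It does not reproduce the paper's harness. The `benchmark` function, its parameters and the two stand-in operations are hypothetical and do not call Neo4j or MySQL.

```python
import statistics
import time


def benchmark(operation, runs=30, warmup=5):
    """Time `operation` repeatedly; return (median, stdev) in milliseconds."""
    for _ in range(warmup):            # warm caches before measuring
        operation()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), statistics.stdev(samples)


# Hypothetical stand-ins for real back-end calls.
def fake_embedded_lookup():
    return sum(range(1_000))


def fake_remote_lookup():
    return sum(range(20_000))


if __name__ == "__main__":
    for name, op in [("embedded (stand-in)", fake_embedded_lookup),
                     ("remote (stand-in)", fake_remote_lookup)]:
        median_ms, stdev_ms = benchmark(op)
        print(f"{name}: median {median_ms:.3f} ms, stdev {stdev_ms:.3f} ms")
```

Reporting the median together with a spread measure makes run-to-run fluctuations visible instead of hiding them in a single average.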
14. 14
Results 1
Comparison of query languages and native access
• Native object access
– Retrieval and traversal methods, with a traversal framework
– Difficult to learn
– Several lines of code for simple retrieval
– Comparable
• Cypher
– Declarative query language that covers all CRUD operations
– Easy to learn
– Simple and easy to understand
– Good for complex retrieval
• Gremlin
– Groovy-based query language with a compact syntax
– Difficult to learn
– Compact syntax, difficult to understand
– Good for small retrieval
• SQL
– Structured query language, simple to understand
– Easy to learn
– Several lines of code
– Slows down for complicated queries
15. 15
Results 2 - Gremlin vs. Cypher
Cypher
START person=node:people(id = {id})
MATCH person-[:FRIEND_OF]->friend-[:FRIEND_OF]->friend_of_friend
WHERE not(friend_of_friend<-[:FRIEND_OF]-person)
RETURN friend_of_friend, COUNT(*)
ORDER BY COUNT(*) DESC
Gremlin
t = new Table();
x = [];
g.idx('persons')[[id:id_param]].out('FRIEND_OF').fill(x);
g.idx('persons')[[id:id_param]].out('FRIEND_OF').
    out('FRIEND_OF').dedup().except(x).id.as('ID').
    back(1).displayName.as('name').
    table(t,['ID','name']){it}{it}.iterate();
t
Friend suggestion for a person
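The logic of the friend suggestion query can be followed in plain code: walk two FRIEND_OF hops from a person, drop anyone who is already a direct friend, and rank the rest by how often they were reached. The sketch below does this on an in-memory adjacency dictionary. The function name `suggest_friends` and the sample data are made up. As a choice made here for illustration, it also skips the start person, which the queries shown do not do explicitly.

```python
from collections import Counter


def suggest_friends(friends, person):
    """Rank friends-of-friends of `person` by number of connecting paths.

    `friends` maps each person to the set of people they are FRIEND_OF.
    """
    direct = friends.get(person, set())
    counts = Counter()
    for friend in direct:
        for fof in friends.get(friend, set()):
            # Skip existing friends and (an assumption of this sketch)
            # the start person.
            if fof != person and fof not in direct:
                counts[fof] += 1
    # Highest count first, like ORDER BY COUNT(*) DESC.
    return counts.most_common()


# Made-up sample data for illustration.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave", "erin"},
    "carol": {"alice", "dave"},
    "dave": set(),
    "erin": set(),
}
suggestions = suggest_friends(friends, "alice")
# suggestions == [("dave", 2), ("erin", 1)]
```

Here dave is reached through both bob and carol, so he ranks above erin, who is reached only through bob.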
16. 16
Results 3 - Gremlin vs. Cypher
Queries: Cypher and Gremlin performance
Friend queries (simple): Gremlin is a bit faster than Cypher
People queries: Gremlin is slower than Cypher
Message queries: Gremlin is on par with Cypher
FOAF queries (complicated): Cypher is better than Gremlin
• Gremlin is slower when complicated pattern matching is involved
• For complex queries with many properties and relationships, Cypher
outperformed Gremlin
• Gremlin is better for simple cases
17. 17
Results 4 - from Original Paper
Figure 1: 2000 people in ms
Figure 2: Gremlin vs Cypher in ms
18. 18
Results 5
• An embedded instance is far faster than a DBMS accessed over the
network
• Neo4j query languages outperform JPA for friend queries
• Remote access over REST is slower than embedded Neo4j native object
access
• JPA vs. RESTful Cypher and Gremlin is a very interesting comparison
– For person profiles, the JPA back-end performs as well as RESTful
Cypher
19. 19
Results 6
• Friend queries are more than one order of magnitude slower
for JPA
• Neo4j showed constant performance when increasing from
2000 to 10,000 persons
• MySQL's performance drops by a factor of 5 for people queries
• MySQL's performance drops by a factor of 7-9 for people's
friends queries
• The RESTful case is slower than JPA in most cases
20. 20
Limitations
• The data used was realistic only to an extent
• Results always showed some fluctuations
• Because of these fluctuations, the results are not well suited for
benchmarking or as a basis for further research
• Different Cypher queries were used for the embedded and
REST benchmarks
• Neo4j's normal server settings were used
• Neo4j's advanced version with load balancing was not tested
21. 21
Conclusion and Future work
• Analyzed the performance and programming effort for
different back-ends
• Compared JPA back-end using MySQL with Cypher and
Gremlin
• Neo4j with Cypher had better performance overall
• Gremlin performed better with simple queries
• Cypher performed better with complicated queries
• Neo4j is a good replacement for a traditional RDBMS for
Web 2.0
• Future work: implement and test the interesting approach of
Spring Data Neo4j