Neo, Titan & Cassandra

NEO4J, TITAN & CASSANDRA
Comparisons

JOHN JENSON
12 YEARS BUILDING SOFTWARE FOR THE WEB
Salt Lake City
 Software consultant, developer, project lead, on dozens of projects
 Extensive experience developing Java apps and using Oracle
 Two largest relational databases, banking database, PHI Database
 Massive relational databases, tables in excess of 500 million rows
Boston
 Cengage Learning – Principal Engineer
 MEAN stack development
 A year of experience developing with Mongo and Neo4J
 Researched Cassandra and Titan
 TandemSeven – Principal Architect
 Software Consulting

GRAPH DATABASES
 Arcs and Nodes
 Objects and the relationships between them
 Objects (nodes)
 Schemaless
 Can have arbitrary attributes
 Relationships (arcs)
 Have a type
 Can also have arbitrary attributes
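A minimal sketch of the model described above, assuming plain Python dataclasses stand in for the database. The class names Node and Relationship and the sample attribute values are illustrative only and are not part of any graph database API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    """An object in the graph; attributes are free-form (schemaless)."""
    id: int
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed arc between two nodes, with its own attributes."""
    type: str
    start: Node
    end: Node
    attributes: dict[str, Any] = field(default_factory=dict)


# Two nodes with different attribute sets, joined by one typed relationship.
actor = Node(1, {"name": "Alice"})
movie = Node(2, {"title": "Example Movie", "released": 2014})
acted_in = Relationship("ACTED_IN", actor, movie, {"role": "Lead"})
```

Nothing forces two nodes to share the same attributes, which is the schemaless property the slide refers to.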

GRAPH DATABASES
 Anything that can be modeled in a graph database could also be
modeled in a relational database.
 Nothing new here
 Querying a tree in SQL sucks
 The power of a graph database comes from the query language
 Oracle provides “connect by” feature for trees, but it only works for trees and you
have to use Oracle
 What if your data is highly connected and breaks the rules of a tree?
 Good luck bringing all of your data into memory and writing your own algorithm to
traverse your data
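To make the last bullet concrete, here is a rough sketch of the kind of hand-written traversal it describes, assuming the whole graph has already been loaded into memory as an adjacency dict. The function name reachable and the sample graph are hypothetical. The seen set is what handles data that "breaks the rules of a tree" (shared children and cycles).

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            # Skip nodes already visited, so cycles do not loop forever.
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# "d" has two parents and points back to "a", so this is not a tree.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
print(reachable(graph, "a"))
```

A graph query language expresses the same walk declaratively and runs it inside the database instead of in application memory.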

NEO4J
 Cypher Query
 Powerful query language (this is what sets Neo4J apart from other graph DBs)
MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)
WHERE movie.title =~ "T.*"
RETURN movie.title as title, collect(actor.name) as cast
ORDER BY title ASC LIMIT 10;
 http://docs.neo4j.org/chunked/stable/examples-from-sql-to-cypher.html (SQL to Cypher reference)
 http://docs.neo4j.org/refcard/2.0/ (useful cypher reference)

COLUMN DATABASES
 Rows are organized into individual tables
 Columns are represented as rows in those tables
 Designed to reduce IO and seek times when accessing data
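One loose way to picture this layout, assuming a nested dict as a stand-in for a single table: each row key maps to its own set of named columns, kept in sorted order. The helpers put and get_row are hypothetical and only illustrate the idea, not how any particular store lays data out on disk.

```python
# Hypothetical in-memory stand-in for one table: row key -> {column name: value}.
table = {}


def put(row_key, column, value):
    """Write a single column value; rows need not share the same columns."""
    table.setdefault(row_key, {})[column] = value


def get_row(row_key):
    """Return one row's columns sorted by column name."""
    return sorted(table.get(row_key, {}).items())


put("user1", "last_name", "smith")
put("user1", "first_name", "john")
put("user2", "email", "jane@example.com")  # a row with a different column set
print(get_row("user1"))
```

Because every column is just another entry under its row key, a row can grow very wide without changing the table definition.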

CASSANDRA
 Massively distributed
 Largest reported production deployment runs 75,000+ nodes
 CQL – Cassandra Query Language
 No joins or subqueries
SELECT *
FROM users
WHERE last_name = 'smith';
 MapReduce
 Hadoop is all you get

CASSANDRA - PERFORMANCE
"In terms of scalability, there is a clear winner throughout our
experiments. Cassandra achieves the highest throughput for the
maximum number of nodes in all experiments" although "this comes
at the price of high write and read latencies." – University of Toronto
 Absolutely amazing throughput
 Not so amazing response times for each individual query

TITAN
 Runs on Cassandra, HBase, or Oracle Berkeley DB
 Can store massive graphs
 Doesn’t support Cypher Query
 Supports Gremlin
 http://sql2gremlin.com/ (useful gremlin reference)

CASSANDRA VS NEO4J
15 points of comparison

Point 1
CASSANDRA
Cassandra is a non-relational data store that
stores data in tables. Cassandra organizes
columns into rows and rows into tables.
NEO4J
Neo is a graph database that organizes data
into arcs and nodes.

Point 2
CASSANDRA
Because columns are stored as rows, tables
can have a huge number of columns
(maximum of 2 billion columns).
NEO4J
Neo can house at most 34 billion nodes, 34
billion relationships, and 68 billion properties in
total.

Point 3
CASSANDRA
All tables must have a primary key; its partition
key is the basis for sharding the data.
NEO4J
Indexes can be added and removed wherever
desired.
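The Cassandra side of Point 3, sharding by key, can be sketched roughly as follows. This is a simplification under stated assumptions: the function node_for and the node names are hypothetical, and a plain hash-modulo stands in for Cassandra's actual token ring and partitioner.

```python
import hashlib


def node_for(partition_key: str, nodes: list[str]) -> str:
    """Pick the node that owns a key by hashing the key (simplified)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    # Assumption: modulo placement; a real cluster assigns token ranges.
    return nodes[token % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
for key in ["smith", "jones", "lee"]:
    print(key, "->", node_for(key, nodes))
```

The same key always hashes to the same node, which is why a lookup by key can go straight to the machine holding the data.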

Point 4
CASSANDRA
Cassandra has impressive HA capabilities that
can span multiple data centers with little effort.
NEO4J
Neo uses master-slave replication.

Point 5
CASSANDRA
Cassandra can elegantly run on huge clusters
that replicate and shard data effortlessly.
NEO4J
Neo doesn’t shard your data.

Point 6
CASSANDRA
Cassandra scales linearly by adding more
hardware. There is pretty much no limit to the
hardware that you can add.
NEO4J
Neo read throughput scales linearly with the
number of servers, but the number of servers
in a cluster has to stay relatively small.

Point 7
CASSANDRA
The dataset can grow virtually endlessly while
still getting the same performance.
NEO4J
The dataset size is limited to at most 34 billion
nodes, 34 billion relationships, and 68 billion
properties in total.

Point 8
CASSANDRA
Cassandra does not use a master/slave
paradigm, so there is no downtime when a
machine dies.
NEO4J
If the master dies, there is a brief window of
downtime while a new master is elected.

Point 9
CASSANDRA
Cannot do traversal queries.
NEO4J
Traversal queries that have exponential cost
on traditional RDBMS have linear cost on Neo.

Point 10
CASSANDRA
Write performance is just as good as read
performance.
NEO4J
Write performance is slower than read
performance.

Point 11
CASSANDRA
Every query has additional latency due to
cluster overhead.
NEO4J
Individual queries can be serviced much faster
with far less latency.

Point 12
CASSANDRA
ACID is only partially supported: writes are
atomic, isolated, and durable per row, and
consistency is tunable.
NEO4J
ACID transactions are fully supported and
completely consistent, but there is a
performance hit for the consistency.

Point 13
CASSANDRA
Cassandra can perform operations completely
synchronously or alternatively at various
levels of consistency with corresponding
performance on an operation-by-operation
basis.
NEO4J
Consistency is not tunable.
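The tunable consistency in Point 13 comes down to simple arithmetic: a read is guaranteed to see the latest write when the replicas read plus the replicas written exceed the replication factor. A small sketch of that rule, with hypothetical helper names quorum and is_strongly_consistent:

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3  # assumed replication factor for the example
print(is_strongly_consistent(quorum(n), quorum(n), n))  # quorum reads and writes
print(is_strongly_consistent(1, 1, n))                  # fastest, weakest setting
```

Lower settings touch fewer replicas per operation, which is where the "corresponding performance" trade-off comes from.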

Point 14
CASSANDRA
Cassandra uses its own query language
(CQL) that has similar syntax to SQL (no joins).
NEO4J
Neo uses Cypher and also supports Gremlin.

Point 15
CASSANDRA
Instead of performing joins at runtime, data
must be denormalized beforehand.
NEO4J
Graphs are normalized and highly connected.
Traversals are very fast.
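The denormalization in Point 15 can be pictured as writing the same record once per query pattern. In this sketch two Python dicts stand in for two Cassandra tables; the names users_by_id, users_by_last_name, and add_user are hypothetical.

```python
# One "table" per query pattern; each holds its own copy of the data.
users_by_id = {}
users_by_last_name = {}


def add_user(user_id, first_name, last_name):
    """Write the record to every table that a query will read from."""
    record = {"id": user_id, "first_name": first_name, "last_name": last_name}
    users_by_id[user_id] = record
    users_by_last_name.setdefault(last_name, []).append(record)


add_user(1, "john", "smith")
add_user(2, "jane", "smith")
add_user(3, "ann", "lee")

# Each lookup is a single key read; no join is needed at query time.
print([u["first_name"] for u in users_by_last_name["smith"]])
```

The cost moves to write time and storage, which is why query patterns have to be known before the tables are designed.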

CASSANDRA
“Unlike in relational databases, it’s not easy to tune or introduce new
query patterns in Cassandra by simply creating secondary indexes
or building complex SQLs (using joins, order by, group by?) because
of its high-scale distributed nature. So think about query patterns up
front, and design column families accordingly.” – eBay
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.U-j6fICwIeY

NEO4J
“A single instance of Neo4j can house at most 34 billion nodes, 34
billion relationships, and 68 billion properties, in total. Businesses like
Google obviously push these limits, but in general, this does not pose
a limitation in practice. It is also important to understand that these
limits were chosen purely as a storage optimization, and do not
indicate any particular shortcoming of the product. They are easily,
and are in fact being, increased.”
http://info.neotechnology.com/rs/neotechnology/images/Understanding%20Neo4j%20Scalability(2).pdf

Gremlin
g.V('customerId','ALFKI').as('customer')
.out('ordered').out('contains').out('is').as('products')
.in('is').in('contains').in('ordered').except('customer')
.out('ordered').out('contains').out('is').except('products')
.groupCount().cap().orderMap(T.decr)[0..<5].productName
Cypher
MATCH (c1)-[:ordered]->(o1)-[:contains]->(p1)<-[:contains]-(o2)<-[:ordered]-(c2)
      -[:ordered]->(o3)-[:contains]->(p2)
WHERE c1.customerId = "ALFKI" AND c1 <> c2 AND p1 <> p2
RETURN p2.productName, count(p2) AS num
ORDER BY num DESC LIMIT 5

SQL
SELECT TOP (5) [t14].[ProductName]
FROM (SELECT COUNT(*) AS [value],
             [t13].[ProductName]
      FROM [customers] AS [t0]
      CROSS APPLY (SELECT [t9].[ProductName]
                   FROM [orders] AS [t1]
                   CROSS JOIN [order details] AS [t2]
                   INNER JOIN [products] AS [t3]
                     ON [t3].[ProductID] = [t2].[ProductID]
                   CROSS JOIN [order details] AS [t4]
                   INNER JOIN [orders] AS [t5]
                     ON [t5].[OrderID] = [t4].[OrderID]
                   LEFT JOIN [customers] AS [t6]
                     ON [t6].[CustomerID] = [t5].[CustomerID]
                   CROSS JOIN ([orders] AS [t7]
                     CROSS JOIN [order details] AS [t8]
                     INNER JOIN [products] AS [t9]
                       ON [t9].[ProductID] = [t8].[ProductID])
                   WHERE NOT EXISTS(SELECT NULL AS [EMPTY]
                                    FROM [orders] AS [t10]
                                    CROSS JOIN [order details] AS [t11]
                                    INNER JOIN [products] AS [t12]
                                      ON [t12].[ProductID] = [t11].[ProductID]
                                    WHERE [t9].[ProductID] = [t12].[ProductID]
                                      AND [t10].[CustomerID] = [t0].[CustomerID]
                                      AND [t11].[OrderID] = [t10].[OrderID])
                     AND [t6].[CustomerID] <> [t0].[CustomerID]
                     AND [t1].[CustomerID] = [t0].[CustomerID]
                     AND [t2].[OrderID] = [t1].[OrderID]
                     AND [t4].[ProductID] = [t3].[ProductID]
                     AND [t7].[CustomerID] = [t6].[CustomerID]
                     AND [t8].[OrderID] = [t7].[OrderID]) AS [t13]
      WHERE [t0].[CustomerID] = N'ALFKI'
      GROUP BY [t13].[ProductName]) AS [t14]
ORDER BY [t14].[value] DESC
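For readers who want the intent of the three queries without the syntax, here is a simplified in-memory sketch: find customers who share a purchase with the target customer, then rank the products they bought that the target has not. The function recommend and the sample data are hypothetical (only the customer id ALFKI comes from the slides), and each co-purchasing customer is counted once per product rather than once per path as the graph queries do.

```python
from collections import Counter


def recommend(purchases, customer, limit=5):
    """purchases maps a customer id to the set of product names they bought."""
    mine = purchases[customer]
    counts = Counter()
    for other, products in purchases.items():
        # Only consider other customers who share at least one product.
        if other == customer or not (products & mine):
            continue
        # Count products the target customer has not bought yet.
        for product in products - mine:
            counts[product] += 1
    return [name for name, _ in counts.most_common(limit)]


purchases = {
    "ALFKI": {"tea", "coffee"},
    "CUST2": {"tea", "milk", "sugar"},
    "CUST3": {"coffee", "milk"},
    "CUST4": {"bread"},
}
print(recommend(purchases, "ALFKI"))
```

This works only while the data fits in application memory, which is the limitation the graph databases above are meant to remove.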

CONCLUSION
If one plans on writing recommendation
queries, a graph DB is a more elegant fit than a
relational DB.
Only use Titan if you need it
 You have an insanely large graph
 Or you expect an insanely high load
Neo4J
 Faster queries
 A more straightforward and powerful query
language
