How to build a
scalable graph database
Bryn Cooke
The smart way
In this talk
1. What does it take to build a graph database?
2. Why shouldn’t you do this at home.
3. What do you use this for?
Graph family tree
Graph database recipe
1. Model
2. Language
3. Storage
Model
bob
since: 2001
steph
bob:
knows
:steph
age: 30age: 34
knows known
Property Graph RDF
Language
g.V().has('name', 'marko').out('knows').values('name')
Storage
The adjacency list
Vertex Adjacent to
A B, D, E
B
C B
D C
E D, F
F
A
B
C
E
D
F
//TODO
• Storage
• Indexing
• Commit log
• Drivers
• Caching
• Schema
• Metrics
• Backup/Restore
• Logging
• Security
• Testing
• Support
• Failover
• QoS
• Paging
• Partitioning
• Sorting
• Compaction
• Repair
• Community
• Bux fixing
• Optimisation
Storage - Cassandra
• Fast
• Distributed
• Scalable
• Reliable
• 11 years of development
• 54 committers (listed on apache)
• 274 contributors (listed on github)
The adjacency list (in Cassandra)
Here's what you could do
C*
C*
C*
C*C*
My Graph
Database
Client
Client
Client
Client
Client
Here's what you could do
C*
C*
C*
C*C*
My Graph
Database
Here's what you should do
C*
C*
C*
C*C*
DS Graph
Client
Client
Client
Client
Deep integration with DataStax Enterprise
DataStax Enterprise
• DataStax Enterprise scalability > Cassandra scalability.
• Analytics integration.
• Search integration.
• Thread optimisation.
• Continuous paging.
• Prefetching.
• First class schema integration.
Today’s Graph Database Market
Graph
Problems > Graph
Databases
Typical customer 360 queries
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
• Find me Jenny.
• Find me all people
with similar names
to 'Jenny'.
• Tell there are
duplicate Jennys.
• Find how Jenny
and John are
connected.
• Find how
influential Jenny is
in my application.
Find me Jenny
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
How Complex?
• Simple
How Fast?
• Machine
What?
• CQL
Why?
• Single partition
lookup
• Single iteration
Find me all people with similar names to 'Jenny'
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
How Complex?
• Medium
How Fast?
• Human Fast
What?
• Search
• Graph
Why?
• Single index
lookup
• Single iteration
Tell there are duplicate Jennys
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
How Complex?
• Medium
How Fast?
• Offline
What?
• Analytics
• Graph
Why?
• Aggregation
• Multiple Iteration
Find how Jenny and John are connected
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
How Complex?
• Complex
How Fast?
• Machine
What?
• Graph
Why?
• Multiple partition
lookup
• Multiple iteration
Find how influential Jenny is in my application
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
How Complex?
• Complex
How Fast?
• Offline
What?
• Spark Analytics
• Graph via PageRank
Why?
• Full scan
• Unknown iterations
Typical customer 360 queries
Offline
fast
Human
fast
Machine
fast
Analytics
CQL
Search
Responsetime
Simple Complex
No go zone
DSE
• Find me Jenny.
• Find me all people
with similar names
to 'Jenny'.
• Tell there are
duplicate Jennys.
• Find how Jenny
and John are
connected.
• Find how
influential Jenny is
in my application.
Summary
1. What it takes to create a graph database
a. Model
b. Language
c. Storage
2. How you can leverage an existing storage engine, and why Cassandra is a
great choice.
3. Solving graph problems requires more than just the basics. Search and
Analytics are essential tools, especially graph database.
Don't try this at home
Do not try replicate 100 person years of
dev effort creating your own storage
engine.
Creating a graph database that scales is
tough enough.
Try it now
https://downloads.datastax.com/#labs
Labs
Thank You

Graph in Apache Cassandra. The World’s Most Scalable Graph Database

  • 1.
    How to builda scalable graph database Bryn Cooke The smart way
  • 2.
    In this talk 1.What does it take to build a graph database? 2. Why shouldn’t you do this at home. 3. What do you use this for?
  • 3.
  • 4.
    Graph database recipe 1.Model 2. Language 3. Storage
  • 5.
  • 6.
  • 7.
  • 8.
    The adjacency list VertexAdjacent to A B, D, E B C B D C E D, F F A B C E D F
  • 9.
    //TODO • Storage • Indexing •Commit log • Drivers • Caching • Schema • Metrics • Backup/Restore • Logging • Security • Testing • Support • Failover • QoS • Paging • Partitioning • Sorting • Compaction • Repair • Community • Bux fixing • Optimisation
  • 10.
    Storage - Cassandra •Fast • Distributed • Scalable • Reliable • 11 years of development • 54 committers (listed on apache) • 274 contributors (listed on github)
  • 11.
    The adjacency list(in Cassandra)
  • 12.
    Here's what youcould do C* C* C* C*C* My Graph Database Client Client Client Client
  • 13.
    Client Here's what youcould do C* C* C* C*C* My Graph Database
  • 14.
    Here's what youshould do C* C* C* C*C* DS Graph Client Client Client Client
  • 15.
    Deep integration withDataStax Enterprise DataStax Enterprise • DataStax Enterprise scalability > Cassandra scalability. • Analytics integration. • Search integration. • Thread optimisation. • Continuous paging. • Prefetching. • First class schema integration.
  • 16.
    Today’s Graph DatabaseMarket Graph Problems > Graph Databases
  • 17.
    Typical customer 360queries Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE • Find me Jenny. • Find me all people with similar names to 'Jenny'. • Tell there are duplicate Jennys. • Find how Jenny and John are connected. • Find how influential Jenny is in my application.
  • 18.
    Find me Jenny Offline fast Human fast Machine fast Analytics CQL Search Responsetime SimpleComplex No go zone DSE How Complex? • Simple How Fast? • Machine What? • CQL Why? • Single partition lookup • Single iteration
  • 19.
    Find me allpeople with similar names to 'Jenny' Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE How Complex? • Medium How Fast? • Human Fast What? • Search • Graph Why? • Single index lookup • Single iteration
  • 20.
    Tell there areduplicate Jennys Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE How Complex? • Medium How Fast? • Offline What? • Analytics • Graph Why? • Aggregation • Multiple Iteration
  • 21.
    Find how Jennyand John are connected Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE How Complex? • Complex How Fast? • Machine What? • Graph Why? • Multiple partition lookup • Multiple iteration
  • 22.
    Find how influentialJenny is in my application Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE How Complex? • Complex How Fast? • Offline What? • Spark Analytics • Graph via PageRank Why? • Full scan • Unknown iterations
  • 23.
    Typical customer 360queries Offline fast Human fast Machine fast Analytics CQL Search Responsetime Simple Complex No go zone DSE • Find me Jenny. • Find me all people with similar names to 'Jenny'. • Tell there are duplicate Jennys. • Find how Jenny and John are connected. • Find how influential Jenny is in my application.
  • 24.
    Summary 1. What ittakes to create a graph database a. Model b. Language c. Storage 2. How you can leverage an existing storage engine, and why Cassandra is a great choice. 3. Solving graph problems requires more than just the basics. Search and Analytics are essential tools, especially graph database.
  • 25.
    Don't try thisat home Do not try replicate 100 person years of dev effort creating your own storage engine. Creating a graph database that scales is tough enough.
  • 26.
  • 27.