A case-for-graph-db

A case for
Graph Database?
dhaval.dalal@software-artisan.com
!
@softwareartisan
11th Sept 2014

Context
Direct and Cross-Functional reporting represents a network even for
a simple organisation.
What about modelling a group?

Apiary Functionality
Structural Operations
Mine Organisational
Data
!
• Expand/Collapse levels
• View lineage
• Summary Data at all levels
• CRUD on all data (nodes/
relationships)
• Link/De-link Sub-graphs or
nodes
• Evolving attributes of nodes
and relationships based
• Adding new nodes and
relationships
!
• Affinities Graph - Who talks to
whom the most
• Discover Skills Communities
• Detecting overlap using SLPA
(Speaker-listener Label
Propogation)

Convince ourself first!
• Anyone should ask - “What makes it a case for
Graph DB? and Can you prove it?”
• Its basically a de-risking act.
• Two major aspects that we looked at
• Flexibility in schema evolution
• Performance

What to compare against?
• RDBMS’ are a natural choice to be compared against.
• MongoDB, though a NoSQL document store
• good for storing DDD style aggregates.
• not for inter-connected data.
• We picked Neo4j
• But remember, this is not a battle, we are just trying to
find out when you should use what!

Flexibility in Evolution
• Entity Diversity
• Different kinds of nodes

• Connection Diversity
• links could have different weights, directions.

• Connection Diversity
• links could have different weights, directions.
• Evolution of Entities and Links, themselves over time.
• Varietal data needs
• Is every node/link structured regularly or irregularly,
connected or disconnected nodes etc…

Minimal set of functionality
Analysis Model (Phase 1)
Analysis Model (Phase 1)
Neo4J Domain Model
1) Neo4J Domain Model
Node Properties
Person name
Node Properties
Person name
type
level
type
level
Relationships Properties
DIRECTLY_MANAGES N/A
Note: For the purpose of establishing the case, we have modeled minimal relationships and
not all the relationships that would have been in the final application. Below is a list of
relationships that are yet pending to be modeled, but are not relevant for the purposes of
taking performance measurements.
Relationships Properties
DIRECTLY_MANAGES Relationships N/A
Properties
2) SQL Domain Model
Queries
Above screen-flow and modeling for organizations and groups use cases requires us to run

Measured performance of 3
1) Gather Subordinates names till a visibility level from a current level
CURRENT
LEVEL
queries
We have varied total hierarchy levels in the organization Aggregate
from 3 to 8 for people. Optimized generic Cypher query is:
Data
start n = node:Person(name = "fName lName")
match p = n-[:DIRECTLY_MANAGES*1..visibilityLevel]->m
return nodes(p)
Where visibility level is a number that indicates the number of levels to show.
For SQL we have to recursively add joins for each level, a generic SQL can SELECT manager.pid AS Bigboss,
manager.directly_manages AS Subordinate,
L1Reportees.directly_manages AS Reportee1,
...
FROM person_reportee manager
LEFT JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
ON L1Reportees.directly_manages = L2Reportees.pid
...
...
Current
Level
WHERE manager.pid = (SELECT id
FROM person
WHERE name = "fName lName")
Names
• Subordinate names from
current level until a visibility
level
• Aggregate Data from current
level until a visibility level
• Overall Aggregate Data for
Dashboard
• distribution of people at various
levels
2) Gather Subordinates Aggregate data from current level
We have varied total hierarchy levels in the
organization from 3 to 8 for different volumes
of people.
Further, the optimized Generic Cypher query is:
match n-[:DIRECTLY_MANAGES*0..(totalLevels -
n.level)]->m-[:DIRECTLY_MANAGES*1..(totalLevels
- n.level)]->o
where n.level + visibilityLevel >= m.level
return m.name as Subordinate, count(o) as Total
For SQL aggregate query, we not only have to recursively add joins but also perform inner
unions for each level till the last level to obtain the aggregate data for that level. Once we
obtain the data for a particular level (per person), we perform outer unions to get the final
result for all the levels. This results in a very big SQL query.
Here is a sample query that returns aggregate data for 3 levels below the current level. Say,

Apple-to-Apple comparison
• MySQL, MS-SQL and Neo4j
• We did not use Traversal API (though faster), just as
Cypher is to Neo4j as SQL is to RDBMS’
• For longer term, Neo4j intends to further Cypher Query planning
and optimisation.
• Indexing
• Enabled on join columns for MySQL and MS-SQL DBs
• For Neo4j, Person names were indexed

Environment Consistency
• Same machine for all databases
• MySQL v5.6.12, MS-SQL Server 2008 R2, Neo4j v1.9
(Advanced)
• DB and tools on the same machine
• Avoid network transport times being factored in.
• Out-of-box settings for all DBs apart from giving 3
GB to the java process that ran Neo4j in Embedded
mode.

Functional Equivalence
• Consistent data distribution.
• 8 Levels in org with people at each level managing the next
for all DBs.
• Functionally equivalent queries.
• Measurements for worst possible queries scenario
being executed by the application
• Say if the top-boss logs in and wants to see all the levels
(max. visibility level), the query will take the most time.

Measurement Tools
• MS-SQL
• Query Profiler
• MySQL
• we noted the duration (excluding fetch time) from MySQL
workbench
• Neo4j
• Executed Parameteric queries programatically in Embedded Mode.
• Did not use Neo4j shell for measurements as its intended to be an
Ops tool (and not a transactional tool).

1) Gather Subordinates names till a visibility level from a current level
return nodes(p)
Where visibility level is a number that indicates the number of Gather Subordinate Names
return nodes(p)
Where visibility level is a number that indicates the number of For SQL we have to recursively add joins for each level, a generic SELECT manager.pid AS Bigboss,
...
ON L1Reportees.directly_manages = L2Reportees.pid
Query
1) We have varied total hierarchy ...
...
levels in the organization from people. Optimized generic WHERE manager.Cypher pid = (SELECT query id
FROM person
is:
WHERE name = "fName lName")
Visibility
Database Queries
Level
2
Neo4j start n = node:Person(name = "fName match n-[:DIRECTLY_MANAGES*1..2]->m
return nodes(p)
SELECT manager.pid AS Bigboss, manager.Subordinate, L1Reportees.directly_manages MySQL/
MSSQL
Gather Subordinates names till a visibility level from a current level
CURRENT
LEVEL
Names
We have varied total hierarchy levels in the organization from 3 to 8 for different volumes of
people. Optimized generic Cypher query is:
return nodes(p)
Where visibility level is a number that indicates the number of levels to show.
For SQL we have to recursively add joins for each level, a generic SQL can be written as:
SELECT manager.pid AS Bigboss,
...

5 5 124 16 5616 5460 125 31 1377 757 1830 913 194 6 109 0 62 15 47 0 212 43 138 13 59 0 813 166 2848 763 1533 221
6 125 0 156 62 63 0 118 78 278 72 26 0 699 171 3607 1124 1387 319
7 109 16 94 47 62 0 151 90 420 71 13 0 728 177 3123 929 1457 357
8 110 0 141 31 78 0 231 46 621 84 18 0 721 164 2879 1040 906 207
109 0 9734 9625 140 31 1006 811 3002 1274 159 7 93 0 16271 16100 156 31 1125 784 2947 1566 158 8 Gather Subordinate Names
1000 People - Gather Subordinate Names
200
150
100
50
0
164
178
166 171 177
164
3 4 5 6 7 8
Query Execution Time (ms)
Levels
Volume 1M People Org
109 0 25334 25287 125 32 1285 951 4256 2171 510 100K People - Gather Subordinate Names
0
1000 People - Gather Subordinate Aggregate
1500
1125
750
MySQL - LEFT Join MySQL - INNER Neo4j
375
Cold Warm Cold Warm Cold 3 760
1000
750
500
851
763
1124
929
1040
250
166 167 167 175 177 183
3 4 5 6 7 8
Levels
10K People - Gather Subordinate Names
400
300
200
221
100
0
-100
1M People - Gather Subordinate Names
1000 People - Overall Aggregate
269
185 183
171
221
10K People - Gather Subordinate Aggregate
3000
2250
319
9000
6750
357
4500
207
1500
750
2250
898 952
3 4 5 6 7 8
Levels
Warm Cache Plots
14
Levels
MySQL MSSQL Neo4j
Subordinate Names query for Warm 1M, Cache 2M Plots
Level 8, and
Volume 10K People Org
MySQL MS-SQL Neo4j
Lvls Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm
3 109 0 140 31 109 15 172 107 294 181 103 1 730 176 6129 898 2292 613
4 141 16 343 234 109 0 229 116 1267 156 102 1 776 178 5865 952 2073 455
5 94 15 640 561 94 15 270 164 403 188 327 1 727 184 5718 1194 2119 328
6 94 0 998 905 78 0 401 123 650 254 38 3 742 221 5692 1458 3182 453
7 93 0 1482 1326 78 0 303 177 870 374 68 1 705 185 5710 1188 1990 456
8 93 0 2745 2683 78 0 387 163 1146 573 71 1 713 171 6002 1271 1976 36
300
225
150
75
0
176 178 184
3 4 5 6 7 8
Levels
0
1194
1458
600
450
300
1M People Query 100K People - Overall Aggregate
1188 1271
150
176 182 184 173 177 153
3 4 5 6 7 8
Levels
10K People - Overall Aggregate
3 4 5 6 7 8
Levels
100K People 30000
22500
15000
7500
Query 0
3 4404 Execution Time (ms)
0
594
300000
225000
566 497
3 4 5 6 7 573 592
150000
75000
Levels
Warm Cache Plots
MySQL MS-SQL Lvls Gather
Subordin
ate
Names
Gather Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Cold War
m
Cold Warm Cold War
m
109 16 10998 10171 452 281 6337 4973 22703 20037 798 4 172 0 20467 19734 515 280 5677 5280 14888 7813 1245 5 172 16 60606 60653 531 281 7111 5959 12047 8293 1211 6 141 0 92789 92867 483 265 7278 6266 15226 7076 1251 7 218 0 160151 158793 499 280 11890 6724 15226 7076 1330 8 265 0 281301 280630 577 265 11300 8777 20560 14564 976 0
-2250
3 4 5 6 7 8
Levels
0
3 44265
Execution Time (ms)
1M People - Overall Aggregate
execution time for MySQL. Below are the results of
application? Say, if the situation demanded that we change
would find ourselves in a bad position performance
that Neo4j’s performance is almost constant time
factor-in INDIRECTLY_MANAGES relationship,
translate to additional join for SQL, thus bloating
increasing complexity. This will further degrade the
take more time.
1M-3 1M-5 1M-7 2M-8
-17500
0
17500
35000
52500
70000
LEFT Vs INNER Vs Neo4j
Org Size-Levels
Warm Cache Plots

queries using inner join on 1M (all levels), 2M and 3M (for level 8), we
increase in query execution time for MySQL. Below are the results of
MySQL - Left Vs Inner Join
from the Gather Subordinate Names query for 1M, 2M Level 8, and
70000
52500
35000
17500
0
-17500
Warm Cache Plots
1M-3 1M-5 1M-7 2M-8
Org Size-Levels
Neo4j
JOIN
warm cold warm
16 718 176
0 740 182
874 721 184
312 709 173
3432 700 177
15896 835 153
36301 822 149
61776 744 148
us or any application? Say, if the situation demanded that we change

Performance
1. Cartesian Product
All possible combinations

Performance
2. Filter

Performance
Data
Size
2. Filter
DS

Performance
Data
Size
2. Filter
DS
Time

Performance
Joins
Data
Size
2. Filter
| J
DS
Time

Performance
Joins
Data
Size
2. Filter
Project
People
Skills
| J
DS
Time

Performance
Joins
Data
Size
2. Filter
Project
People
Skills
DS
Time
Query is now localised to a
section of graph,
solves large data-size problem
| J

Performance
Joins
Data
Size
2. Filter
Project
People
Skills
DS
Time
Traversal Query is now localised to a
section of graph,
| J

Performance
Joins
Data
Size
2. Filter
Project
People
Skills
Traversal Query is now localised to a
DS
Time Time
section of graph,
| J

Relationships
• Not first-class citizens in RDBMS’ and NoSQL aggregate stores
• document stores or key-value stores.
• Cost of executing connected queries is high.
• “Who reports immediately to Smarty Pants?” - not a problem
• But, “Who all report to Smarty Pants?” - introduces recursive joins and
going down more than 5-6 levels the space and time complexity is very
high.
• “What skills does this person have?“ is relatively cheaper than “which
people have these skills?”
• what about - “which are people who have these skills also have those
skills?”

Gather Subordinates
Aggregate Data
Volume 10K People Org
MySQL MS-SQL Neo4j
Gather
Subordinate
Aggregate
People - Gather Subordinate Names
Subordinate Names query for 1M, 2M Level 8, and
178
Levels
14
166 171 177
MySQL MSSQL Neo4j
Warm Cache Plots
Volume 1M People Org
MySQL MS-SQL Neo4j
30000
4404 4697 5516 6145 6915 8304
1M People - Gather Subordinate Aggregate
73121
take more time.
Query Execution Time (ms) Levels
166 167 167 175 177 183
600
450
300
150
566 594
-17500
1000 People - Gather Subordinate Aggregate
1040
1000
750
500
250
1188 1271
0
1124
9000
6750
4500
1458
2250
0
17500
851
763
207
898 952
1194
35000
1500
1125
750
760
People - Gather Subordinate Names
3000
Query Execution Time (ms) Levels
2250
319
357
1500
750
52500
164
221
185 171
70000
1M-3 1M-5 1M-7 2M-8
929
176 182 184 173 177 153
Org Size-Levels
Warm Cache Plots
Subordinate
Aggregate
Aggregate
Subordinate
Names
Subordinate
Aggregate
Aggregate
Subordinate
Names
Subordinate
Aggregate
Aggregate
Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm
109 0 93 0 226 0 49 0 6 0 679 164 3329 760 1538 269
94 16 78 16 114 82 133 61 35 0 670 178 2878 851 892 183
62 15 47 0 212 43 138 13 59 0 813 166 2848 763 1533 221
156 62 63 0 118 78 278 72 26 0 699 171 3607 1124 1387 319
94 47 62 0 151 90 420 71 13 0 728 177 3123 929 1457 357
141 31 78 0 231 46 621 84 18 0 721 164 2879 1040 906 207
4 5 6 7 8
Levels
375
3 4 5 6 7 8
Levels
400
300
200
221
100
0
-100
269
183
3 4 5 6 7 8
Warm Cache Plots
MySQL MSSQL Neo4j
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm
140 31 109 15 172 107 294 181 103 1 730 176 6129 898 2292 613
343 234 109 0 229 116 1267 156 102 1 776 178 5865 952 2073 455
640 561 94 15 270 164 403 188 327 1 727 184 5718 1194 2119 328
998 905 78 0 401 123 650 254 38 3 742 221 5692 1458 3182 453
1482 1326 78 0 303 177 870 374 68 1 705 185 5710 1188 1990 456
2745 2683 78 0 387 163 1146 573 71 1 713 171 6002 1271 1976 36
178 184
5 6 7 8
Levels
0
3 4 5 6 7 8
Levels
700
Subordinate
Names
Subordinate
Aggregate
Aggregate
Subordinate
Names
Subordinate
Aggregate
Aggregate
Subordinate
Names
Subordinate
Aggregate
Aggregate
3 124 0 1107 998 140 31 617 537 1201 608 168 12 736 166 13340 4404 5157 566
4 109 0 2231 2106 140 31 1224 650 1263 667 157 19 812 167 14686 4697 5396 594
5 124 16 5616 5460 125 31 1377 757 1830 913 194 12 713 167 15351 5516 5118 497
6 109 0 9734 9625 140 31 1006 811 3002 1274 159 12 756 175 17056 6145 4736 573
7 93 0 16271 16100 156 31 1125 784 2947 1566 158 14 700 177 17821 6915 4592 592
8 109 0 25334 25287 125 32 1285 951 4256 2171 510 14 691 183 19881 8304 4731 557
0
3 4 5 6 7 8
22500
15000
7500
0
3 4 5 6 7 8
Levels
0
497
300000
225000
573 592
557
150000
75000
106260
44265
3 4 5 6 7 8
Levels
Warm Cache Plots
Lvls Gather
Subordin
ate
Names
Gather Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinate
Names
Gather
Subordinate
Aggregate
Overall
Aggregate
Gather
Subordinat
e Names
Gather Subordinate
Aggregate
Overall
Aggregate
Cold War
m
Cold Warm Cold War
m
Cold Warm Cold Warm Cold War
m
Cold War
m
Cold Warm Cold War
m
3 109 16 10998 10171 452 281 6337 4973 22703 20037 798 116 718 176 109432 44265 37190 2109
4 172 0 20467 19734 515 280 5677 5280 14888 7813 1245 114 740 182 115011 106260 36033 2194
5 172 16 60606 60653 531 281 7111 5959 12047 8293 1211 140 721 184 169123 77528 34068 2248
6 141 0 92789 92867 483 265 7278 6266 15226 7076 1251 140 709 173 255813 124810 33015 2107
7 218 0 160151 158793 499 280 11890 6724 15226 7076 1330 111 700 177 171607 125152 32813 1960
8 265 0 281301 280630 577 265 11300 8777 20560 14564 976 121 835 153 136432 73121 23673 1912
-2250
3 4 5 6 7 8
Levels
0
124810125152
77528
3 4 5 6 7 8
Levels
3000

Gather Overall Aggregate
Query
start n = node(*) !
return n.level as Level, count(n) as Total!
order by Level
SELECT level, count(id)!
FROM person!
GROUP BY level

164
178
166 171 177
164
1125
750
Query Execution Time Levels
500
1124
3 109 0 140 31 109 15 172 107 294 750
181 103 1 730 176 6129 929
898 2292 1040
613
4 141 16 343 234 109 0 229 116 1267 156 102 1 776 178 5865 952 2073 455
5 94 15 640 561 94 15 270 164 403 188 327 1 727 184 5718 1194 2119 328
6 94 0 998 905 78 0 401 123 650 254 38 3 742 221 5692 1458 3182 453
7 93 0 1482 1326 78 0 303 177 870 374 68 1 705 185 5710 1188 1990 456
8 93 0 2745 2683 78 0 387 163 1146 573 71 1 713 171 6002 1271 1976 36
22500
Query Execution Time Levels
15000
7500
Gather Overall 851
763
250
Aggregate
166 167 167 175 177 183
Data
150
100
50
0
3 4 5 6 7 8
Levels
375
760
3 4 5 6 7 8
Levels
400
300
200
221
100
0
-100
269
185 183
171
221
3000
2250
319
357
207
1500
750
898 952
3 4 5 6 7 8
Levels
600
450
300
150
176 182 184 173 177 153
3000
14
Warm Cache Plots
MySQL MSSQL Neo4j
14
Levels
MySQL MSSQL Neo4j
Subordinate Names query for 1M, 2M Level 8, and
Warm Cache Plots
566 594
4404 4697 5516 6145 6915 8304
1M People - Gather Subordinate Aggregate
300000
573 592
225000
557
106260
124810125152
77528
2109 2194 2248 2107 1960 1912
take more time.
9000
6750
4500
1194
1458
2250
0
-17500
453 456
0
328
17500
455
35000
700
525
350
175
0
613
52500
176 178 184
70000
1M-3 1M-5 1M-7 2M-8
36
1188 1271
Org Size-Levels
300
225
150
75
Warm Cache Plots
0
3 4 5 6 7 8
Levels
0
3 4 5 6 7 8
Levels
-175
3 4 5 6 7 8
Levels
Warm Cache Plots
MySQL MSSQL Neo4j
0
3 4 5 6 7 8
0
3 4 5 6 7 8
0
497
150000
75000
44265
3 4 5 6 7 8
Levels
16
Warm Cache Plots
MySQL MSSQL Neo4J
Names
Cold War
m
Cold Warm Cold War
m
Cold Warm Cold Warm Cold War
m
Cold War
m
Cold Warm Cold War
m
3 109 16 10998 10171 452 281 6337 4973 22703 20037 798 116 718 176 109432 44265 37190 2109
4 172 0 20467 19734 515 280 5677 5280 14888 7813 1245 114 740 182 115011 106260 36033 2194
5 172 16 60606 60653 531 281 7111 5959 12047 8293 1211 140 721 184 169123 77528 34068 2248
6 141 0 92789 92867 483 265 7278 6266 15226 7076 1251 140 709 173 255813 124810 33015 2107
7 218 0 160151 158793 499 280 11890 6724 15226 7076 1330 111 700 177 171607 125152 32813 1960
8 265 0 281301 280630 577 265 11300 8777 20560 14564 976 121 835 153 136432 73121 23673 1912
Warm Cache Plots
-2250
3 4 5 6 7 8
Levels
0
73121
3 4 5 6 7 8
Levels
2250
1500
750
0
3 4 5 6 7 8
Levels
17
MySQL MSSQL Neo4J

But, what if...
• I need to aggregate data, placing an OLAPish demand on
the data?
• Use non-graph stores: RDBMS or NoSQL alongside.
• Graph Compute Engines are optimised for scanning
and processing large amounts of information in batch.
• Giraph, Pegasus etc…
• Polyglot persistence is a norm.
• Has anyone tried Datomic?

Asking Questions
• Is your data connected?
• Or does your domain naturally gravitate towards or has
good number of joins (Dense Connections) leading to
explosion with large data?
• For example: Making Recommendations
• Or are you finding yourself writing complex SQL
Queries?
• For example: recursive joins or lots of joins

References
• Graph Databases
• Jim Webber, Ian Robinson and Emil Eifrem
• Apiary: A case for Neo4j?
• Anuj Mathur and Dhaval Dalal
• Code and scaffolding programs that we used are available
on - https://github.com/ EqualExperts/Apiary-Neo4j-
RDBMS-Comparison

A case-for-graph-db

Recommended

Recommended

More Related Content

What's hot

What's hot (15)

Viewers also liked

Viewers also liked (20)

Similar to A case-for-graph-db

Similar to A case-for-graph-db (20)

More from Dhaval Dalal

More from Dhaval Dalal (20)

Recently uploaded

Recently uploaded (20)

A case-for-graph-db