These days we hear a lot about NoSQL and particularly graph database. But before jumping straight into development using any graph database, we should ask the following questions - 'what makes it a case for GraphDB? And can you prove it?' Basically de-risking and making a case for management buy in. Further, its important to convince ourselves.
This talk highlights some cases where GraphDB will be useful. Followed by some insights from a comparison we did between Neo4j and MySQL/MSSQL. By the end of this session, you'll understand the advantages of using a GraphDB and also what questions to ask before selecting a GraphDB.
This was presented at TechJam on 11th Sept 2014
Designing IA for AI - Information Architecture Conference 2024
A case-for-graph-db
1. A case for
Graph Database?
dhaval.dalal@software-artisan.com
!
@softwareartisan
11th Sept 2014
2. Context
Direct and Cross-Functional reporting represents a network even for
a simple organisation.
What about modelling a group?
3. Apiary Functionality
Structural Operations
Mine Organisational
Data
!
• Expand/Collapse levels
• View lineage
• Summary Data at all levels
• CRUD on all data (nodes/
relationships)
• Link/De-link Sub-graphs or
nodes
• Evolving attributes of nodes
and relationships based
• Adding new nodes and
relationships
!
• Affinities Graph - Who talks to
whom the most
• Discover Skills Communities
• Detecting overlap using SLPA
(Speaker-listener Label
Propogation)
4. Convince ourself first!
• Anyone should ask - “What makes it a case for
Graph DB? and Can you prove it?”
• Its basically a de-risking act.
• Two major aspects that we looked at
• Flexibility in schema evolution
• Performance
5. What to compare against?
• RDBMS’ are a natural choice to be compared against.
• MongoDB, though a NoSQL document store
• good for storing DDD style aggregates.
• not for inter-connected data.
• We picked Neo4j
• But remember, this is not a battle, we are just trying to
find out when you should use what!
9. Flexibility in Evolution
• Entity Diversity
• Different kinds of nodes
• Connection Diversity
• links could have different weights, directions.
10. Flexibility in Evolution
• Entity Diversity
• Different kinds of nodes
• Connection Diversity
• links could have different weights, directions.
• Evolution of Entities and Links, themselves over time.
• Varietal data needs
• Is every node/link structured regularly or irregularly,
connected or disconnected nodes etc…
11. Minimal set of functionality
Analysis Model (Phase 1)
Analysis Model (Phase 1)
Neo4J Domain Model
1) Neo4J Domain Model
Node Properties
Person name
Node Properties
Person name
type
level
type
level
Relationships Properties
DIRECTLY_MANAGES N/A
Note: For the purpose of establishing the case, we have modeled minimal relationships and
not all the relationships that would have been in the final application. Below is a list of
relationships that are yet pending to be modeled, but are not relevant for the purposes of
taking performance measurements.
Relationships Properties
DIRECTLY_MANAGES Relationships N/A
Properties
2) SQL Domain Model
Queries
Above screen-flow and modeling for organizations and groups use cases requires us to run
12. Measured performance of 3
1) Gather Subordinates names till a visibility level from a current level
CURRENT
LEVEL
queries
We have varied total hierarchy levels in the organization Aggregate
from 3 to 8 for people. Optimized generic Cypher query is:
Data
start n = node:Person(name = "fName lName")
match p = n-[:DIRECTLY_MANAGES*1..visibilityLevel]->m
return nodes(p)
Where visibility level is a number that indicates the number of levels to show.
For SQL we have to recursively add joins for each level, a generic SQL can SELECT manager.pid AS Bigboss,
manager.directly_manages AS Subordinate,
L1Reportees.directly_manages AS Reportee1,
L2Reportees.directly_manages AS Reportee2,
...
FROM person_reportee manager
LEFT JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
LEFT JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
...
...
Current
Level
WHERE manager.pid = (SELECT id
FROM person
WHERE name = "fName lName")
Names
• Subordinate names from
current level until a visibility
level
• Aggregate Data from current
level until a visibility level
• Overall Aggregate Data for
Dashboard
• distribution of people at various
levels
2) Gather Subordinates Aggregate data from current level
We have varied total hierarchy levels in the
organization from 3 to 8 for different volumes
of people.
Further, the optimized Generic Cypher query is:
start n = node:Person(name = "fName lName")
match n-[:DIRECTLY_MANAGES*0..(totalLevels -
n.level)]->m-[:DIRECTLY_MANAGES*1..(totalLevels
- n.level)]->o
where n.level + visibilityLevel >= m.level
return m.name as Subordinate, count(o) as Total
For SQL aggregate query, we not only have to recursively add joins but also perform inner
unions for each level till the last level to obtain the aggregate data for that level. Once we
obtain the data for a particular level (per person), we perform outer unions to get the final
result for all the levels. This results in a very big SQL query.
Here is a sample query that returns aggregate data for 3 levels below the current level. Say,
13. Apple-to-Apple comparison
• MySQL, MS-SQL and Neo4j
• We did not use Traversal API (though faster), just as
Cypher is to Neo4j as SQL is to RDBMS’
• For longer term, Neo4j intends to further Cypher Query planning
and optimisation.
• Indexing
• Enabled on join columns for MySQL and MS-SQL DBs
• For Neo4j, Person names were indexed
14. Environment Consistency
• Same machine for all databases
• MySQL v5.6.12, MS-SQL Server 2008 R2, Neo4j v1.9
(Advanced)
• DB and tools on the same machine
• Avoid network transport times being factored in.
• Out-of-box settings for all DBs apart from giving 3
GB to the java process that ran Neo4j in Embedded
mode.
15. Functional Equivalence
• Consistent data distribution.
• 8 Levels in org with people at each level managing the next
for all DBs.
• Functionally equivalent queries.
• Measurements for worst possible queries scenario
being executed by the application
• Say if the top-boss logs in and wants to see all the levels
(max. visibility level), the query will take the most time.
16. Measurement Tools
• MS-SQL
• Query Profiler
• MySQL
• we noted the duration (excluding fetch time) from MySQL
workbench
• Neo4j
• Executed Parameteric queries programatically in Embedded Mode.
• Did not use Neo4j shell for measurements as its intended to be an
Ops tool (and not a transactional tool).
17. 1) Gather Subordinates names till a visibility level from a current level
start n = node:Person(name = "fName lName")
start n = node:Person(name = "fName lName")
match p = n-[:DIRECTLY_MANAGES*1..visibilityLevel]->m
return nodes(p)
Where visibility level is a number that indicates the number of Gather Subordinate Names
match p = n-[:DIRECTLY_MANAGES*1..visibilityLevel]->m
return nodes(p)
Where visibility level is a number that indicates the number of For SQL we have to recursively add joins for each level, a generic SELECT manager.pid AS Bigboss,
manager.directly_manages AS Subordinate,
L1Reportees.directly_manages AS Reportee1,
L2Reportees.directly_manages AS Reportee2,
...
FROM person_reportee manager
LEFT JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
LEFT JOIN person_reportee L2Reportees
ON L1Reportees.directly_manages = L2Reportees.pid
Query
1) We have varied total hierarchy ...
...
levels in the organization from people. Optimized generic WHERE manager.Cypher pid = (SELECT query id
FROM person
is:
WHERE name = "fName lName")
Visibility
Database Queries
Level
2
Neo4j start n = node:Person(name = "fName match n-[:DIRECTLY_MANAGES*1..2]->m
return nodes(p)
SELECT manager.pid AS Bigboss, manager.Subordinate, L1Reportees.directly_manages MySQL/
MSSQL
Gather Subordinates names till a visibility level from a current level
CURRENT
LEVEL
Names
We have varied total hierarchy levels in the organization from 3 to 8 for different volumes of
people. Optimized generic Cypher query is:
start n = node:Person(name = "fName lName")
match p = n-[:DIRECTLY_MANAGES*1..visibilityLevel]->m
return nodes(p)
Where visibility level is a number that indicates the number of levels to show.
For SQL we have to recursively add joins for each level, a generic SQL can be written as:
SELECT manager.pid AS Bigboss,
manager.directly_manages AS Subordinate,
L1Reportees.directly_manages AS Reportee1,
L2Reportees.directly_manages AS Reportee2,
...
FROM person_reportee manager
LEFT JOIN person_reportee L1Reportees
ON manager.directly_manages = L1Reportees.pid
LEFT JOIN person_reportee L2Reportees
19. queries using inner join on 1M (all levels), 2M and 3M (for level 8), we
increase in query execution time for MySQL. Below are the results of
MySQL - Left Vs Inner Join
from the Gather Subordinate Names query for 1M, 2M Level 8, and
70000
52500
35000
17500
0
-17500
Warm Cache Plots
LEFT Vs INNER Vs Neo4j
1M-3 1M-5 1M-7 2M-8
Query Execution Time (ms)
Org Size-Levels
MySQL - LEFT Join MySQL - INNER Neo4j
Neo4j
JOIN
warm cold warm
16 718 176
0 740 182
874 721 184
312 709 173
3432 700 177
15896 835 153
36301 822 149
61776 744 148
us or any application? Say, if the situation demanded that we change
27. Performance
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
DS
Time
28. Performance
Joins
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
| J
DS
Time
29. Performance
Joins
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
Project
People
Skills
| J
DS
Time
30. Performance
Joins
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
Project
People
Skills
DS
Time
Query is now localised to a
section of graph,
solves large data-size problem
| J
31. Performance
Joins
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
Project
People
Skills
DS
Time
Traversal Query is now localised to a
section of graph,
solves large data-size problem
| J
32. Performance
Joins
Data
Size
1. Cartesian Product
All possible combinations
2. Filter
Project
People
Skills
Traversal Query is now localised to a
DS
Time Time
section of graph,
solves large data-size problem
| J
33. Relationships
• Not first-class citizens in RDBMS’ and NoSQL aggregate stores
• document stores or key-value stores.
• Cost of executing connected queries is high.
• “Who reports immediately to Smarty Pants?” - not a problem
• But, “Who all report to Smarty Pants?” - introduces recursive joins and
going down more than 5-6 levels the space and time complexity is very
high.
• “What skills does this person have?“ is relatively cheaper than “which
people have these skills?”
• what about - “which are people who have these skills also have those
skills?”
35. Gather Overall Aggregate
Query
start n = node(*) !
return n.level as Level, count(n) as Total!
order by Level
SELECT level, count(id)!
FROM person!
GROUP BY level
36. 164
178
166 171 177
164
1125
750
Query Execution Time Levels
Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm Cold Warm
500
1124
3 109 0 140 31 109 15 172 107 294 750
181 103 1 730 176 6129 929
898 2292 1040
613
4 141 16 343 234 109 0 229 116 1267 156 102 1 776 178 5865 952 2073 455
5 94 15 640 561 94 15 270 164 403 188 327 1 727 184 5718 1194 2119 328
6 94 0 998 905 78 0 401 123 650 254 38 3 742 221 5692 1458 3182 453
7 93 0 1482 1326 78 0 303 177 870 374 68 1 705 185 5710 1188 1990 456
8 93 0 2745 2683 78 0 387 163 1146 573 71 1 713 171 6002 1271 1976 36
22500
Query Execution Time Levels
15000
7500
Gather Overall 851
763
250
Aggregate
166 167 167 175 177 183
Data
150
100
50
0
3 4 5 6 7 8
Query Execution Time (ms)
Levels
375
760
3 4 5 6 7 8
Query Execution Time (ms)
Levels
10K People - Gather Subordinate Names
400
300
200
221
100
0
-100
1000 People - Overall Aggregate
269
185 183
171
221
10K People - Gather Subordinate Aggregate
3000
2250
319
357
207
1500
750
898 952
3 4 5 6 7 8
Query Execution Time (ms)
Levels
1M People - Gather Subordinate Names
600
450
300
150
176 182 184 173 177 153
3000
14
Warm Cache Plots
10K People - Overall Aggregate
MySQL MSSQL Neo4j
14
Levels
MySQL MSSQL Neo4j
Subordinate Names query for 1M, 2M Level 8, and
execution time for MySQL. Below are the results of
Warm Cache Plots
100K People - Overall Aggregate
566 594
application? Say, if the situation demanded that we change
would find ourselves in a bad position performance
that Neo4j’s performance is almost constant time
4404 4697 5516 6145 6915 8304
1M People - Gather Subordinate Aggregate
300000
573 592
225000
factor-in INDIRECTLY_MANAGES relationship,
translate to additional join for SQL, thus bloating
557
increasing complexity. This will further degrade the
106260
124810125152
77528
1M People - Overall Aggregate
2109 2194 2248 2107 1960 1912
take more time.
9000
6750
4500
1194
1458
2250
0
-17500
453 456
0
328
17500
455
35000
700
525
350
175
0
613
52500
176 178 184
70000
1M-3 1M-5 1M-7 2M-8
LEFT Vs INNER Vs Neo4j
36
Query Execution Time (ms)
MySQL - LEFT Join MySQL - INNER Neo4j
1188 1271
Org Size-Levels
300
225
150
75
Warm Cache Plots
0
3 4 5 6 7 8
Query Execution Time (ms)
Levels
0
3 4 5 6 7 8
Query Execution Time (ms)
Levels
-175
3 4 5 6 7 8
Query Execution Time (ms)
Levels
Warm Cache Plots
MySQL MSSQL Neo4j
0
3 4 5 6 7 8
0
3 4 5 6 7 8
0
497
150000
75000
44265
3 4 5 6 7 8
Query Execution Time (ms)
Levels
16
Warm Cache Plots
MySQL MSSQL Neo4J
Names
Cold War
m
Cold Warm Cold War
m
Cold Warm Cold Warm Cold War
m
Cold War
m
Cold Warm Cold War
m
3 109 16 10998 10171 452 281 6337 4973 22703 20037 798 116 718 176 109432 44265 37190 2109
4 172 0 20467 19734 515 280 5677 5280 14888 7813 1245 114 740 182 115011 106260 36033 2194
5 172 16 60606 60653 531 281 7111 5959 12047 8293 1211 140 721 184 169123 77528 34068 2248
6 141 0 92789 92867 483 265 7278 6266 15226 7076 1251 140 709 173 255813 124810 33015 2107
7 218 0 160151 158793 499 280 11890 6724 15226 7076 1330 111 700 177 171607 125152 32813 1960
8 265 0 281301 280630 577 265 11300 8777 20560 14564 976 121 835 153 136432 73121 23673 1912
Warm Cache Plots
-2250
3 4 5 6 7 8
Query Execution Time (ms)
Levels
0
73121
3 4 5 6 7 8
Query Execution Time (ms)
Levels
2250
1500
750
0
3 4 5 6 7 8
Query Execution Time (ms)
Levels
17
MySQL MSSQL Neo4J
37. But, what if...
• I need to aggregate data, placing an OLAPish demand on
the data?
• Use non-graph stores: RDBMS or NoSQL alongside.
• Graph Compute Engines are optimised for scanning
and processing large amounts of information in batch.
• Giraph, Pegasus etc…
• Polyglot persistence is a norm.
• Has anyone tried Datomic?
38. Asking Questions
• Is your data connected?
• Or does your domain naturally gravitate towards or has
good number of joins (Dense Connections) leading to
explosion with large data?
• For example: Making Recommendations
• Or are you finding yourself writing complex SQL
Queries?
• For example: recursive joins or lots of joins
40. References
• Graph Databases
• Jim Webber, Ian Robinson and Emil Eifrem
• Apiary: A case for Neo4j?
• Anuj Mathur and Dhaval Dalal
• Code and scaffolding programs that we used are available
on - https://github.com/ EqualExperts/Apiary-Neo4j-
RDBMS-Comparison