Taming Large Databases
Neo4j, Inc. All rights reserved 2022
Ravindranatha Anthapu
Principal Consultant – Professional Services
Objectives
• What are large databases?
• What issues are faced:
◦ How to identify them,
◦ What approaches can work,
◦ And how to educate customers to avoid these issues
• Understand some common mistakes (and how to avoid them!)
What are Large Databases and Why?
• How the threshold was determined:
◦ It is not a technical limitation of Neo4j.
◦ It is driven more by licensing and infrastructure cost.
◦ It is driven by issues observed in the field.
• What are large databases?
◦ Databases bigger than 512 GB.
• Large databases are unforgiving with non-optimal models.
• Query SLAs can be hugely impacted.
Issues Observed
• Performance issues:
◦ Data model not optimal
◦ Over-reliance on indexes
◦ Over-reliance on property-based conditional traversals
◦ Not understanding how property access works*
• Bad write performance:
◦ Not understanding how locking works in Neo4j*
* Addressed in 5.0 with the new store format
Identifying Issues
• We will take a look at different use cases gathered from the field
• Identify what the issues can be
• Review the options to address those issues
Scenario 1
Graph data size:
● Node store - 13 GB
● Relationship store - 45 GB
● Property store - 207 GB
● Property (arrays) - 7 GB
● Property (strings) - 149 GB
● Indexes - 166 GB
● Total - 587 GB
Observations:
• Node and relationship stores look normal.
• The index store is too big; this indicates over-reliance on indexes.
• The property store is huge. That is not an issue unless properties are accessed too often; query performance depends on how queries are written.
• The string property store is also huge, meaning there are a lot of string properties. Again, query performance depends on how queries are written.
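A quick way to review these store sizes (besides reading them from debug.log) is APOC's monitor procedure. A sketch, assuming APOC is installed; the exact yielded field names can vary by version:

```cypher
// Report on-disk store sizes (in bytes) for the running database
CALL apoc.monitor.store()
YIELD nodeStoreSize, relStoreSize, propStoreSize,
      stringStoreSize, arrayStoreSize, totalStoreSize
RETURN nodeStoreSize, relStoreSize, propStoreSize,
       stringStoreSize, arrayStoreSize, totalStoreSize
```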
Scenario 1 - Continued
Before: 216,608 total db hits in 202 ms. The query retrieves the same Profile node multiple times:

MATCH (p:Person)
WHERE ((p)
    <-[:HAS_ASSOCIATED_CONTENT]-(:Content)
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  OR (p)
    <-[:HAS_ASSOCIATED_CONTENT]-(:SubAccount)
    <-[:OWNS]-(:Vendor)
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  OR EXISTS(
    (:Vendor {id: '*'})
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  ))
WITH p ORDER BY p.name
WITH collect(p) as persons, count(p) as totalCount
RETURN persons[$offset..($offset + $limit)] as persons, totalCount

After: 346 total db hits in 4 ms. The query retrieves the Profile node once using an index, then uses that node to traverse and make decisions:

MATCH (profile:Profile {profileType: $profileType, profileId: toInteger($profileId)})
OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO|:HAS_ADMIN_ACCESS_TO]->(starVendor:Vendor {id: '*'})
CALL apoc.when(
  starVendor IS NOT NULL,
  '
  MATCH (p:Person)
  RETURN p as person
  ',
  '
  OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO]->(:Content)-[:HAS_ASSOCIATED_CONTENT]->(pub1:Person)
  OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO]->(:Vendor)-[:OWNS]->(:SubAccount)-[:HAS_ASSOCIATED_CONTENT]->(pub2:Person)
  WITH COALESCE(pub1, pub2) as person
  WHERE person IS NOT NULL
  RETURN DISTINCT person
  ',
  {
    profile: profile,
    starVendor: starVendor
  }
)
YIELD value
WITH value.person as pub ORDER BY pub.name
WITH collect(pub) as persons, count(pub) as totalCount
RETURN persons[0..50] as publishers, totalCount
Scenario 1 - Continued
• Why did the first query take more time than the modified one?
• Even though Profile has an index on its lookup properties, the planner cannot leverage the index when the node is matched as part of a traversal pattern:

WHERE ((p) <-[:HAS_ASSOCIATED_CONTENT]-(:Content)
<-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})

• Retrieving the Profile node first and then using it as part of the traversal reduces the amount of work done by the DB.
• Leverage the index store and reduce the property-based conditional traversals.
Scenario 2
• The DB has 1.5 billion nodes and 4.5 billion relationships.
• The DB is 2.2 TB in size.
• An Address has incoming and outgoing transactions, with amounts moving in/out of the address.
• Queries
◦ Return the address's current balance
◦ Return the total number of transactions an address has made
Scenario 2 - Continued
• The model is simple and does not suffer from other issues.
• The query calculates an address's current balance at run time.
• This works well for addresses with a small number of transactions.
• For addresses with millions of transactions, it takes a lot of time.
• The best way to address this issue is to leverage triggers (transaction handlers) that update the Address node with in/out flow amounts.
• * In Neo4j, transaction handlers operate at the DB level, not at the node or relationship level, so be careful not to create more than one trigger.
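As a sketch of the trigger approach, using APOC's transaction-handler wrapper (apoc.trigger.add in Neo4j 4.x; renamed apoc.trigger.install in 5.x). The SENT relationship type and the amount/balance/flow property names here are assumptions for illustration, not part of the original model:

```cypher
// Keep per-address in/out flow totals up to date as transaction
// relationships are written, instead of aggregating at query time.
CALL apoc.trigger.add(
  'updateAddressFlows',
  '
  UNWIND $createdRelationships AS r
  WITH r WHERE type(r) = "SENT"
  WITH startNode(r) AS src, endNode(r) AS dst, r.amount AS amt
  SET src.outFlow = coalesce(src.outFlow, 0) + amt,
      dst.inFlow  = coalesce(dst.inFlow, 0) + amt,
      src.balance = coalesce(src.balance, 0) - amt,
      dst.balance = coalesce(dst.balance, 0) + amt
  ',
  {phase: 'before'}
)
```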
Scenario 2 - Continued
• The given model is fine for answering the current questions.
• But is it enough to answer these future questions?
◦ For a given date range, what is the in/out flow of amounts for an address?
◦ Provide the daily summary of in/out flows for a given address.
◦ What was the activity for the last 6 months?
◦ Number of transactions
◦ In/out flow (daily and total)
Scenario 2 - Continued
(Diagram: updated model adding daily summary nodes per address)
Scenario 2 - Continued
• With the updated model we get these advantages:
◦ We can answer how many transactions happened on a given day.
◦ We can answer how much amount was exchanged on a given day.
◦ We can answer many statistical questions using the daily summary nodes.
◦ For a given address, we can answer total amount details and daily amount movements, as well as the same for a given date range.
• All of these queries can be answered in a few milliseconds using a small amount of page cache (< 64 GB), even for a large database like this (> 3 TB).
• For scenarios that need to look at every individual transaction, users can be reasonable about the SLA, as a large amount of data is being shown.
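With daily summary nodes in place, a date-range query touches only a handful of small nodes instead of millions of transactions. A sketch, where the DailySummary label, the HAS_DAILY_SUMMARY relationship, and the property names are assumptions for illustration:

```cypher
// In/out flow for one address over a date range,
// read entirely from the summary nodes
MATCH (a:Address {address: $address})-[:HAS_DAILY_SUMMARY]->(d:DailySummary)
WHERE d.date >= date($from) AND d.date <= date($to)
RETURN sum(d.inFlow)  AS totalIn,
       sum(d.outFlow) AS totalOut,
       sum(d.txCount) AS totalTransactions
```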
Scenario 3 – Product Recommendation
• The DB is 1 TB in size.
• Around 1,000 views/second are added.
• Queries
◦ On visiting a page, show related products as recommendations based on the last 5 products visited.
Scenario 3 - Continued
• The model is simple and straightforward.
• The ingestion rate can be impacted for the most popular products due to locking.
• If a Browser node has a lot of page views, getting product recommendations can take time.
• It also requires more page cache, as we need to retrieve all the views the browser is associated with and sort them to find the latest views and the products associated with them.
• As the DB grows, it requires more and more page cache to answer the questions quickly.
Scenario 3 – Continued
(Diagram: updated model with StoreProduct nodes and a LATEST/PREV view chain)
Scenario 3 - Continued
• By introducing a StoreProduct node, we reduce the locking pressure on the Product node. This improves write performance.
• Another change is introducing a LATEST relationship that points to the most recent product view. This acts as a pointer to the latest view.
• We also connect the product views to each other using a PREV relationship.
• Using the LATEST and PREV relationships, we can traverse the views in the order they were created without reading properties and sorting them. This reduces the pressure on the page cache and the property store.
• Creating a Java stored procedure to answer the query (traversing only the required steps) can keep query performance roughly constant even as the database grows. (Reduce the working data set, so we can use the page cache more effectively.)
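The LATEST/PREV chain can also be walked directly in Cypher before resorting to a stored procedure. A sketch, where the Browser/View/Product labels and the VIEWED and RELATED_TO relationships are assumptions for illustration:

```cypher
// Last 5 views via the linked list, then their related products -
// no property reads or sorting required to establish recency
MATCH (b:Browser {id: $browserId})-[:LATEST]->(latest)
MATCH (latest)-[:PREV*0..4]->(view)
MATCH (view)-[:VIEWED]->(p:Product)-[:RELATED_TO]->(rec:Product)
RETURN DISTINCT rec
```

A stored-procedure version would stop expanding after exactly five hops, keeping the working set constant as the chain grows.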
Summary
• Review the DB store sizes to see how the DB is growing. These values are available in debug.log.
• Check indexes and understand how they are being used.
• Lookup/collect/sort and filter operations are costlier than traversals.
• Review queries for conditional property traversals and see if they can be avoided using other traversal patterns.
• If there is a pattern of traversing a path until a condition is satisfied, leveraging stored procedures might help reduce the pressure on the page cache, giving consistent performance.
Summary
• Think outside the box, for example by maintaining aggregated data as the data builds, if those attributes are the most frequently accessed.
Thank you!
Questions?
Answers!
Editor's Notes
• Databases greater than 512 GB can be considered large databases. This is mainly due to how customers are using Neo4j rather than any technical aspect; the tag is based on license and infrastructure cost. In my experience, most customers run databases of this size on instances with around 256 GB of RAM.
• Applications with continuous ingestion of data can make it difficult to get a consistent backup.
• Consistency checking can be very time consuming.
• Consider using separate servers for backups.
• If a backup is not consistent, what are the options to clean it up for recovery purposes?