Taming Large Databases
Neo4j, Inc. All rights reserved 2022
Ravindranatha Anthapu
Principal Consultant – Professional Services
Objectives
• What are large databases?
• What issues are faced:
◦ How to identify them,
◦ What approaches can work,
◦ And how to educate customers to avoid these issues
• Understand some common mistakes (and how to avoid them!)
What are Large Databases and Why?
• How the threshold was determined:
◦ It is not a technical limitation of Neo4j.
◦ It is driven more by licensing and infrastructure cost.
◦ It is driven by issues observed in the field.
• What are large databases?
◦ Databases bigger than 512 GB.
• Large databases are unforgiving with non-optimal models.
• Query SLAs can be hugely impacted.
Issues Observed
• Performance issues:
◦ Data model not optimal
◦ Over-reliance on indexes
◦ Over-reliance on property-based conditional traversals
◦ Not understanding how property access works*
• Bad write performance:
◦ Not understanding how locking works in Neo4j*
* Addressed in 5.0 with the new store format
Identifying Issues
• We will take a look at different use cases gathered from the field
• Identify what the issues can be
• Review the options to address those issues
Scenario 1
Graph data size:
● Node store - 13 GB
● Relationship store - 45 GB
● Property store - 207 GB
● Property (arrays) - 7 GB
● Property (strings) - 149 GB
● Indexes - 166 GB
● Total - 587 GB
Observations:
• Node and relationship stores look normal.
• The index store is too big; this indicates over-reliance on indexes.
• The property store is huge. That is not an issue unless properties are accessed too often; query performance depends on how queries are written.
• The string property store is also huge, meaning there are a lot of string properties. Again, query performance depends on how queries are written.
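A quick way to review these store sizes (besides reading them from debug.log) is APOC's monitor procedure. A sketch, assuming APOC is installed; the exact yielded field names can vary by version:

```cypher
// Report on-disk store sizes (in bytes) for the running database
CALL apoc.monitor.store()
YIELD nodeStoreSize, relStoreSize, propStoreSize,
      stringStoreSize, arrayStoreSize, totalStoreSize
RETURN nodeStoreSize, relStoreSize, propStoreSize,
       stringStoreSize, arrayStoreSize, totalStoreSize
```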
Scenario 1 - Continued
Before: 216,608 total db hits in 202 ms. The query retrieves the same Profile node multiple times:

MATCH (p:Person)
WHERE ((p)
    <-[:HAS_ASSOCIATED_CONTENT]-(:Content)
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  OR (p)
    <-[:HAS_ASSOCIATED_CONTENT]-(:SubAccount)
    <-[:OWNS]-(:Vendor)
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  OR EXISTS(
    (:Vendor {id: '*'})
    <-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})
  ))
WITH p ORDER BY p.name
WITH collect(p) as persons, count(p) as totalCount
RETURN persons[$offset..($offset + $limit)] as persons, totalCount

After: 346 total db hits in 4 ms. The query retrieves the Profile node once using an index, then uses that node to traverse and make decisions:

MATCH (profile:Profile {profileType: $profileType, profileId: toInteger($profileId)})
OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO|:HAS_ADMIN_ACCESS_TO]->(starVendor:Vendor {id: '*'})
CALL apoc.when(
  starVendor IS NOT NULL,
  '
  MATCH (p:Person)
  RETURN p as person
  ',
  '
  OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO]->(:Content)-[:HAS_ASSOCIATED_CONTENT]->(pub1:Person)
  OPTIONAL MATCH (profile)-[:HAS_ACCESS_TO]->(:Vendor)-[:OWNS]->(:SubAccount)-[:HAS_ASSOCIATED_CONTENT]->(pub2:Person)
  WITH COALESCE(pub1, pub2) as person
  WHERE person IS NOT NULL
  RETURN DISTINCT person
  ',
  {
    profile: profile,
    starVendor: starVendor
  }
)
YIELD value
WITH value.person as pub ORDER BY pub.name
WITH collect(pub) as persons, count(pub) as totalCount
RETURN persons[0..50] as publishers, totalCount
Scenario 1 - Continued
• Why did the first query take more time than the modified one?
• Even though Profile has an index on its lookup properties, the planner cannot leverage the index when the node is matched as part of a traversal pattern:

WHERE ((p) <-[:HAS_ASSOCIATED_CONTENT]-(:Content)
<-[:HAS_ACCESS_TO]-(:Profile {profileType: $profileType, profileId: toInteger($profileId)})

• Retrieving the Profile node first and then using it as part of the traversal reduces the amount of work done by the DB.
• Leverage the index store and reduce the property-based conditional traversals.
Scenario 2
• The DB has 1.5 billion nodes and 4.5 billion relationships.
• The DB is 2.2 TB in size.
• An Address has incoming and outgoing transactions, with amounts moving in/out of the address.
• Queries
◦ Return the address's current balance
◦ Return the total number of transactions an address has made
Scenario 2 - Continued
• The model is simple and does not suffer from other issues.
• The query calculates an address's current balance at run time.
• This works well for addresses with a small number of transactions.
• For addresses with millions of transactions, it takes a lot of time.
• The best way to address this issue is to leverage triggers (transaction handlers) that update the Address node with in/out flow amounts.
• * In Neo4j, transaction handlers operate at the DB level, not at the node or relationship level, so be careful not to create more than one trigger.
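As a sketch of the trigger approach, using APOC's transaction-handler wrapper (apoc.trigger.add in Neo4j 4.x; renamed apoc.trigger.install in 5.x). The SENT relationship type and the amount/balance/flow property names here are assumptions for illustration, not part of the original model:

```cypher
// Keep per-address in/out flow totals up to date as transaction
// relationships are written, instead of aggregating at query time.
CALL apoc.trigger.add(
  'updateAddressFlows',
  '
  UNWIND $createdRelationships AS r
  WITH r WHERE type(r) = "SENT"
  WITH startNode(r) AS src, endNode(r) AS dst, r.amount AS amt
  SET src.outFlow = coalesce(src.outFlow, 0) + amt,
      dst.inFlow  = coalesce(dst.inFlow, 0) + amt,
      src.balance = coalesce(src.balance, 0) - amt,
      dst.balance = coalesce(dst.balance, 0) + amt
  ',
  {phase: 'before'}
)
```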
Scenario 2 - Continued
• The given model is fine for answering the current questions.
• But is it enough to answer these future questions?
◦ For a given date range, what is the in/out flow of amounts for an address?
◦ Provide the daily summary of in/out flows for a given address.
◦ What was the activity for the last 6 months?
◦ Number of transactions
◦ In/out flow (daily and total)
Scenario 2 - Continued
(Diagram: updated model adding daily summary nodes per address)
Scenario 2 - Continued
• With the updated model we get these advantages:
◦ We can answer how many transactions happened on a given day.
◦ We can answer how much amount was exchanged on a given day.
◦ We can answer many statistical questions using the daily summary nodes.
◦ For a given address, we can answer total amount details and daily amount movements, as well as the same for a given date range.
• All of these queries can be answered in a few milliseconds using a small amount of page cache (< 64 GB), even for a large database like this (> 3 TB).
• For scenarios that need to look at every individual transaction, users can be reasonable about the SLA, as a large amount of data is being shown.
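With daily summary nodes in place, a date-range query touches only a handful of small nodes instead of millions of transactions. A sketch, where the DailySummary label, the HAS_DAILY_SUMMARY relationship, and the property names are assumptions for illustration:

```cypher
// In/out flow for one address over a date range,
// read entirely from the summary nodes
MATCH (a:Address {address: $address})-[:HAS_DAILY_SUMMARY]->(d:DailySummary)
WHERE d.date >= date($from) AND d.date <= date($to)
RETURN sum(d.inFlow)  AS totalIn,
       sum(d.outFlow) AS totalOut,
       sum(d.txCount) AS totalTransactions
```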
Scenario 3 – Product Recommendation
• The DB is 1 TB in size.
• Around 1,000 views/second are added.
• Queries
◦ On visiting a page, show related products as recommendations based on the last 5 products visited.
Scenario 3 - Continued
• The model is simple and straightforward.
• The ingestion rate can be impacted for the most popular products due to locking.
• If a Browser node has a lot of page views, getting product recommendations can take time.
• It also requires more page cache, as we need to retrieve all the views the browser is associated with and sort them to find the latest views and the products associated with them.
• As the DB grows, it requires more and more page cache to answer the questions quickly.
Scenario 3 – Continued
(Diagram: updated model with StoreProduct nodes and a LATEST/PREV view chain)
Scenario 3 - Continued
• By introducing a StoreProduct node, we reduce the locking pressure on the Product node. This improves write performance.
• Another change is introducing a LATEST relationship that points to the most recent product view. This acts as a pointer to the latest view.
• We also connect the product views to each other using a PREV relationship.
• Using the LATEST and PREV relationships, we can traverse the views in the order they were created without reading properties and sorting them. This reduces the pressure on the page cache and the property store.
• Creating a Java stored procedure to answer the query (traversing only the required steps) can keep query performance roughly constant even as the database grows. (Reduce the working data set, so we can use the page cache more effectively.)
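The LATEST/PREV chain can also be walked directly in Cypher before resorting to a stored procedure. A sketch, where the Browser/View/Product labels and the VIEWED and RELATED_TO relationships are assumptions for illustration:

```cypher
// Last 5 views via the linked list, then their related products -
// no property reads or sorting required to establish recency
MATCH (b:Browser {id: $browserId})-[:LATEST]->(latest)
MATCH (latest)-[:PREV*0..4]->(view)
MATCH (view)-[:VIEWED]->(p:Product)-[:RELATED_TO]->(rec:Product)
RETURN DISTINCT rec
```

A stored-procedure version would stop expanding after exactly five hops, keeping the working set constant as the chain grows.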
Summary
• Review the DB store sizes to see how the DB is growing. These values are available in debug.log.
• Check indexes and understand how they are being used.
• Lookup/collect/sort and filter operations are costlier than traversals.
• Review queries for conditional property traversals and see if they can be avoided using other traversal patterns.
• If there is a pattern of traversing a path until a condition is satisfied, leveraging stored procedures might help reduce the pressure on the page cache, giving consistent performance.
Summary
• Think outside the box, for example by maintaining aggregated data as the data builds, if those attributes are the most frequently accessed.
Thank you!
Questions?
Answers!
Editor's Notes
• Databases greater than 512 GB can be considered large databases. This is mainly due to how customers are using Neo4j rather than any technical aspect; the tag is based on license and infrastructure cost. In my experience, most customers run databases of this size on instances with around 256 GB of RAM.
• Applications with continuous ingestion of data can make it difficult to get a consistent backup.
• Consistency checking can be very time consuming.
• Consider using separate servers for backups.
• If a backup is not consistent, what are the options to clean it up for recovery purposes?