Exploring Neo4j Graph Database
to build a
Fast Data Access Layer
with Near-Real time Data Ingestion
Sambit Banerjee
05-April-2020
Overview
Future-state design considerations for the modernization initiatives of large-scale legacy systems often require addressing
various patterns for accessing data from the backend data sources in near-real time, by APIs, reports and queries, dashboards,
etc.
Some of the common solutions for such requirements include using –
• replicated reporting databases with optimized data model where data is transformed after replication from the
source databases
• data virtualization (with some degree of local caching of data) / data federation
These solutions work in many cases, depending on the degree of acceptance from the stakeholders.
However, besides the challenge of fulfilling the near-real-time data access requirement, all of these solutions have
certain limitations in terms of initial implementation effort, impact on the consumer systems, and ongoing
management.
While working on such a modernization initiative for a large-scale legacy system for one of my clients a few months ago,
it was well in order to explore a different approach to address some of these challenges, and it was decided to
conduct an extensive POC with the Neo4j Graph database.
Exploring the Neo4j Graph database in depth was a great exercise for me. The major part of the overall
outcome was pleasantly favorable, although some limitations were observed as well. This document explains both.
NOTE:- In order to protect the business information of my client, I have used placeholders / fictitious names while describing
the use case, data model, data attributes, queries, etc. in this document.
POC Objective
 Investigate the feasibility of using Neo4j Graph database to establish a low-latency read-only data access
layer with the following characteristics –
1) the data from the data access layer is used by different consumers across the enterprise such as APIs, real-time
reporting (operational, management, ad-hoc), analytics, dashboards, and many more.
2) the data is structured, and, the data access layer is continuously hydrated from multiple backend RDBMS (including
large legacy databases) in near-real time, e.g., within 2-5 minutes of the source databases making the incremental
dataset available to the data access layer for consumption. The volume of the incremental data can be quite large,
e.g., 3 million business transactions generated in a few minutes, during the peak usage of the source systems.
3) maintain the data model of the data access layer as close as possible to the source data models so that the codebases of the
existing reports and API queries don’t have to go through a complete or significant overhaul
4) the data access layer can support high-performance complex queries (e.g., many joins, filters, sorts, grouping)
against large volumes of structured data, and return the resultant dataset to the consumers in a sub-second time
frame
 Demonstrate a complete use case with a high performance query run against the full data volume taken
from an existing large legacy production system
POC Use Case
Department XYZ has Managers and Agents to manage the Accounts of millions of
Customers. The Agents and the Managers run an operational report multiple
times throughout a business day to monitor the status of various transactions on
the Customer accounts, take appropriate actions, and report the same to senior
management. The distribution of a subset of the overall operational workload
within the XYZ department is shown in the table below.
Each execution of the selected operational report (based on an Oracle SQL query
– ref: Appendix A) runs against billions of records in the corresponding Oracle
database of the legacy system, and, typically completes in 6 to 8 minutes.
The goal of the POC was to –
a) build the same volume of dataset, while keeping the same data model,
in a Neo4j Graph database.
b) replicate the Oracle SQL query as a Neo4j Cypher query (ref:
Appendix B) with the same logic, and evaluate the performance of the
Neo4j Cypher query against that of the Oracle SQL query. In order to
replicate a similar operational scenario, the Neo4j query should run
with different degrees of session concurrency, with each session
representing either a Manager or an Agent.
c) after loading the initial data volume in the Neo4j Graph database, add
to it the incremental data generated in the legacy Oracle database
during the peak processing window, and, assess the load performance
of the incremental data in Neo4j Graph database.
Manager   Agent     Customer Count   Account Count   Customer Count   Account Count
                    by Agent         by Agent        by Manager       by Manager
M1        M1-A1          151            313,692           1,081          2,336,592
          M1-A2          211            441,344
          M1-A3          200            441,744
          M1-A4          203            373,115
          M1-A5          154            211,451
          M1-A6           23              7,816
          M1-A7          139            547,430
M2        M2-A1           27            268,540             235          4,764,458
          M2-A2           10            126,531
          M2-A3           21            400,615
          M2-A4           44            954,093
          M2-A5           41          1,651,378
          M2-A6           92          1,363,301
M3        M3-A1          184            435,564             920         10,045,642
          M3-A2           16            455,614
          M3-A3           34            875,483
          M3-A4           58          1,358,458
          M3-A5           52            478,781
          M3-A6           59          3,214,290
          M3-A7            6             20,415
          M3-A8          248          1,343,706
          M3-A9          219            838,533
          M3-A10          44          1,024,798
POC Activities
The POC activities, at a high level, included the following –
a) Build the Neo4j environment and a Neo4j Graph database in AWS, as it was quicker to adjust the size of the Neo4j
database and runtime environment after different tests.
b) Develop 20+ ETL processes to extract the target dataset (17 tables and 356 columns - with appropriate data
masking) from the legacy Oracle database – for both initial and incremental data.
c) Develop 40+ Unix Shell and Cypher scripts to load initial and incremental data in the Neo4j graph database.
d) Develop Cypher query with the same logic as the Oracle SQL query. The complexity of this query involves multi-
level equi-joins and outer joins on 8 entities (i.e., Oracle tables / Graph node types), evaluation and transformation
of data items in the filters and expressions, analytical functions to rank records within subgroups, union, multi-
column sorting with uniqueness. [ref: Appendix A & B]
e) Develop Unix shell scripts and Python programs to run the Cypher query with various query parameters and
different degrees of concurrency (e.g., multi-threading), with extensive logging of runtime statistics, and to capture and
consolidate test results. Developing these scripts and programs was needed in lieu of using LoadRunner-type tools.
[Why? It’s a different story!] A minimal sketch of such a test driver appears after this list.
f) Extract data from the legacy Oracle database and load the same in the Neo4j graph database. The initial data
loading exercise spanned a few weeks as certain data issues were found and corrected and some load scripts were
fine-tuned, after which the extraction and load processes were restarted from the beginning, a few times.
g) Conduct multiple tests to run the Cypher query with different parameters, capture performance statistics, and
analyze.
h) Conduct multiple tests to extract and load incremental data, capture performance statistics, and analyze.
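As referenced in item (e), the following is a minimal, hypothetical sketch of such a test driver built on the official Neo4j Python driver. The URI, credentials, file name, and parameter values are placeholders (not the actual POC scripts); the parameter names ($dt, $mgr_list, $mgr, $agnt) follow the Cypher query in Appendix B.

# Minimal sketch (not the actual POC code): run the Appendix B query from N
# concurrent sessions and log per-session timings.
import time
from concurrent.futures import ThreadPoolExecutor
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"               # placeholder
AUTH = ("neo4j", "password")                # placeholder
QUERY = open("report_query.cypher").read()  # Cypher text from Appendix B

# One parameter set per concurrent session (placeholder values).
SESSIONS = [
    {"dt": "2020-01-31", "mgr_list": ["ALL"], "mgr": "M1", "agnt": "ALL"},
    {"dt": "2020-01-31", "mgr_list": ["ALL"], "mgr": "ALL", "agnt": "M1-A1"},
]

def run_one(driver, params):
    """Run the report query in its own session and time it."""
    start = time.time()
    with driver.session() as session:
        rows = session.run(QUERY, **params).data()
    return params, len(rows), time.time() - start

if __name__ == "__main__":
    driver = GraphDatabase.driver(URI, auth=AUTH)
    with ThreadPoolExecutor(max_workers=len(SESSIONS)) as pool:
        for params, n_rows, secs in pool.map(lambda p: run_one(driver, p), SESSIONS):
            print(f"{params['mgr']}/{params['agnt']}: {n_rows} rows in {secs:.2f}s")
    driver.close()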
Neo4j POC Environment
Hardware:
• Single instance Neo4j Graph database hosted in an AWS EC2 instance
• AWS EC2 instance:- m5d.24xlarge - 384 GB RAM, 96 vCPUs, 5.3 TB SSD & NVMe disks
Software:
• Neo4j 3.5.3 Enterprise Edition
• Python 3.6
o Used for developing test driver programs to orchestrate concurrent executions of a large number of
Cypher query and update scripts against the Neo4j graph database, with extensive logging and
consolidation of test results
Neo4j Graph Instance:
• Neo4j JVM heap = 31 GB
• Neo4j Pagecache = 317 GB
• Size of the Neo4j data store (on disk) after loading initial test data = 2.4 TB
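For reference, the heap and page cache sizing above corresponds to neo4j.conf settings along the following lines (an illustrative sketch for Neo4j 3.5.x; the exact configuration used in the POC is not reproduced here):

dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=317g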
Neo4j POC Graph Data Elements by numbers, as loaded
Types of Nodes: 17
Types of Relationships: 27
Number of Nodes: 4,880,036,997
Number of Relationships: 6,375,650,061
Number of Properties: 356
Node Label Count
TxnType1 98,409,635
Exception 7,776,858
LkUp 175
TxnType2 1,031,355,856
AcctState 1,132,603,504
AcctMap 17,146,692
AcctSmry 17,146,692
AcctAttrib 25,543,289
Account 17,146,692
TxnType3 1,267,954,368
Customer 93,838
Personnel 86,136
TxnType4 15,697,975
TxnType5 1,213,105,143
Calendar 34,555
AcctProp 17,146,692
LostTxns 17,146,692
Relationship Type Count
Reln_E 1,642,121
Reln_F 98,409,635
Reln_G 7,776,858
Reln_D1 123,251
Reln_D2 7,776,858
Reln_H 1,132,603,504
Reln_I 1,132,603,504
Reln_J 17,146,692
Reln_K 17,146,692
Reln_B2 2,045,166
Reln_C1 17,146,692
Reln_C2 2,045,166
Reln_L 7,776,858
Reln_M 7,776,858
Reln_B1 17,146,692
Reln_N 25,543,289
Reln_O 2,044,778
Reln_P 17,146,692
Reln_A2 5,513
Reln_Q 1,031,355,856
Reln_A1 111,738
Reln_R 1,213,105,143
Reln_S 15,697,975
Reln_T 1,565,134,368
Reln_U 17,146,692
Reln_V 17,146,692
Reln_W 2,044,778
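For context, counts like these can be collected without full store scans, for example via Cypher count-store lookups per label / relationship type, or the APOC metadata procedure that was already in use for this POC (illustrative statements, not the actual POC scripts):

MATCH (n:TxnType1) RETURN count(n);                  // count-store lookup for one label
MATCH ()-[r:Reln_T]->() RETURN count(r);             // count-store lookup for one relationship type
CALL apoc.meta.stats() YIELD labels, relTypesCount   // all label / relationship-type counts (APOC)
RETURN labels, relTypesCount;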
Neo4j POC Graph Data Model
Neo4j POC – Cypher Query Test
Test 1
 10 concurrent sessions –
• 1 session for 1 Manager for all Accounts in the corresponding portfolio
• 9 sessions for 9 Agents for all Accounts in their individual portfolios
 These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
Neo4j POC – Cypher Query Test (contd..)
Test 2
 3 concurrent sessions with 3 Managers running the query concurrently for all Accounts in their individual portfolios
 These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
Neo4j POC – Cypher Query Test (contd..)
Test 3
 15 concurrent sessions with 15 Agents running the query concurrently for all Accounts in their individual portfolios
 These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
Neo4j POC – Cypher Query Test (contd..)
Test 4
 26 concurrent sessions
• 3 sessions for 3 Managers for all Accounts in their corresponding portfolios
• 23 sessions for 23 Agents for all Accounts in their individual portfolios
 These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
Neo4j POC – Cypher Query Test – Conclusion
Overall, the Neo4j Graph Query tests performed much better than expected.
Comparing the query performance of a single instance Neo4j Graph database with that of the legacy Oracle database, it was
observed that -
 All Neo4j Cypher queries, under different degrees of concurrencies, completed in less than 25 seconds. Most of the Cypher
queries completed in less than 5 seconds, with ‘All’ or ‘Partial’ target dataset cached in memory
 With ‘All’ target dataset cached in memory (equivalent of Oracle warm cache), Neo4j performed consistently under 15
seconds, whereas it took ~46 seconds for the corresponding Oracle SQL query to complete with warm cache in the legacy
Oracle database
 With some of the target dataset cached in memory (equivalent of Oracle cold cache), Neo4j performed much better than
Oracle. All Neo4j Cypher queries with cold caching performed consistently under 20 seconds, whereas the same query with
cold cache in the legacy Oracle database completed in 6 to 8 minutes. [Note – the legacy Oracle database was hosted on a
physical server with 990 GB RAM and 40 physical CPUs.]
 CPU utilization of the AWS EC2 instance hosting the Neo4j Graph database during these tests was low. Out of the 96 vCPUs,
the total CPU consumption of that EC2 instance didn’t exceed 20%, even for the test case with 26 concurrent query
sessions.
Neo4j POC – Graph Update Test
Identifying Incremental Data
A certain type of business transaction, which made up almost 85% of all business transactions carried out in the legacy Oracle database,
was selected in order to evaluate the performance of updating the Neo4j Graph database with incremental data generated in the
legacy Oracle database.
• This was to ensure that the ‘Graph Update use case’ represented the high volume business transactions at a sustained minimum
peak rate of 650 business transactions / second during the daily peak window of the corresponding legacy system.
• 1 business transaction of this type consisted of multiple SQL insert and update operations to 9 Oracle tables corresponding to the
Neo4j node labels – Account, AcctMap, AccState, AcctSmry, TxnType2, TxnType3, TxnType5, LostTxns, Exception.
Collecting Incremental Data
• Data related to 2.9 million business transactions of the selected type was collected from the busiest 4 hours window of the
corresponding legacy production system.
• In order to maintain ACID compliance in the Neo4j Graph database with respect to the corresponding business transactions of
the legacy system, the test data was packaged into a unit called ‘Transaction Group’, where 1 ‘Transaction Group’ contained 1000
business transactions of the selected type. The following table shows the distribution of Oracle records in 1 ‘Transaction Group’ -
Records from the legacy Oracle database in 1 Transaction Group (= 1,000 Business Transactions)

                                                  Min       Max     Average
# of records for Oracle SQL Insert operations      70     1,494         788
# of records for Oracle SQL Update operations     729    16,482       8,242
Neo4j POC – Graph Update Test (contd..)
Equivalent Neo4j Graph operations to consume Incremental Data
As the Neo4j Graph data model was created exactly the same as the legacy Oracle data model, the following Neo4j
Graph operations took place for updating the Neo4j Graph data for 1 Business Transaction of the selected type -
Neo4j Graph operations for 1 Business Transaction

Node Label    Create   Attributes Created   Create          Update   Attributes Changed
              Node     per New Node         Relationships   Node     per Update
Account                                                     1        1
AcctMap                                                     1        3
AcctState     1        81                   2               1        2
AcctSmry                                                    1        1
TxnType2      1        6                    1
TxnType3      1        18                   1
TxnType5      1        6                    1
LostTxns                                                    1        2
Exception     1        69                   4               1        2
Total         5        180                  9               6        11

Total Neo4j Graph operations for the target of 650 Business Transactions per second:
              3,250    117,000              5,850           3,900    7,150
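To illustrate the pattern summarized in this table (not the actual POC script), a single Business Transaction could be applied with parameterized Cypher along the following lines, executed as one transaction; the parameters ($acct_id, $txn_dt, etc.) and the specific properties touched are placeholders consistent with the fictitious names used in this document:

// Illustrative sketch only: update attributes on existing nodes and create the
// new transaction node + relationship for one Business Transaction.
MATCH (a:Account {id1: $acct_id})-[:Reln_B1]->(asmry:AcctSmry)
SET   a.dt3 = $txn_dt,                               // attribute update on Account
      asmry.ind2 = $smry_flag                        // attribute update on AcctSmry
CREATE (lt:TxnType3 {id1: $txn_id, s_no: $seq_no})   // new transaction node
CREATE (a)-[:Reln_T]->(lt)                           // relationship from the Account to the new node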
Neo4j POC – Graph Update Test (contd..)
Tests for Updating the Neo4j Graph Database with Incremental Data:
 Multiple tests were conducted to load the incremental data in the Neo4j Graph database, sequentially (i.e., the test
driver program running the tests in a single-thread) as well as with varying degree of parallelism (i.e., the test driver
program running the tests via multi-threaded concurrent child processes).
 Update tests were carried out with a subset of the test data (500 Transaction Groups), and then with all test data (i.e.,
2,909 Transaction Groups for 2.9 million Business Transactions)
 The Transaction Groups for each test run were distributed equally by the test driver program to all concurrently running
threads at any given point of time.
 Each running thread pre-established a dedicated connection to the Neo4j Graph database in order to have its own
dedicated session to run the corresponding Neo4j Cypher insert/update statements for the Business Transactions
allocated to it.
 Due to the mutually exclusive nature of the Business Transactions, each thread ran independently of the other parallel
threads, without any sort of application-induced contention among them.
 Neo4j Cypher statements (insert and update) were fine-tuned a few times in order to improve the performance and
achieve the results shown in the next few pages. A hypothetical sketch of such an update driver follows.
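For illustration only (the actual POC driver handled logging, error handling, and the full set of Cypher statements), a stripped-down version of such a multi-threaded update driver could look like the following. It splits the Transaction Groups evenly across worker threads, gives each thread its own dedicated session, and applies each Business Transaction as one explicit write transaction:

# Hypothetical sketch of the multi-threaded update driver (not the POC code).
import threading
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"        # placeholder
AUTH = ("neo4j", "password")         # placeholder
N_THREADS = 10                       # degree of parallelism

def apply_txn(tx, biz_txn):
    # One Business Transaction = a list of (cypher, params) insert/update steps,
    # applied atomically within a single Neo4j transaction.
    for cypher, params in biz_txn:
        tx.run(cypher, **params)

def worker(driver, txn_groups):
    with driver.session() as session:            # dedicated session per thread
        for group in txn_groups:                 # 1 group = 1,000 Business Transactions
            for biz_txn in group:
                session.write_transaction(apply_txn, biz_txn)

def run(all_groups):
    driver = GraphDatabase.driver(URI, auth=AUTH)
    chunks = [all_groups[i::N_THREADS] for i in range(N_THREADS)]   # even split
    threads = [threading.Thread(target=worker, args=(driver, c)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    driver.close()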
Neo4j POC – Graph Update Test (contd..)
Tests Results:
The following table shows the performance metrics of the Neo4j Graph update tests, which fell well short of the
target of loading 650 Business Transactions per second
Test#   # of Threads     Elapsed Time   Elapsed Time   # of Transaction   Total # of Business   # of Business Transactions
        (Degree of       (hh:mm:ss)     (Seconds)      Groups             Transactions          Processed per Second
        parallelism)                                   Processed          Processed
1         1              0:40:49          2,448.69       500                  500,000            204
2        10              0:37:42          2,262.03       500                  500,000            221
3        50              0:39:36          2,375.91       500                  500,000            210
4       100              0:40:28          2,428.37       500                  500,000            206
5         1              3:13:21         11,601.48     2,909                2,909,000            251
6        10              2:52:06         10,325.99     2,909                2,909,000            282
7        50              3:02:41         10,960.63     2,909                2,909,000            265
8       100              3:06:14         11,174.25     2,909                2,909,000            260
Neo4j POC – Graph Update Test (contd..)
Tests Results (contd..)
The following table is a comparison of the Neo4j Graph operations between the target of 650 Business Transactions
per second and the achieved max of 282 Business Transactions per second
                                                    Create   Attributes Created   Create          Update   Attributes Changed
Neo4j Graph operations                              Node     per New Node         Relationships   Node     per Update
Target   = 650 Business Transactions per second     3,250    117,000              5,850           3,900    7,150
Achieved = 282 Business Transactions per second     1,410    50,760               2,538           1,692    3,102

Key observations:
• Although the AWS EC2 instance hosting the single Neo4j Graph database instance had 96 vCPUs, the max CPU
consumption did not exceed 30% of the total CPU capacity during the load of incremental data.
• The degree of parallelism of 10, i.e., updating the Neo4j Graph database via 10 concurrent connections,
achieved the optimal performance for this test.
Neo4j POC – Graph Update Test (contd..)
Performance of Neo4j Graph Insert and Update operations at a glance
This chart shows the consistency of performance of the Neo4j Graph database insert and update operations for most Business Transactions, in relation to the corresponding footprint of incremental data packed in those Business Transactions.
Neo4j POC – Graph Update Test – Conclusion
In summary, the Neo4j Graph Update test for this POC achieved a throughput of 282 Business Transactions per second compared
to the target throughput of 650 Business Transactions per second.
However, in my opinion, Neo4j did reasonably well, considering that it was somewhat unfair to Neo4j to impose the
following key constraints –
1) Neo4j Graph data model was kept the same as the data model of the legacy Oracle database.
• The legacy Oracle data model itself needed a lot of improvements to operate optimally, so Neo4j can’t really be blamed for that.
• Normally, transitioning from an RDBMS data model to a Graph data model involves quite a bit of optimization to best realize
the benefits of a Graph database. Due to the criteria set for this POC, no such data model optimization was done.
2) All test data for the selected use case was stored in a single Neo4j Graph instance.
• In contrast, the legacy Oracle database stored the large volume (over 20TB) of data in hundreds of partitions at the file
system level, which would be a key factor for achieving high throughput of write transactions against any database.
• Now, on the other hand, the current architecture of Neo4j does not offer the ability for a single Neo4j Graph instance to
store data in partitions at the file system level. This creates a significant limitation for a single Neo4j Graph instance in
achieving high throughput of write transactions against a large data volume.
• However, it is possible to partition a large dataset among multiple Neo4j Graph database instances instead of
a single Neo4j Graph database instance, and then aggregate the resultant datasets from the Neo4j queries, run on those
multiple Neo4j Graph instances, at the application level to meet the business requirements. Undoubtedly, this would
require additional work and infrastructure footprint. [Note:- Neo4j v4.x has introduced a similar data partitioning feature
via multiple Neo4j Graph database instances, but it is not quite there yet in terms of offering all types of out-of-the-box
aggregate / analytical functions that can aggregate data across multiple Neo4j Graph database instances.]
Neo4j Graph POC – Final Thoughts
So, what’s the final verdict?
In my observation, it is certainly possible to use a single Neo4j Graph database instance to build a low-latency read-only data
access layer for fast data access by various types of consumers such as APIs, real-time reporting (operational, management, ad-
hoc), analytics, dashboards, etc.
In terms of the performance of concurrent complex queries against large data volume, this POC demonstrated that Neo4j
certainly passed with flying colors.
Hydrating a single Neo4j Graph database instance with bulk incremental data from various sources in near-real time is also
possible, by –
• establishing an optimized data model, especially when transitioning from the legacy RDBMS systems. Keep only those
data attributes in the Neo4j Graph database that are frequently accessed by the consumers of this fast data access layer.
• evaluating the maximum throughput at which a single Neo4j Graph database instance can consume incremental data for the
selected use cases. Use those metrics among the key considerations for sizing the Neo4j environment (i.e., max
volume of data to store in the Neo4j database instance, CPU and memory, etc.)
• determining optimal patterns and frequencies for loading incremental data from the source systems. Evaluate the usage
patterns of the incremental data, and prioritize the load sequences of the associated nodes / relationships / attributes.
For example, if an incremental dataset contains updates to 50 attributes, and only 10 of those attributes are accessed
by the consumers in near-real time while the remaining 40 attributes are accessed from the nightly batch/report jobs,
then those 10 attributes may be prioritized for the real-time load, and a lazy load of the remaining 40 attributes may be
implemented (sketched below).
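As a simple illustration of that last point (property names and parameters are placeholders, not the POC code), the two load paths could be as small as:

// Near-real-time path: set only the ~10 frequently accessed attributes
MATCH (a:Account {id1: $acct_id})
SET a += $hot_attrs;

// Nightly batch path: back-fill the remaining, rarely accessed attributes
MATCH (a:Account {id1: $acct_id})
SET a += $cold_attrs;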
With that, goodbye for now and take care!
Appendix
Appendix A – Oracle SQL Query
Parameters:- {PARAM1}, {PARAM2}
WITH
t1_list AS (SELECT c1 FROM Customer WHERE c1 IN (....) )
, p_dt AS (SELECT TRUNC(NVL(MAX(CAST(a.dt1 AS DATE)),
SYSDATE)) AS c_dt FROM tmp_dt a)
, mgr_agent AS
(SELECT * FROM (SELECT c.c1, c.c2, trim(c.c3) || nvl2(c.c4, ' ' ||
trim(c.c4) || ' ', ' ') || trim(c.c5) as cust_pers_alias1 FROM Customer c
JOIN cust_pers cp ON ( c.k1 = cp.k1 AND cp.type_cd = 2 AND cp.eff_dt
<= trunc(sysdate) AND (trunc(sysdate) < cp.exp_dt OR
cp.exp_dt is null) ) JOIN Personnel p ON (cp.k1 = c.k1) )
WHERE c.c1 in ("{PARAM1}") AND (:agnt = 'ALL' OR :agnt = agnt_nme)
AND (:mgr = 'ALL' OR :mgr = mgr_nme) )
, cal_rec AS
(SELECT cl.col1, ..., ..., ..., ...,..., colN FROM Calendar cl
WHERE "{PARAM2}" BETWEEN cl.dt1 and dt2 )
, hsr AS
( SELECT bel.k2, ma.c1, ma.c2, ma.c3, ma.c4, bel.c1 AS alias1,
bel.c2 AS alias2, bel.c3 AS alias3, bel.c4 AS alias4, ..., bel.c10 ,
CASE bel.cd1 WHEN 1 THEN 0 + CASE WHEN bel.cd3 IN
(n1, n2, n3, n4, n5, n6) THEN 0 ELSE 4 END
+ CASE WHEN bel.cd8 IN (2,3) THEN 1 WHEN bel.cd8 = 0 THEN 2
WHEN bel.cd8 = 6 THEN 3 ELSE 4 END WHEN 3 THEN 9 END AS
sort_order, ABS(bel.amt1), lt.seq_no1 AS expr_pri,
CASE bel.cd4 WHEN 1 THEN '...' WHEN 3 THEN '...' END AS
cd4_type, lf.cd3 AS someType, CASE bel.cd4 WHEN 1 THEN CASE
WHEN bel.cd7 IN (n1, n2, n3, n4, n5, n6) THEN '...' ELSE '....' END
WHEN 3 THEN CASE WHEN ABS(bel.amt8) <= 1 THEN '.....' WHEN
ABS(bel.amt8) <= 100 THEN '.....' WHEN ABS(bel.amt8) <= 1000
THEN '.....' ELSE '.....' END END AS cat_2
FROM mgr_agent ma JOIN Customer c ON (ma.c2 = c.c2) JOIN
Exception bel ON (sa.c2 = bel.c2) JOIN Account a ON
(a.c1 = bel.c1) JOIN cal_rec cr ON (cr.id1 = bel.id1)
JOIN AcctMap am ON (am.id1 = a.id1) JOIN AcctAttrib lf ON
(lf.id2 = am.id2) JOIN TxnType3 tt ON (bel.id4 = tt.id4)
WHERE bel.cd9 = 6 AND bel.cd7 = 1 AND (bel.cd4 = 1 OR bel.cd4 = 3)
AND ( (:sType = 'ALL') OR (lf.s_cd IN ( SELECT lst1.s_cd FROM
lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0 ) ) )
AND ( (:rType = 'ALL') OR (a.r_cd IN ( SELECT lrt1.r_cd FROM
lkup2 lrt1 WHERE lrt1.r_desc = :sType AND rownum > 0 ) ) )
AND bel.id3 = bel.id5 )
, lost_txns AS
(SELECT a.id1, ma.c1, ma.c2, ma.c3, ma.c4, a.id2, ma.c7 AS alias1,
a.c2 AS alias2, NULL AS alias3, 7 AS alias4, '.....' AS alias5,
CASE WHEN (dt1 > dt11 AND dt1 <= r_e_dt) THEN CASE
WHEN (a.dt1 >= cr.dt2 ) THEN '1...' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd = 33 ) THEN '2....' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd = 32 ) THEN '3....' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd IN (22, 26) ) THEN '4....' ELSE
'5....' END ELSE NULL END AS alias6, NULL AS alias7, NULL AS
alias8, asmry.id3 AS alias9, 10 AS alias10, NULL AS alias11,
NULL AS excp_pri, '.....' AS pr_type, lxn.sType_cd AS sType,
lxn.rType_cd AS rType, CASE WHEN lxn.sType_cd IN (2,3)
THEN '11...' ELSE '12....' END AS catg_1
FROM mgr_agent ma
CROSS JOIN p_dt pd JOIN LostTxns lxn ON (ma.id1 = lxn.id4)
JOIN AcctSmry asmry ON (lxn.id1 = lcrps.id1) JOIN cal_rec cr
ON (asmry.id1 = cr.id1 AND pd.c_dt > cr.dt3 AND pd.c_dt <=
cr.dt4) JOIN Account a ON (lxn.id1 = a.id1 AND a.cd10 = 1)
LEFT OUTER JOIN TxnType1 at_rcl ON (lxn.id1 = at_rcl.id2
AND at_rcl.cd3 IN (m1, m2, m3, m4) AND at_rcl.cd6 = 14
AND at_rcl.dt2 BETWEEN cr.st_dt AND cr.e_dt )
WHERE lxn.ind2 = 'N' AND asmry.ind2 = 'N' AND lxn.cd9 <> 2 AND
( (:sType = 'ALL') OR (lxn.sType_cd IN (SELECT lst1.s_cd FROM
lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0 ) ) )
AND ((:rType = 'ALL') OR (a.r_cd IN ( SELECT lrt1.r_cd FROM
lkup2 lrt1 WHERE lrt1.r_desc = :sType AND rownum > 0 ) ) ) )
, all_ex AS
(SELECT a.* FROM hsr a
UNION ALL
SELECT b.*, aliasN as catg_2 FROM lost_txns b )
/* Main query */
SELECT col1, col2, col3,..., substr(catg_1, 3) AS catg_1,
substr(catg_2, 2) AS catg_2, count(*) AS num_accts,
count(distinct ae2.id1) as num_distinct_accts,
substr(catg_1, 1, 2) as catg_1_order,
substr(catg_2, 1, 1) as catg_2_order
FROM (SELECT ae.*,
ROW_NUMBER() OVER (PARTITION BY ae.id1
ORDER BY ae.excp_pri DESC NULLS LAST) as rn
FROM all_ex ae) ae2
WHERE ae2.rn = 1 AND
(:eType = 'ALL' OR :eType = ae2.pr_type)
GROUP BY col1, col2, col3, col4, col5,
substr(catg_1, 1, 2), substr(catg_2, 1, 1);
NOTE:- This SQL query was actually 6 pages long. I have
shortened it to fit here by omitting a lot of column names
and textual values (re: ‘…’ items), replacing some of the
actual column names with fictitious names, etc. Same
applies for the Neo4j Cypher query in Appendix – B.
Appendix B – Neo4j Cypher Query
CALL apoc.cypher.run("
MATCH (ps:Personnel)-[:Reln_A2]->(c:Customer)-[:Reln_W]->(lxn:LostTxns)<-[:Reln_O]-(a:Account)-[:Reln_B1]-> (asmry:AcctSmry)-[:Reln_C2]->(cs:Calendar)
WHERE ('ALL' IN $mgr_list_in OR c.c_no IN $mgr_list_in) AND ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in) AND ('ALL' = $agnt_in OR CASE WHEN ps.c4 IS NULL OR ps.c4 = '' THEN trim(ps.c3) + ' '
+ trim(ps.c5) ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5) END = $agnt_in) AND (cs.i_e_dt <= $dt_in AND $dt_in < cs.e_dt)
OPTIONAL MATCH (a)-[:Reln_E]->(at:TxnType1)
WHERE (cs.s_dt <= at.dt1 AND at.dt1 < cs.e_dt)
RETURN a.id1 AS id1, -1 AS excp_pri, c.col1 AS col1, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme, a.id2 AS id2, c.c_no, 0 AS pr_type, CASE WHEN (a.dt3 >= cs.s_dt) THEN {srt: 1, catg_label: '....'}
WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 33) THEN {srt: 2, catg_label: '....'} WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 32) THEN {srt: 3, catg_label: '....'}
WHEN (at.dt1 >= cs.s_dt AND at.t_cd IN [22,26]) THEN {srt: 4, catg_label: '....'} ELSE {srt: 5, catg_label: '....'} END AS catg_1, {srt: 1, catg_label: ''} AS catg_2, a.r_cd AS r_cd
UNION ALL
MATCH cs_bel_lt_paths=(cs:Calendar)<-[:Reln_G]-(bel:Exception)-[:Reln_D1]->(lt:TxnType3)
WHERE (cs.s_dt <= $dt_in AND $dt_in < cs.e_dt) AND bel.id4 = bel.id7
WITH lt.id1 AS id1, cs_bel_lt_paths
ORDER BY lt.s_no DESC
WITH id1, COLLECT(cs_bel_lt_paths)[..1] AS ltst_cs_bel_lt_paths
UNWIND ltst_cs_bel_lt_paths AS p
WITH id1, nodes(p) AS ns
WITH id1, ns[0] AS cs, ns[1] as bel, ns[2] as lt
MATCH (lt)<-[:Reln_T]-(a:Account)<-[:Reln_J]-(am:AcctMap)-[:Reln_K]->(lf:AcctAttrib), (am)<-[:Reln_U]-(c:Customer)<-[:Reln_A2]-(ps:Personnel)
WHERE ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in) AND ('ALL' = $agnt_in OR
CASE WHEN ps.c4 IS NULL OR ps.c4 = '' THEN trim(ps.c3) + ' ' + trim(ps.c5) ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5) END = $agnt_in)
RETURN lt.id1 AS id1, lt.s_no AS excp_pri, c.nme AS nme, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme, a.id2 AS id2, c.c_no, bel.l_cd AS pr_type, CASE bel.l_cd WHEN 1 THEN CASE WHEN bel.cd3 IN
[n1, n2, n3, n4, n5, n6] THEN {srt: 21, catg_label: '....'} ELSE {srt: 22, catg_label: '....'} END WHEN 3 THEN CASE WHEN ABS(bel.amt3) <= 1 THEN {srt: 31, catg_label: '....'} WHEN ABS(bel.amt3) <= 100
THEN {srt: 32, catg_label: '....'} WHEN ABS(bel.amt3) <= 1000 THEN {srt: 33, catg_label: '....'} ELSE {srt: 34, catg_label: '....'} END END AS catg_1, CASE bel.l_cd WHEN 1 THEN CASE WHEN bel.s_cd IN [2, 3]
THEN {srt: 1, catg_label: '....'} ELSE {srt: 2, catg_label: '....'} END END AS catg_2, a.r_cd AS r_cd
", {dt_in:date($dt), mgr_list_in:$mgr_list, mgr_in: $mgr, agnt_in: $agnt}) yield value
WITH value AS v
ORDER BY v.id1, v.excp_pri DESC
WITH v.id1 AS id1, COLLECT(v)[..1] AS v0
UNWIND v0 as row
RETURN row.mgr_nme AS mgr_nme, row.agnt_nme AS agnt_nme, row.c_no AS c_no, row.c_nme AS c_nme, row.excp_pri, row.pr_typ AS excp_type, row.catg_1.catg_label AS catg_1_label,
row.catg_2.catg_label AS catg_2_label, COUNT(*) AS num_accts, COUNT(DISTINCT row.id2) AS num_dist_accts

More Related Content

What's hot

Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Denny Lee
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...Noriaki Tatsumi
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIDenny Lee
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paperrestassure
 
Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse
Top 5 Things to Know About Integrating MongoDB into Your Data WarehouseTop 5 Things to Know About Integrating MongoDB into Your Data Warehouse
Top 5 Things to Know About Integrating MongoDB into Your Data WarehouseMongoDB
 
Skills Portfolio
Skills PortfolioSkills Portfolio
Skills Portfoliorolee23
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInrajappaiyer
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Presentation cmg2016 capacity management essentials-boston
Presentation   cmg2016 capacity management essentials-bostonPresentation   cmg2016 capacity management essentials-boston
Presentation cmg2016 capacity management essentials-bostonMohit Verma
 

What's hot (20)

Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
Ensuring compliance of patient data with big data and bi [bdii 301-m] - (4078)
 
tecFinal 451 webinar deck
tecFinal 451 webinar decktecFinal 451 webinar deck
tecFinal 451 webinar deck
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
What's new in Spark 2.0?
What's new in Spark 2.0?What's new in Spark 2.0?
What's new in Spark 2.0?
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
GraphQL Summit 2019 - Configuration Driven Data as a Service Gateway with Gra...
 
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
C1803041317
C1803041317C1803041317
C1803041317
 
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BI
 
Hpdw 2015-v10-paper
Hpdw 2015-v10-paperHpdw 2015-v10-paper
Hpdw 2015-v10-paper
 
Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse
Top 5 Things to Know About Integrating MongoDB into Your Data WarehouseTop 5 Things to Know About Integrating MongoDB into Your Data Warehouse
Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse
 
Skills Portfolio
Skills PortfolioSkills Portfolio
Skills Portfolio
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
A Performance Study of Big Spatial Data Systems
A Performance Study of Big Spatial Data SystemsA Performance Study of Big Spatial Data Systems
A Performance Study of Big Spatial Data Systems
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Presentation cmg2016 capacity management essentials-boston
Presentation   cmg2016 capacity management essentials-bostonPresentation   cmg2016 capacity management essentials-boston
Presentation cmg2016 capacity management essentials-boston
 

Similar to Exploring Neo4j Graph Database as a Fast Data Access Layer

Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger AnalyticsItzhak Kameli
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperDerek Diamond
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
big-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfbig-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfssuserd397dd
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET Journal
 
Demantra Case Study Doug
Demantra Case Study DougDemantra Case Study Doug
Demantra Case Study Dougsichie
 
IRJET- Recommendation System based on Graph Database Techniques
IRJET- Recommendation System based on Graph Database TechniquesIRJET- Recommendation System based on Graph Database Techniques
IRJET- Recommendation System based on Graph Database TechniquesIRJET Journal
 
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and RSvm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and RIRJET Journal
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereSAP Technology
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapNeo4j
 
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demands
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demandsMongoDB .local Chicago 2019: MongoDB – Powering the new age data demands
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demandsMongoDB
 
Sql portfolio admin_practicals
Sql portfolio admin_practicalsSql portfolio admin_practicals
Sql portfolio admin_practicalsShelli Ciaschini
 
Enterprise resource planning_system
Enterprise resource planning_systemEnterprise resource planning_system
Enterprise resource planning_systemJithin Zcs
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 

Similar to Exploring Neo4j Graph Database as a Fast Data Access Layer (20)

Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Resume
ResumeResume
Resume
 
MineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White PaperMineDB Mineral Resource Evaluation White Paper
MineDB Mineral Resource Evaluation White Paper
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
big-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdfbig-book-of-data-science-2ndedition.pdf
big-book-of-data-science-2ndedition.pdf
 
Chaitanya_updated resume
Chaitanya_updated resumeChaitanya_updated resume
Chaitanya_updated resume
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
Chaitanya_updated resume
Chaitanya_updated resumeChaitanya_updated resume
Chaitanya_updated resume
 
Demantra Case Study Doug
Demantra Case Study DougDemantra Case Study Doug
Demantra Case Study Doug
 
IRJET- Recommendation System based on Graph Database Techniques
IRJET- Recommendation System based on Graph Database TechniquesIRJET- Recommendation System based on Graph Database Techniques
IRJET- Recommendation System based on Graph Database Techniques
 
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and RSvm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and Roadmap
 
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demands
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demandsMongoDB .local Chicago 2019: MongoDB – Powering the new age data demands
MongoDB .local Chicago 2019: MongoDB – Powering the new age data demands
 
Sql portfolio admin_practicals
Sql portfolio admin_practicalsSql portfolio admin_practicals
Sql portfolio admin_practicals
 
Enterprise resource planning_system
Enterprise resource planning_systemEnterprise resource planning_system
Enterprise resource planning_system
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Exploring Neo4j Graph Database as a Fast Data Access Layer

  • 1. Exploring Neo4j Graph Database to build a Fast Data Access Layer with Near-Real time Data Ingestion Sambit Banerjee 05-April-2020
  • 2. Overview The future state design considerations for the modernization initiatives of large scale legacy systems often require addressing various patterns for accessing data from the backend data sources in near-real time by APIs, reports & queries, dashboards, etc. Some of the common solutions for such requirements include using – • replicated reporting databases with optimized data model where data is transformed after replication from the source databases • data virtualization (with some degree of local caching of data) / data federation These solutions work in a lot of cases depending on the degrees of acceptances from the stakeholders. However, besides the challenges to fulfil the requirement of accessing data in near-real time, all of these solutions have certain limitations in terms of initial implementation efforts, impacts to the consumer systems, and, and on-going management of the same. While working on such a modernization initiative for a large scale legacy system for one of my clients a few months ago, exploring avenues for a different approach to address some of these challenges was well in order, and, it was decided to conduct an extensive POC with the Neo4j Graph database. It was a great exercise for me to have explored the Neo4j Graph database in deep details. The major part of the overall outcome was pleasantly favorable, although some limitations were observed as well. This document explains the same. NOTE:- In order to protect the business information of my client, I have used placeholders / fictitious names while describing the use case, data model, data attributes, queries, etc. in this document. Sambit Banerjee Page 2
  • 3. POC Objective  Investigate the feasibility of using Neo4j Graph database to establish a low-latency read-only data access layer with the following characteristics – 1) the data from the data access layer is used by different consumers across the enterprise such as APIs, real-time reporting (operational, management, ad-hoc), analytics, dashboards, and many more. 2) the data is structured, and, the data access layer is continuously hydrated from multiple backend RDBMS (including large legacy databases) in near-real time, e.g., within 2-5 minutes of the source databases making the incremental dataset available to the data access layer for consumption. The volume of the incremental data can be quite large, e.g., 3 million business transactions generated in a few minutes, during the peak usage of the source systems. 3) maintain the data model of the data access layer as close as the source data models so that the codebases of the existing reports and queries for APIs don’t have to go through complete or significant overhaul 4) the data access layer can support high performance complex queries (e.g., many joins, filters, sorts, grouping, etc.) against large volume of structured data, and, produce the resultant dataset to the consumers in sub-seconds time frame  Demonstrate a complete use case with a high performance query run against the full data volume taken from an existing large legacy production system Sambit Banerjee Page 3
  • 4. POC Use Case
    Department XYZ has Managers and Agents to manage the Accounts of millions of Customers. The Agents and the Managers run an operational report multiple times throughout a business day to monitor the status of various transactions on the Customer accounts, take appropriate actions, and report the same to the sr. management. The distribution of the subset of the overall operational workload within the XYZ department is shown in the table below.
    Each execution of the selected operational report (based on an Oracle SQL query – ref: Appendix A) runs against billions of records in the corresponding Oracle database of the legacy system, and typically completes in 6 to 8 minutes.
    The goal of the POC was to –
    a) build the same volume of dataset, while keeping the same data model, in a Neo4j Graph database.
    b) replicate the same Oracle SQL query in a Neo4j Cypher query (ref: Appendix B) with the same logic, and evaluate the performance of the Neo4j Cypher query against that of the Oracle SQL query. In order to replicate a similar operational scenario, the Neo4j query should run with different degrees of session concurrency, with each session representing either a Manager or an Agent.
    c) after loading the initial data volume in the Neo4j Graph database, add to it the incremental data generated in the legacy Oracle database during the peak processing window, and assess the load performance of the incremental data in the Neo4j Graph database.

    Manager  Agent    Customer Count by Agent  Account Count by Agent  Customer Count by Manager  Account Count by Manager
    M1       M1-A1    151                      313,692                 1,081                      2,336,592
             M1-A2    211                      441,344
             M1-A3    200                      441,744
             M1-A4    203                      373,115
             M1-A5    154                      211,451
             M1-A6    23                       7,816
             M1-A7    139                      547,430
    M2       M2-A1    27                       268,540                 235                        4,764,458
             M2-A2    10                       126,531
             M2-A3    21                       400,615
             M2-A4    44                       954,093
             M2-A5    41                       1,651,378
             M2-A6    92                       1,363,301
    M3       M3-A1    184                      435,564                 920                        10,045,642
             M3-A2    16                       455,614
             M3-A3    34                       875,483
             M3-A4    58                       1,358,458
             M3-A5    52                       478,781
             M3-A6    59                       3,214,290
             M3-A7    6                        20,415
             M3-A8    248                      1,343,706
             M3-A9    219                      838,533
             M3-A10   44                       1,024,798
  • 5. POC Activities
    The POC activities, at a high level, included the following –
    a) Build the Neo4j environment and a Neo4j Graph database in AWS, as it was quicker to adjust the size of the Neo4j database and runtime environment after different tests.
    b) Develop 20+ ETL processes to extract the target dataset (17 tables and 356 columns - with appropriate data masking) from the legacy Oracle database – for both initial and incremental data.
    c) Develop 40+ Unix Shell and Cypher scripts to load initial and incremental data in the Neo4j graph database.
    d) Develop a Cypher query with the same logic as the Oracle SQL query. The complexity of this query involves multi-level equi-joins and outer joins on 8 entities (i.e., Oracle tables / Graph node types), evaluation and transformation of data items in the filters and expressions, analytical functions to rank records within subgroups, union, and multi-column sorting with uniqueness. [ref: Appendix A & B]
    e) Develop Unix shell scripts and Python programs to run the Cypher query with various query parameters and different degrees of concurrency (e.g., multi-threading), with extensive logging of runtime statistics, and to capture and consolidate test results. Developing these scripts and programs was needed in lieu of using LoadRunner type tools. [Why? It’s a different story!] A sketch of such a test driver is shown below.
    f) Extract data from the legacy Oracle database and load the same in the Neo4j graph database. The initial data loading exercise spanned a few weeks as certain data issues were found and corrected and some load scripts were fine-tuned, after which the extraction and the load processes started all over – a few times.
    g) Conduct multiple tests to run the Cypher query with different parameters, capture performance statistics, and analyze.
    h) Conduct multiple tests to extract and load incremental data, capture performance statistics, and analyze.
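    To illustrate item e), here is a minimal sketch of the kind of concurrent query test driver described above, using the official neo4j Python driver and a thread pool. It is not the POC's actual code: the bolt URI, credentials, query file name, and the session parameter values are illustrative placeholders; only the parameter names ($mgr_list, $mgr, $agnt, $dt) mirror the query in Appendix B.

        # Minimal sketch of a concurrent Cypher query test driver (illustrative only).
        import time
        import logging
        from concurrent.futures import ThreadPoolExecutor
        from neo4j import GraphDatabase

        logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

        URI = "bolt://localhost:7687"            # placeholder endpoint
        AUTH = ("neo4j", "********")             # placeholder credentials

        # One entry per simulated Manager/Agent session (values are made up).
        SESSIONS = [
            {"mgr_list": ["ALL"], "mgr": "M1", "agnt": "ALL",   "dt": "2020-03-31"},
            {"mgr_list": ["ALL"], "mgr": "M1", "agnt": "M1-A1", "dt": "2020-03-31"},
            # ... one dict per concurrent session
        ]

        def run_one(driver, cypher, params):
            """Run the report query in its own session and log the elapsed time."""
            start = time.perf_counter()
            with driver.session() as session:
                rows = list(session.run(cypher, **params))   # materialize the result set
            elapsed = time.perf_counter() - start
            logging.info("mgr=%s agnt=%s rows=%d elapsed=%.2fs",
                         params["mgr"], params["agnt"], len(rows), elapsed)
            return elapsed

        def main():
            with open("report_query.cypher") as qf:          # query from Appendix B (placeholder file name)
                cypher = qf.read()
            driver = GraphDatabase.driver(URI, auth=AUTH)
            try:
                with ThreadPoolExecutor(max_workers=len(SESSIONS)) as pool:
                    futures = [pool.submit(run_one, driver, cypher, p) for p in SESSIONS]
                    timings = [fut.result() for fut in futures]
                logging.info("max=%.2fs avg=%.2fs", max(timings), sum(timings) / len(timings))
            finally:
                driver.close()

        if __name__ == "__main__":
            main()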
  • 6. Neo4j POC Environment
    Hardware:
    • Single instance Neo4j Graph database hosted in an AWS EC2 instance
    • AWS EC2 instance: m5d.24xlarge - 384 GB RAM, 96 vCPU cores, 5.3 TB SSD & NVMe disks
    Software:
    • Neo4j 3.5.3 Enterprise Edition
    • Python 3.6
      o Used for developing test driver programs to orchestrate concurrent executions of a large number of Cypher query and update scripts against the Neo4j graph database, with extensive logging and consolidation of test results
    Neo4j Graph Instance:
    • Neo4j JVM heap = 31 GB
    • Neo4j Pagecache = 317 GB
    • Size of the Neo4j data store (on disk) after loading initial test data = 2.4 TB
  • 7. Neo4j POC Graph Data Elements by numbers, as loaded
    Type of Nodes: 17
    Type of Relationships: 27
    Number of Nodes: 4,880,036,997
    Number of Relationships: 6,375,650,061
    Number of Properties: 356

    Node Label      Count
    TxnType1        98,409,635
    Exception       7,776,858
    LkUp            175
    TxnType2        1,031,355,856
    AcctState       1,132,603,504
    AcctMap         17,146,692
    AcctSmry        17,146,692
    AcctAttrib      25,543,289
    Account         17,146,692
    TxnType3        1,267,954,368
    Customer        93,838
    Personnel       86,136
    TxnType4        15,697,975
    TxnType5        1,213,105,143
    Calendar        34,555
    AcctProp        17,146,692
    LostTxns        17,146,692

    Relationship Type    Count
    Reln_E               1,642,121
    Reln_F               98,409,635
    Reln_G               7,776,858
    Reln_D1              123,251
    Reln_D2              7,776,858
    Reln_H               1,132,603,504
    Reln_I               1,132,603,504
    Reln_J               17,146,692
    Reln_K               17,146,692
    Reln_B2              2,045,166
    Reln_C1              17,146,692
    Reln_C2              2,045,166
    Reln_L               7,776,858
    Reln_M               7,776,858
    Reln_B1              17,146,692
    Reln_N               25,543,289
    Reln_O               2,044,778
    Reln_P               17,146,692
    Reln_A2              5,513
    Reln_Q               1,031,355,856
    Reln_A1              111,738
    Reln_R               1,213,105,143
    Reln_S               15,697,975
    Reln_T               1,565,134,368
    Reln_U               17,146,692
    Reln_V               17,146,692
    Reln_W               2,044,778
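    As a side note, here is a minimal sketch of one way such counts can be collected, assuming the APOC library (already used by the query in Appendix B) is installed; the connection details are placeholders, and the exact field names returned by apoc.meta.stats() should be verified against the installed APOC version:

        # Sketch: read node/relationship counts from the store statistics via APOC,
        # which avoids scanning billions of nodes with MATCH (n).
        from neo4j import GraphDatabase

        driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "********"))
        with driver.session() as session:
            stats = session.run("CALL apoc.meta.stats()").single()
            print("nodes:", stats["nodeCount"])
            print("relationships:", stats["relCount"])
            print("nodes by label:", stats["labels"])
            print("relationships by type:", stats["relTypesCount"])
        driver.close()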
  • 8. Neo4j POC Graph Data Model
  • 9. Neo4j POC – Cypher Query Test
    Test 1
    • 10 concurrent sessions –
      • 1 session for 1 Manager for all Accounts in the corresponding portfolio
      • 9 sessions for 9 Agents for all Accounts in their individual portfolios
    • These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
  • 10. Neo4j POC – Cypher Query Test (contd..)
    Test 2
    • 3 concurrent sessions with 3 Managers running the query concurrently for all Accounts in their individual portfolios
    • These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
  • 11. Neo4j POC – Cypher Query Test (contd..)
    Test 3
    • 15 concurrent sessions with 15 Agents running the query concurrently for all Accounts in their individual portfolios
    • These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
  • 12. Neo4j POC – Cypher Query Test (contd..)
    Test 4
    • 26 concurrent sessions
      • 3 sessions for 3 Managers for all Accounts in their corresponding portfolios
      • 23 sessions for 23 Agents for all Accounts in their individual portfolios
    • These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
  • 13. Neo4j POC – Cypher Query Test – Conclusion
    Overall, the Neo4j Graph query tests performed much better than expected. Comparing the query performance of a single-instance Neo4j Graph database with that of the legacy Oracle database, it was observed that –
    • All Neo4j Cypher queries, under different degrees of concurrency, completed in less than 25 seconds. Most of the Cypher queries completed in less than 5 seconds, with ‘All’ or ‘Partial’ target dataset cached in memory.
    • With ‘All’ of the target dataset cached in memory (equivalent of Oracle warm cache), Neo4j performed consistently under 15 seconds, whereas it took ~46 seconds for the corresponding Oracle SQL query to complete with warm cache in the legacy Oracle database.
    • With some of the target dataset cached in memory (equivalent of Oracle cold cache), Neo4j performed much better than Oracle. All Neo4j Cypher queries with cold caching performed consistently under 20 seconds, whereas the same query with cold cache in the legacy Oracle database completed in 6 to 8 minutes. [Note – the legacy Oracle database was hosted on a physical server with 990 GB RAM and 40 physical CPUs]
    • CPU utilization of the AWS EC2 instance hosting the Neo4j Graph database during these tests was low. Out of the 96 vCPUs, the total CPU consumption of that EC2 instance didn’t exceed 20%, even for the test case with 26 concurrent query sessions.
  • 14. Neo4j POC – Graph Update Test
    Identifying Incremental Data
    A certain type of business transaction, which made up almost 85% of all business transactions carried out in the legacy Oracle database, was selected in order to evaluate the performance of updating the Neo4j Graph database with incremental data generated in the legacy Oracle database.
    • This was to ensure that the ‘Graph Update use case’ represented the high volume business transactions at a sustained minimum peak rate of 650 business transactions / second during the daily peak window of the corresponding legacy system.
    • 1 business transaction of this type consisted of multiple SQL insert and update operations to 9 Oracle tables corresponding to the Neo4j node labels – Account, AcctMap, AcctState, AcctSmry, TxnType2, TxnType3, TxnType5, LostTxns, Exception.
    Collecting Incremental Data
    • Data related to 2.9 million business transactions of the selected type was collected from the busiest 4-hour window of the corresponding legacy production system.
    • In order to maintain ACID compliance in the Neo4j Graph database with respect to the corresponding business transactions of the legacy system, the test data was packaged into a unit called ‘Transaction Group’, where 1 ‘Transaction Group’ contained 1000 business transactions of the selected type. The following table shows the distribution of Oracle records in 1 ‘Transaction Group’ –

    1 Transaction Group = 1000 Business Transaction records from the legacy Oracle database
                                                    Min    Max     Average
    # of records for Oracle SQL Insert operations   70     1494    788
    # of records for Oracle SQL Update operations   729    16482   8242
  • 15. Neo4j POC – Graph Update Test (contd..)
    Equivalent Neo4j Graph operations to consume Incremental Data
    As the Neo4j Graph data model was created exactly the same as the legacy Oracle data model, the following Neo4j Graph operations took place when updating the Neo4j Graph data for 1 Business Transaction of the selected type –

    Neo4j Graph operations for 1 Business Transaction
    Node Label   Create Node   Attributes Created per New Node   Create Relationships   Update Node   Attributes Changed per Update
    Account      -             -                                 -                      1             1
    AcctMap      -             -                                 -                      1             3
    AcctState    1             81                                2                      1             2
    AcctSmry     -             -                                 -                      1             1
    TxnType2     1             6                                 1                      -             -
    TxnType3     1             18                                1                      -             -
    TxnType5     1             6                                 1                      -             -
    LostTxns     -             -                                 -                      1             2
    Exception    1             69                                4                      1             2
    Total        5             180                               9                      6             11

    Total Neo4j Graph operations for the target of 650 Business Transactions per second:
    Create Node = 3,250   Attributes Created per New Node = 117,000   Create Relationships = 5,850   Update Node = 3,900   Attributes Changed per Update = 7,150
  • 16. Neo4j POC – Graph Update Test (contd..)
    Tests for Updating the Neo4j Graph Database with Incremental Data:
    • Multiple tests were conducted to load the incremental data in the Neo4j Graph database, sequentially (i.e., the test driver program running the tests in a single thread) as well as with varying degrees of parallelism (i.e., the test driver program running the tests via multi-threaded concurrent child processes).
    • Update tests were carried out with a subset of the test data (500 Transaction Groups), and then with all test data (i.e., 2,909 Transaction Groups for 2.9 million Business Transactions).
    • The Transaction Groups for each test run were distributed equally by the test driver program to all concurrently running threads at any given point of time.
    • Each running thread pre-established a dedicated connection to the Neo4j Graph database in order to have its own dedicated session to run the corresponding Neo4j Cypher insert/update statements for the Business Transactions allocated to it.
    • Due to the mutually exclusive nature of the Business Transactions, each thread ran independently of the other parallel threads without any sort of application-induced contention.
    • The Neo4j Cypher statements (insert and update) were fine-tuned a few times in order to improve the performance and achieve the results shown in the next few pages. A sketch of this load pattern is shown below.
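    A minimal sketch of the load pattern described above, again using the neo4j Python driver; it is not the POC's actual script. Each worker thread gets its own session, and each Transaction Group is applied inside one write transaction so that it commits (or rolls back) as a unit. The Cypher statement, the property names, and the Reln_X relationship type are illustrative placeholders for the real per-label insert/update statements.

        from concurrent.futures import ThreadPoolExecutor
        from neo4j import GraphDatabase

        URI, AUTH = "bolt://localhost:7687", ("neo4j", "********")   # placeholders

        # Placeholder Cypher for one Business Transaction: update the Account and
        # attach a new AcctState node. The real load covered all 9 node labels.
        APPLY_TXNS = """
        UNWIND $txns AS t
        MATCH (a:Account {id1: t.acct_id})
        SET a.dt3 = t.dt3
        CREATE (s:AcctState {id1: t.state_id, cd1: t.cd1})
        CREATE (a)-[:Reln_X]->(s)
        """

        def apply_group(tx, group):
            """All Cypher for one Transaction Group runs inside one transaction."""
            tx.run(APPLY_TXNS, txns=group["business_transactions"])

        def load_shard(driver, groups):
            # One dedicated session per worker thread; each group commits as a unit.
            with driver.session() as session:
                for group in groups:
                    session.write_transaction(apply_group, group)

        def run(groups, threads=10):
            driver = GraphDatabase.driver(URI, auth=AUTH)
            try:
                # Distribute the Transaction Groups evenly across the worker threads.
                shards = [groups[i::threads] for i in range(threads)]
                with ThreadPoolExecutor(max_workers=threads) as pool:
                    list(pool.map(lambda shard: load_shard(driver, shard), shards))
            finally:
                driver.close()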
  • 17. Neo4j POC – Graph Update Test (contd..)
    Test Results:
    The following table shows the performance metrics of the Neo4j Graph update tests, which fell well short of the target of loading 650 Business Transactions per second.

    Test#   # of Threads (Degree of parallelism)   Elapsed Time (hh:mm:ss)   Elapsed Time (Seconds)   # of Transaction Groups Processed   Total # of Business Transactions Processed   # of Business Transactions Processed per Second
    1       1                                      0:40:49                   2,448.69                 500                                 500,000                                      204
    2       10                                     0:37:42                   2,262.03                 500                                 500,000                                      221
    3       50                                     0:39:36                   2,375.91                 500                                 500,000                                      210
    4       100                                    0:40:28                   2,428.37                 500                                 500,000                                      206
    5       1                                      3:13:21                   11,601.48                2,909                               2,909,000                                    251
    6       10                                     2:52:06                   10,325.99                2,909                               2,909,000                                    282
    7       50                                     3:02:41                   10,960.63                2,909                               2,909,000                                    265
    8       100                                    3:06:14                   11,174.25                2,909                               2,909,000                                    260
  • 18. Neo4j POC – Graph Update Test (contd..)
    Test Results (contd..)
    The following table compares the Neo4j Graph operations between the target of 650 Business Transactions per second and the achieved maximum of 282 Business Transactions per second.

    Neo4j Graph operations                             Create Node   Attributes Created per New Node   Create Relationships   Update Node   Attributes Changed per Update
    Target = 650 Business Transactions per second      3,250         117,000                           5,850                  3,900         7,150
    Achieved = 282 Business Transactions per second    1,410         50,760                            2,538                  1,692         3,102

    Key observations:
    • Although the AWS EC2 instance hosting the single Neo4j Graph database instance had 96 vCPUs, the max CPU consumption did not exceed 30% of the total CPU capacity during the load of incremental data.
    • The degree of parallelism of 10, i.e., updating the Neo4j Graph database via 10 concurrent connections, achieved the optimal performance for this test.
  • 19. Neo4j POC – Graph Update Test (contd..)
    Performance of Neo4j Graph Insert and Update operations at a glance
    This chart shows the consistency of performance of the Neo4j Graph database insert and update operations for most Business Transactions in relation to the corresponding footprint of incremental data packed in those Business Transactions.
  • 20. Neo4j POC – Graph Update Test – Conclusion
    In summary, the Neo4j Graph update test for this POC achieved a throughput of 282 Business Transactions per second against the target throughput of 650 Business Transactions per second. However, in my opinion, Neo4j did reasonably well considering that the following key constraints were somewhat unfair to it –
    1) The Neo4j Graph data model was kept the same as the data model of the legacy Oracle database.
      • The legacy Oracle data model itself needed a lot of improvements to operate optimally, so Neo4j can’t really be blamed for that.
      • Normally, the transition from an RDBMS data model to a Graph data model involves quite a bit of optimization to best realize the benefit of a Graph database. Due to the criteria set for this POC, no data model optimization was done.
    2) All test data for the selected use case was stored in a single Neo4j Graph instance.
      • In contrast, the legacy Oracle database stored its large volume of data (over 20 TB) in hundreds of partitions at the file system level, which is a key factor for achieving high throughput of write transactions against any database.
      • The current architecture of Neo4j, on the other hand, does not offer the ability for a single Neo4j Graph instance to store data in partitions at the file system level. This creates a significant limitation for a single Neo4j Graph instance to achieve high throughput of write transactions against a large data volume.
      • However, it is possible to partition a large dataset among multiple Neo4j Graph database instances instead of a single one, and then aggregate the resultant datasets from the Neo4j queries run on those multiple instances at the application level to meet the business requirements, as sketched below. Undoubtedly, this would require additional work and infrastructure footprint.
      [Note: Neo4j v4.x has introduced a similar data partitioning feature via multiple Neo4j Graph database instances, but it is not quite there yet in terms of offering all types of out-of-the-box aggregate / analytical functions that can aggregate data across multiple Neo4j Graph database instances.]
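    A minimal sketch of that partition-and-aggregate idea, assuming the Accounts are sharded across several independent Neo4j instances so that no Account spans two shards (which makes the per-shard counts safe to add up). The shard endpoints and credentials are placeholders; the column names match the final RETURN of the query in Appendix B.

        from collections import defaultdict
        from concurrent.futures import ThreadPoolExecutor
        from neo4j import GraphDatabase

        SHARD_URIS = ["bolt://neo4j-shard-1:7687", "bolt://neo4j-shard-2:7687"]  # placeholders
        AUTH = ("neo4j", "********")

        def query_shard(uri, cypher, params):
            """Run the report query against one shard and return its rows."""
            driver = GraphDatabase.driver(uri, auth=AUTH)
            try:
                with driver.session() as session:
                    return [r.data() for r in session.run(cypher, **params)]
            finally:
                driver.close()

        def fan_out(cypher, params):
            with ThreadPoolExecutor(max_workers=len(SHARD_URIS)) as pool:
                partials = pool.map(lambda uri: query_shard(uri, cypher, params), SHARD_URIS)
            # Application-level aggregation: sum the per-shard account counts per
            # (manager, agent, category) group.
            totals = defaultdict(int)
            for rows in partials:
                for row in rows:
                    key = (row["mgr_nme"], row["agnt_nme"], row["catg_1_label"], row["catg_2_label"])
                    totals[key] += row["num_accts"]
            return totals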
  • 21. Neo4j Graph POC – Final Thoughts
    So, what’s the final verdict?
    In my observation, it is certainly possible to use a single Neo4j Graph database instance to build a low-latency read-only data access layer for fast data access by various types of consumers such as APIs, real-time reporting (operational, management, ad-hoc), analytics, dashboards, etc. In terms of the performance of concurrent complex queries against a large data volume, this POC demonstrated that Neo4j passed with flying colors.
    Hydrating a single Neo4j Graph database instance with bulk incremental data from various sources in near-real time is also possible, by –
    • establishing an optimized data model, especially when transitioning from legacy RDBMS systems. Keep only those data attributes in the Neo4j Graph database that are frequently accessed by the consumers of this fast data access layer.
    • evaluating the maximum consumption throughput capacity of the single Neo4j Graph database instance as applicable to the selected use cases. Use those metrics among the key considerations for sizing the Neo4j environment (i.e., max volume of data to store in the Neo4j database instance, CPU and memory, etc.).
    • determining optimal patterns and frequencies for loading incremental data from the source systems. Evaluate the usage patterns of the incremental data, and prioritize the load sequences of the associated nodes / relationships / attributes. For example, if an incremental dataset contains updates to 50 attributes, and only 10 of those attributes are accessed by the consumers in near-real time while the remaining 40 attributes are accessed only by the nightly batch/report jobs, then those 10 attributes may be prioritized for the real-time load, and a lazy load of the remaining 40 attributes may be implemented, as illustrated below.
    With that, goodbye for now and take care!
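    A hedged illustration of that prioritization idea; the attribute names and the hot/cold split below are purely made up, and only the UNWIND/MATCH/SET pattern is the point:

        # Hot attributes are pushed in small near-real-time transactions; the rest are
        # applied later by a lower-priority batch path.
        HOT_ATTRS = ["status_cd", "amt1", "dt1"]      # assumed to be read in near-real time
        COLD_ATTRS = ["note_txt", "batch_flag"]       # assumed to be read only by nightly jobs

        def build_update(attrs):
            """Build a Cypher statement that SETs only the chosen subset of attributes."""
            set_clause = ", ".join(f"a.{x} = row.{x}" for x in attrs)
            return ("UNWIND $rows AS row\n"
                    "MATCH (a:Account {id1: row.id1})\n"
                    f"SET {set_clause}")

        HOT_UPDATE = build_update(HOT_ATTRS)     # run every few minutes with the incremental feed
        COLD_UPDATE = build_update(COLD_ATTRS)   # run later, e.g. before the nightly batch window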
  • 23. Appendix A – Oracle SQL Query
    Parameters: {PARAM1}, {PARAM2}

    WITH t1_list AS
      (SELECT c1 FROM Customer WHERE c1 IN (....))
    , p_dt AS
      (SELECT TRUNC(NVL(MAX(CAST(a.dt1 AS DATE)), SYSDATE)) AS c_dt FROM tmp_dt a)
    , mgr_agent AS
      (SELECT *
         FROM (SELECT c.c1, c.c2,
                      trim(c.c3) || nvl2(c.c4, ' ' || trim(c.c4) || ' ', ' ') || trim(c.c5) AS cust_pers_alias1
                 FROM Customer c
                 JOIN cust_pers cp ON (c.k1 = cp.k1 AND cp.type_cd = 2
                                       AND cp.eff_dt <= trunc(sysdate)
                                       AND (trunc(sysdate) < cp.exp_dt OR cp.exp_dt IS NULL))
                 JOIN Personnel p ON (cp.k1 = c.k1))
        WHERE c.c1 IN ("{PARAM1}")
          AND (:agnt = 'ALL' OR :agnt = agnt_nme)
          AND (:mgr = 'ALL' OR :mgr = mgr_nme))
    , cal_rec AS
      (SELECT cl.col1, ..., ..., ..., ..., ..., colN
         FROM Calendar cl
        WHERE "{PARAM2}" BETWEEN cl.dt1 AND dt2)
    , hsr AS
      (SELECT bel.k2, ma.c1, ma.c2, ma.c3, ma.c4,
              bel.c1 AS alias1, bel.c2 AS alias2, bel.c3 AS alias3, bel.c4 AS alias4, ..., bel.c10,
              CASE bel.cd1
                WHEN 1 THEN 0
                            + CASE WHEN bel.cd3 IN (n1, n2, n3, n4, n5, n6) THEN 0 ELSE 4 END
                            + CASE WHEN bel.cd8 IN (2, 3) THEN 1
                                   WHEN bel.cd8 = 0 THEN 2
                                   WHEN bel.cd8 = 6 THEN 3
                                   ELSE 4 END
                WHEN 3 THEN 9
              END AS sort_order,
              ABS(bel.amt1),
              lt.seq_no1 AS expr_pri,
              CASE bel.cd4 WHEN 1 THEN '...' WHEN 3 THEN '...' END AS cd4_type,
              lf.cd3 AS someType,
              CASE bel.cd4
                WHEN 1 THEN CASE WHEN bel.cd7 IN (n1, n2, n3, n4, n5, n6) THEN '...' ELSE '....' END
                WHEN 3 THEN CASE WHEN ABS(bel.amt8) <= 1    THEN '.....'
                                 WHEN ABS(bel.amt8) <= 100  THEN '.....'
                                 WHEN ABS(bel.amt8) <= 1000 THEN '.....'
                                 ELSE '.....' END
              END AS cat_2
         FROM mgr_agent ma
         JOIN Customer c ON (ma.c2 = c.c2)
         JOIN Exception bel ON (sa.c2 = bel.c2)
         JOIN Account a ON (a.c1 = bel.c1)
         JOIN cal_rec cr ON (cr.id1 = bel.id1)
         JOIN AcctMap am ON (am.id1 = a.id1)
         JOIN AcctAttrib lf ON (lf.id2 = am.id2)
         JOIN TxnType3 tt ON (bel.id4 = tt.id4)
        WHERE bel.cd9 = 6
          AND bel.cd7 = 1
          AND (bel.cd4 = 1 OR bel.cd4 = 3)
          AND ((:sType = 'ALL') OR (lf.s_cd IN (SELECT lst1.s_cd FROM lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0)))
          AND ((:rType = 'ALL') OR (a.r_cd IN (SELECT lrt1.r_cd FROM lkup2 lrt1 WHERE lrt1.r_desc = :rType AND rownum > 0)))
          AND bel.id3 = bel.id5)
    , lost_txns AS
      (SELECT a.id1, ma.c1, ma.c2, ma.c3, ma.c4, a.id2,
              ma.c7 AS alias1, a.c2 AS alias2, NULL AS alias3, 7 AS alias4, '.....' AS alias5,
              CASE WHEN (dt1 > dt11 AND dt1 <= r_e_dt)
                   THEN CASE WHEN (a.dt1 >= cr.dt2)                                  THEN '1...'
                             WHEN (at_rcl.dt1 >= cr.dt4 AND at_rcl.t_cd = 33)        THEN '2....'
                             WHEN (at_rcl.dt1 >= cr.dt4 AND at_rcl.t_cd = 32)        THEN '3....'
                             WHEN (at_rcl.dt1 >= cr.dt4 AND at_rcl.t_cd IN (22, 26)) THEN '4....'
                             ELSE '5....' END
                   ELSE NULL
              END AS alias6,
              NULL AS alias7, NULL AS alias8, asmry.id3 AS alias9, 10 AS alias10, NULL AS alias11,
              NULL AS excp_pri, '.....' AS pr_type,
              lxn.sType_cd AS sType, lxn.rType_cd AS rType,
              CASE WHEN lxn.sType_cd IN (2, 3) THEN '11...' ELSE '12....' END AS catg_1
         FROM mgr_agent ma
        CROSS JOIN p_dt pd
         JOIN LostTxns lxn ON (ma.id1 = lxn.id4)
         JOIN AcctSmry asmry ON (lxn.id1 = lcrps.id1)
         JOIN cal_rec cr ON (asmry.id1 = cr.id1 AND pd.c_dt > cr.dt3 AND pd.c_dt <= cr.dt4)
         JOIN Account a ON (lxn.id1 = a.id1 AND a.cd10 = 1)
         LEFT OUTER JOIN TxnType1 at_rcl ON (lxn.id1 = at_rcl.id2
                                             AND at_rcl.cd3 IN (m1, m2, m3, m4)
                                             AND at_rcl.cd6 = 14
                                             AND at_rcl.dt2 BETWEEN cr.st_dt AND cr.e_dt)
        WHERE lxn.ind2 = 'N'
          AND asmry.ind2 = 'N'
          AND lxn.cd9 <> 2
          AND ((:sType = 'ALL') OR (lxn.sType_cd IN (SELECT lst1.s_cd FROM lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0)))
          AND ((:rType = 'ALL') OR (a.r_cd IN (SELECT lrt1.r_cd FROM lkup2 lrt1 WHERE lrt1.r_desc = :rType AND rownum > 0))))
    , all_ex AS
      (SELECT a.* FROM hsr a
       UNION ALL
       SELECT b.*, aliasN AS catg_2 FROM lost_txns b)
    /* Main query */
    SELECT col1, col2, col3, ...,
           substr(catg_1, 3) AS catg_1,
           substr(catg_2, 2) AS catg_2,
           count(*) AS num_accts,
           count(DISTINCT ae2.id1) AS num_distinct_accts,
           substr(catg_1, 1, 2) AS catg_1_order,
           substr(catg_2, 1, 1) AS catg_2_order
      FROM (SELECT ae.*,
                   ROW_NUMBER() OVER (PARTITION BY ae.id1 ORDER BY ae.excp_pri DESC NULLS LAST) AS rn
              FROM all_ex ae) ae2
     WHERE ae2.rn = 1
       AND (:eType = 'ALL' OR :eType = ae2.pr_type)
     GROUP BY col1, col2, col3, col4, col5, substr(catg_1, 1, 2), substr(catg_2, 1, 1);

    NOTE: This SQL query was actually 6 pages long. I have shortened it to fit here by omitting a lot of column names and textual values (re: '...' items), replacing some of the actual column names with fictitious names, etc. The same applies to the Neo4j Cypher query in Appendix B.
  • 24. Appendix B – Neo4j Cypher Query
    CALL apoc.cypher.run("
      MATCH (ps:Personnel)-[:Reln_A2]->(c:Customer)-[:Reln_W]->(lxn:LostTxns)<-[:Reln_O]-(a:Account)-[:Reln_B1]->(asmry:AcctSmry)-[:Reln_C2]->(cs:Calendar)
      WHERE ('ALL' IN $mgr_list_in OR c.c_no IN $mgr_list_in)
        AND ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in)
        AND ('ALL' = $agnt_in OR
             CASE WHEN ps.c4 IS NULL OR ps.c4 = ''
                  THEN trim(ps.c3) + ' ' + trim(ps.c5)
                  ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5)
             END = $agnt_in)
        AND (cs.i_e_dt <= $dt_in AND $dt_in < cs.e_dt)
      OPTIONAL MATCH (a)-[:Reln_E]->(at:TxnType1)
      WHERE (cs.s_dt <= at.dt1 AND at.dt1 < cs.e_dt)
      RETURN a.id1 AS id1, -1 AS excp_pri, c.col1 AS col1, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme,
             a.id2 AS id2, c.c_no, 0 AS pr_type,
             CASE WHEN (a.dt3 >= cs.s_dt)                          THEN {srt: 1, catg_label: '....'}
                  WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 33)        THEN {srt: 2, catg_label: '....'}
                  WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 32)        THEN {srt: 3, catg_label: '....'}
                  WHEN (at.dt1 >= cs.s_dt AND at.t_cd IN [22, 26]) THEN {srt: 4, catg_label: '....'}
                  ELSE {srt: 5, catg_label: '....'}
             END AS catg_1,
             {srt: 1, catg_label: ''} AS catg_2,
             a.r_cd AS r_cd
      UNION ALL
      MATCH cs_bel_lt_paths = (cs:Calendar)<-[:Reln_G]-(bel:Exception)-[:Reln_D1]->(lt:TxnType3)
      WHERE (cs.s_dt <= $dt_in AND $dt_in < cs.e_dt)
        AND bel.id4 = bel.id7
      WITH lt.id1 AS id1, cs_bel_lt_paths
      ORDER BY lt.s_no DESC
      WITH id1, COLLECT(cs_bel_lt_paths)[..1] AS ltst_cs_bel_lt_paths
      UNWIND ltst_cs_bel_lt_paths AS p
      WITH id1, nodes(p) AS ns
      WITH id1, ns[0] AS cs, ns[1] AS bel, ns[2] AS lt
      MATCH (lt)<-[:Reln_T]-(a:Account)<-[:Reln_J]-(am:AcctMap)-[:Reln_K]->(lf:AcctAttrib),
            (am)<-[:Reln_U]-(c:Customer)<-[:Reln_A2]-(ps:Personnel)
      WHERE ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in)
        AND ('ALL' = $agnt_in OR
             CASE WHEN ps.c4 IS NULL OR ps.c4 = ''
                  THEN trim(ps.c3) + ' ' + trim(ps.c5)
                  ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5)
             END = $agnt_in)
      RETURN lt.id1 AS id1, lt.s_no AS excp_pri, c.nme AS nme, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme,
             a.id2 AS id2, c.c_no, bel.l_cd AS pr_type,
             CASE bel.l_cd
               WHEN 1 THEN CASE WHEN bel.cd3 IN [n1, n2, n3, n4, n5, n6] THEN {srt: 21, catg_label: '....'}
                                ELSE {srt: 22, catg_label: '....'} END
               WHEN 3 THEN CASE WHEN ABS(bel.amt3) <= 1    THEN {srt: 31, catg_label: '....'}
                                WHEN ABS(bel.amt3) <= 100  THEN {srt: 32, catg_label: '....'}
                                WHEN ABS(bel.amt3) <= 1000 THEN {srt: 33, catg_label: '....'}
                                ELSE {srt: 34, catg_label: '....'} END
             END AS catg_1,
             CASE bel.l_cd
               WHEN 1 THEN CASE WHEN bel.s_cd IN [2, 3] THEN {srt: 1, catg_label: '....'}
                                ELSE {srt: 2, catg_label: '....'} END
             END AS catg_2,
             a.r_cd AS r_cd
      ", {dt_in: date($dt), mgr_list_in: $mgr_list, mgr_in: $mgr, agnt_in: $agnt}) YIELD value
    WITH value AS v
    ORDER BY v.id1, v.excp_pri DESC
    WITH v.id1 AS id1, COLLECT(v)[..1] AS v0
    UNWIND v0 AS row
    RETURN row.mgr_nme AS mgr_nme, row.agnt_nme AS agnt_nme, row.c_no AS c_no, row.c_nme AS c_nme,
           row.excp_pri, row.pr_type AS excp_type,
           row.catg_1.catg_label AS catg_1_label, row.catg_2.catg_label AS catg_2_label,
           COUNT(*) AS num_accts, COUNT(DISTINCT row.id2) AS num_dist_accts