This article describes the findings of an extensive investigative work conducted to explore the feasibility of using a Neo4j Graph Database to build a Fast Data Access Layer with near-real time data ingestion from the underlying source systems.
1. Exploring Neo4j Graph Database to build a Fast Data Access Layer with Near-Real Time Data Ingestion
Sambit Banerjee
05-April-2020
2. Overview
The future state design considerations for the modernization initiatives of large scale legacy systems often require addressing
various patterns for accessing data from the backend data sources in near-real time by APIs, reports & queries, dashboards,
etc.
Some of the common solutions for such requirements include using –
• replicated reporting databases with optimized data model where data is transformed after replication from the
source databases
• data virtualization (with some degree of local caching of data) / data federation
These solutions work in many cases, depending on the degree of acceptance from the stakeholders.
However, besides the challenge of fulfilling the requirement of accessing data in near-real time, all of these solutions have
certain limitations in terms of initial implementation effort, impact on the consumer systems, and ongoing
management of the same.
While working on such a modernization initiative for a large-scale legacy system for one of my clients a few months ago,
exploring a different approach to address some of these challenges was well in order, and it was decided to
conduct an extensive POC with the Neo4j Graph database.
It was a great exercise for me to explore the Neo4j Graph database in depth. The major part of the overall
outcome was pleasantly favorable, although some limitations were observed as well. This document explains the findings.
NOTE:- In order to protect the business information of my client, I have used placeholders / fictitious names while describing
the use case, data model, data attributes, queries, etc. in this document.
Sambit Banerjee Page 2
3. POC Objective
Investigate the feasibility of using Neo4j Graph database to establish a low-latency read-only data access
layer with the following characteristics –
1) the data from the data access layer is used by different consumers across the enterprise such as APIs, real-time
reporting (operational, management, ad-hoc), analytics, dashboards, and many more.
2) the data is structured, and, the data access layer is continuously hydrated from multiple backend RDBMS (including
large legacy databases) in near-real time, e.g., within 2-5 minutes of the source databases making the incremental
dataset available to the data access layer for consumption. The volume of the incremental data can be quite large,
e.g., 3 million business transactions generated in a few minutes, during the peak usage of the source systems.
3) maintain the data model of the data access layer as close as possible to the source data models so that the codebases of the
existing reports and API queries don't have to go through a complete or significant overhaul
4) the data access layer can support high performance complex queries (e.g., many joins, filters, sorts, grouping, etc.)
against large volumes of structured data, and produce the resultant dataset for the consumers in a sub-second time
frame
Demonstrate a complete use case with a high performance query run against the full data volume taken
from an existing large legacy production system
4. POC Use Case
Department XYZ has Managers and Agents to manage the Accounts of millions of
Customers. The Agents and the Managers run an operational report multiple
times throughout a business day to monitor the status of various transactions on
the Customer accounts, take appropriate actions, and report the same to senior
management. The distribution of a subset of the overall operational workload
within the XYZ department is shown in the table below.
Each execution of the selected operational report (based on an Oracle SQL query
– ref: Appendix A) runs against billions of records in the corresponding Oracle
database of the legacy system, and, typically completes in 6 to 8 minutes.
The goal of the POC was to –
a) build the same volume of dataset, while keeping the same data model,
in a Neo4j Graph database.
b) replicate the same Oracle SQL query in Neo4j Cypher query (ref:
Appendix B) with the same logic, and evaluate the performance of the
Neo4j Cypher query against that of the Oracle SQL query. In order to
replicate the similar operational scenario, the Neo4j query should run
with different degrees of session concurrency, with each session
representing either a Manager or an Agent.
c) after loading the initial data volume in the Neo4j Graph database, add
to it the incremental data generated in the legacy Oracle database
during the peak processing window, and, assess the load performance
of the incremental data in Neo4j Graph database.
| Manager | Agent | Customer Count by Agent | Account Count by Agent | Customer Count by Manager | Account Count by Manager |
|---------|--------|------------------------|------------------------|---------------------------|--------------------------|
| M1 | M1-A1 | 151 | 313,692 | 1,081 | 2,336,592 |
| M1 | M1-A2 | 211 | 441,344 | | |
| M1 | M1-A3 | 200 | 441,744 | | |
| M1 | M1-A4 | 203 | 373,115 | | |
| M1 | M1-A5 | 154 | 211,451 | | |
| M1 | M1-A6 | 23 | 7,816 | | |
| M1 | M1-A7 | 139 | 547,430 | | |
| M2 | M2-A1 | 27 | 268,540 | 235 | 4,764,458 |
| M2 | M2-A2 | 10 | 126,531 | | |
| M2 | M2-A3 | 21 | 400,615 | | |
| M2 | M2-A4 | 44 | 954,093 | | |
| M2 | M2-A5 | 41 | 1,651,378 | | |
| M2 | M2-A6 | 92 | 1,363,301 | | |
| M3 | M3-A1 | 184 | 435,564 | 920 | 10,045,642 |
| M3 | M3-A2 | 16 | 455,614 | | |
| M3 | M3-A3 | 34 | 875,483 | | |
| M3 | M3-A4 | 58 | 1,358,458 | | |
| M3 | M3-A5 | 52 | 478,781 | | |
| M3 | M3-A6 | 59 | 3,214,290 | | |
| M3 | M3-A7 | 6 | 20,415 | | |
| M3 | M3-A8 | 248 | 1,343,706 | | |
| M3 | M3-A9 | 219 | 838,533 | | |
| M3 | M3-A10 | 44 | 1,024,798 | | |
5. POC Activities
The POC activities, at a high level, included the following –
a) Build Neo4j environment and a Neo4j Graph database in AWS as it was quicker to adjust the size of the Neo4j
database and runtime environment after different tests.
b) Develop 20+ ETL processes to extract the target dataset (17 tables and 356 columns - with appropriate data
masking) from the legacy Oracle database – for both initial and incremental data.
c) Develop 40+ Unix Shell and Cypher scripts to load initial and incremental data in the Neo4j graph database.
d) Develop Cypher query with the same logic as the Oracle SQL query. The complexity of this query involves multi-
level equi-joins and outer joins on 8 entities (i.e., Oracle tables / Graph node types), evaluation and transformation
of data items in the filters and expressions, analytical functions to rank records within subgroups, union, multi-
column sorting with uniqueness. [ref: Appendix A & B]
e) Develop Unix shell scripts and Python programs to run the Cypher query with various query parameters and
different degrees of concurrency (e.g., multi-threading), with extensive logging of runtime statistics, and to capture and
consolidate test results. Developing these scripts and programs was needed in lieu of using LoadRunner-type tools.
[Why? That's a different story!]
f) Extract data from the legacy Oracle database and load the same in the Neo4j graph database. The initial data
loading exercise spanned a few weeks as certain data issues were found and corrected, some load scripts were fine-tuned,
after which the extraction and load processes started all over – a few times.
g) Conduct multiple tests to run the Cypher query with different parameters, capture performance statistics, and
analyze.
h) Conduct multiple tests to extract and load incremental data, capture performance statistics, and analyze.
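As a rough illustration of item (e), the concurrency and timing logic of such a test driver can be sketched as below. `run_transaction_group` is a hypothetical placeholder for the code that would execute the Cypher statements through a Neo4j session; the actual database connection is omitted so the sketch stays self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_transaction_group(group_id):
    """Hypothetical stand-in for a thread's dedicated Neo4j session running
    the Cypher insert/update statements for one Transaction Group."""
    time.sleep(0.001)  # simulate the database round trip
    return group_id

def run_test(num_groups, num_threads):
    """Distribute Transaction Groups across worker threads, time the run,
    and compute Business Transactions processed per second."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        done = list(pool.map(run_transaction_group, range(num_groups)))
    elapsed = time.time() - start
    txns = num_groups * 1000  # 1 Transaction Group = 1,000 Business Transactions
    return len(done), txns, txns / elapsed

groups_done, txns, tps = run_test(num_groups=50, num_threads=10)
print(groups_done, txns)  # 50 50000
```

A real driver would replace the sleep with calls through the Neo4j Python driver (e.g., `session.run(...)`) and write per-query runtime statistics to log files for consolidation.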
6. Neo4j POC Environment
Hardware:
• Single instance Neo4j Graph database hosted in an AWS EC2 instance
• AWS EC2 instance:- m5d.24xlarge - 384 GB RAM, 96 cores vCPU, 5.3 TB SSD & NVMe disks
Software:
• Neo4j 3.5.3 Enterprise Edition
• Python 3.6
o Used for developing test driver programs to orchestrate concurrent executions of a large number of
Cypher query and update scripts against the Neo4j graph database, with extensive logging and
consolidation of test results
Neo4j Graph Instance:
• Neo4j JVM heap = 31 GB
• Neo4j Pagecache = 317 GB
• Size of the Neo4j data store (on disk) after loading initial test data = 2.4 TB
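For reference, the heap and page-cache sizes listed above map onto the following `neo4j.conf` settings in Neo4j 3.5. Setting the initial heap equal to the max heap is a common practice, not something recorded in the POC notes.

```
dbms.memory.heap.initial_size=31g
dbms.memory.heap.max_size=31g
dbms.memory.pagecache.size=317g
```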
7. Neo4j POC Graph Data Elements by numbers, as loaded
Types of Nodes: 17
Types of Relationships: 27
Number of Nodes: 4,880,036,997
Number of Relationships: 6,375,650,061
Number of Properties: 356
| Node Label | Count |
|------------|-------|
| TxnType1 | 98,409,635 |
| Exception | 7,776,858 |
| LkUp | 175 |
| TxnType2 | 1,031,355,856 |
| AcctState | 1,132,603,504 |
| AcctMap | 17,146,692 |
| AcctSmry | 17,146,692 |
| AcctAttrib | 25,543,289 |
| Account | 17,146,692 |
| TxnType3 | 1,267,954,368 |
| Customer | 93,838 |
| Personnel | 86,136 |
| TxnType4 | 15,697,975 |
| TxnType5 | 1,213,105,143 |
| Calendar | 34,555 |
| AcctProp | 17,146,692 |
| LostTxns | 17,146,692 |
| Relationship Type | Count |
|-------------------|-------|
| Reln_E | 1,642,121 |
| Reln_F | 98,409,635 |
| Reln_G | 7,776,858 |
| Reln_D1 | 123,251 |
| Reln_D2 | 7,776,858 |
| Reln_H | 1,132,603,504 |
| Reln_I | 1,132,603,504 |
| Reln_J | 17,146,692 |
| Reln_K | 17,146,692 |
| Reln_B2 | 2,045,166 |
| Reln_C1 | 17,146,692 |
| Reln_C2 | 2,045,166 |
| Reln_L | 7,776,858 |
| Reln_M | 7,776,858 |
| Reln_B1 | 17,146,692 |
| Reln_N | 25,543,289 |
| Reln_O | 2,044,778 |
| Reln_P | 17,146,692 |
| Reln_A2 | 5,513 |
| Reln_Q | 1,031,355,856 |
| Reln_A1 | 111,738 |
| Reln_R | 1,213,105,143 |
| Reln_S | 15,697,975 |
| Reln_T | 1,565,134,368 |
| Reln_U | 17,146,692 |
| Reln_V | 17,146,692 |
| Reln_W | 2,044,778 |
9. Neo4j POC – Cypher Query Test
Test 1

10 concurrent sessions –
• 1 session for 1 Manager for all Accounts in the corresponding portfolio
• 9 sessions for 9 Agents for all Accounts in their individual portfolios

These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory.
10. Neo4j POC – Cypher Query Test (contd..)
Test 2

3 concurrent sessions with 3 Managers running the query concurrently for all Accounts in their individual portfolios.

These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory.
11. Neo4j POC – Cypher Query Test (contd..)
Test 3
15 concurrent sessions with 15 Agents running the query concurrently for all Accounts in their individual portfolios
These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
12. Neo4j POC – Cypher Query Test (contd..)
Test 4
26 concurrent sessions
• 3 sessions for 3 Managers for all Accounts in their corresponding portfolios
• 23 sessions for 23 Agents for all Accounts in their individual portfolios
These tests were conducted with all of the target data pre-cached in memory as well as with partial target data cached in memory
13. Neo4j POC – Cypher Query Test – Conclusion
Overall, the Neo4j Graph Query tests performed much better than expected.
Comparing the query performance of a single instance Neo4j Graph database with that of the legacy Oracle database, it was
observed that -
All Neo4j Cypher queries, under different degrees of concurrency, completed in less than 25 seconds. Most of the Cypher
queries completed in less than 5 seconds, with 'All' or 'Partial' target dataset cached in memory.
With ‘All’ target dataset cached in memory (equivalent of Oracle warm cache), Neo4j performed consistently under 15
seconds, whereas it took ~46 seconds for the corresponding Oracle SQL query to complete with warm cache in the legacy
Oracle database
With some of the target dataset cached in memory (equivalent of Oracle cold cache), Neo4j performed much better than
Oracle. All Neo4j Cypher queries with cold caching performed consistently under 20 seconds, whereas the same query with
cold cache in the legacy Oracle database completed in 6 to 8 minutes. [Note – the legacy Oracle database was hosted on a
physical server with 990 GB RAM and 40 physical CPUs.]
CPU utilization of the AWS EC2 instance hosting the Neo4j Graph database during these tests was low. Out of the 96 vCPUs,
the total CPU consumption of that EC2 instance didn’t exceed 20%, even for the test case with 26 concurrent query
sessions.
14. Neo4j POC – Graph Update Test
Identifying Incremental Data
A certain type of business transaction, which made up almost 85% of all business transactions carried out in the legacy Oracle database,
was selected in order to evaluate the performance of updating the Neo4j Graph database with incremental data generated in the
legacy Oracle database.
• This was to ensure that the 'Graph Update use case' represented the high volume business transactions at a sustained minimum
peak rate of 650 business transactions / second during the daily peak window of the corresponding legacy system.
• 1 business transaction of this type consisted of multiple SQL insert and update operations to 9 Oracle tables corresponding to the
Neo4j node labels – Account, AcctMap, AcctState, AcctSmry, TxnType2, TxnType3, TxnType5, LostTxns, Exception.
Collecting Incremental Data
• Data related to 2.9 million business transactions of the selected type was collected from the busiest 4-hour window of the
corresponding legacy production system.
• In order to maintain ACID compliance in the Neo4j Graph database with respect to the corresponding business transactions of
the legacy system, the test data was packaged into a unit called 'Transaction Group', where 1 'Transaction Group' contained 1,000
business transactions of the selected type. The following table shows the distribution of Oracle records in 1 'Transaction Group':
1 Transaction Group = 1,000 Business Transaction records from the legacy Oracle database

| | Min | Max | Average |
|---|-----|-----|---------|
| # of records for Oracle SQL Insert operations | 70 | 1,494 | 788 |
| # of records for Oracle SQL Update operations | 729 | 16,482 | 8,242 |
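The packaging step can be sketched as follows; the record counts per transaction are illustrative, and `chunk` simply slices the extracted Business Transactions into groups of 1,000 so that each group can be applied to Neo4j as one unit:

```python
def chunk(items, size=1000):
    """Package Business Transactions into Transaction Groups of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Illustrative: each business transaction carries some insert and update records.
transactions = [{"id": i, "inserts": 1, "updates": 8} for i in range(2500)]
groups = chunk(transactions)

print(len(groups))                  # 3 groups: 1000 + 1000 + 500
print(sum(len(g) for g in groups))  # 2500
```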
15. Neo4j POC – Graph Update Test (contd..)
Equivalent Neo4j Graph operations to consume Incremental Data
As the Neo4j Graph data model was created exactly the same as the legacy Oracle data model, the following Neo4j
Graph operations took place for updating the Neo4j Graph data for 1 Business Transaction of the selected type -
Neo4j Graph operations for 1 Business Transaction:

| Node Label | Create Node | Attributes Created per New Node | Create Relationships | Update Node | Attributes Changed per Update |
|------------|-------------|--------------------------------|----------------------|-------------|-------------------------------|
| Account | | | | 1 | 1 |
| AcctMap | | | | 1 | 3 |
| AcctState | 1 | 81 | 2 | 1 | 2 |
| AcctSmry | | | | 1 | 1 |
| TxnType2 | 1 | 6 | 1 | | |
| TxnType3 | 1 | 18 | 1 | | |
| TxnType5 | 1 | 6 | 1 | | |
| LostTxns | | | | 1 | 2 |
| Exception | 1 | 69 | 4 | 1 | 2 |
| Total | 5 | 180 | 9 | 6 | 11 |

At the target of 650 Business Transactions per second, this corresponds to 3,250 node creates, 117,000 attributes created, 5,850 relationship creates, 3,900 node updates, and 7,150 attributes changed per second.
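The per-second totals are simply the per-transaction figures scaled by the 650-transactions-per-second target:

```python
# Graph operations for 1 business transaction (from the table above).
per_txn = {"create_node": 5, "attrs_created": 180,
           "create_rel": 9, "update_node": 6, "attrs_changed": 11}

target_tps = 650
per_second = {op: n * target_tps for op, n in per_txn.items()}
print(per_second)
# {'create_node': 3250, 'attrs_created': 117000, 'create_rel': 5850,
#  'update_node': 3900, 'attrs_changed': 7150}
```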
16. Neo4j POC – Graph Update Test (contd..)
Tests for Updating the Neo4j Graph Database with Incremental Data:
Multiple tests were conducted to load the incremental data in the Neo4j Graph database, sequentially (i.e., the test
driver program running the tests in a single thread) as well as with varying degrees of parallelism (i.e., the test driver
program running the tests via multi-threaded concurrent child processes).
Update tests were carried out with a subset of the test data (500 Transaction Groups), and then with all test data (i.e.,
2,909 Transaction Groups for 2.9 million Business Transactions).
The Transaction Groups for each test run were distributed equally by the test driver program to all concurrently running
threads at any given point of time.
Each running thread pre-established a dedicated connection to the Neo4j Graph database in order to have its own
dedicated session to run the corresponding Neo4j Cypher insert/update statements for the Business Transactions
allocated to it.
Due to the mutually exclusive nature of the Business Transactions, each thread ran independently of the other parallel
threads without any application-induced contention.
The Neo4j Cypher statements (insert and update) were fine-tuned a few times in order to improve the performance and
achieve the results shown in the next few pages.
17. Neo4j POC – Graph Update Test (contd..)
Test Results:
The following table shows the performance metrics of the Neo4j Graph update tests, which fell well short of the
target of loading 650 Business Transactions per second.
| Test # | # of Threads (Degree of Parallelism) | Elapsed Time (hh:mm:ss) | Elapsed Time (Seconds) | # of Transaction Groups Processed | Total # of Business Transactions Processed | # of Business Transactions Processed per Second |
|--------|--------------------------------------|-------------------------|------------------------|-----------------------------------|--------------------------------------------|------------------------------------------------|
| 1 | 1 | 0:40:49 | 2,448.69 | 500 | 500,000 | 204 |
| 2 | 10 | 0:37:42 | 2,262.03 | 500 | 500,000 | 221 |
| 3 | 50 | 0:39:36 | 2,375.91 | 500 | 500,000 | 210 |
| 4 | 100 | 0:40:28 | 2,428.37 | 500 | 500,000 | 206 |
| 5 | 1 | 3:13:21 | 11,601.48 | 2,909 | 2,909,000 | 251 |
| 6 | 10 | 2:52:06 | 10,325.99 | 2,909 | 2,909,000 | 282 |
| 7 | 50 | 3:02:41 | 10,960.63 | 2,909 | 2,909,000 | 265 |
| 8 | 100 | 3:06:14 | 11,174.25 | 2,909 | 2,909,000 | 260 |
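The last column is derived directly from the elapsed time; for example, for tests 1 and 6:

```python
def txns_per_second(total_txns, elapsed_seconds):
    """Throughput as whole Business Transactions per second."""
    return round(total_txns / elapsed_seconds)

print(txns_per_second(500_000, 2_448.69))     # Test 1 -> 204
print(txns_per_second(2_909_000, 10_325.99))  # Test 6 -> 282
```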
18. Neo4j POC – Graph Update Test (contd..)
Test Results (contd..)
The following table compares the Neo4j Graph operations between the target of 650 Business Transactions
per second and the achieved maximum of 282 Business Transactions per second.
Key observations:
• Although the AWS EC2 instance hosting the single Neo4j Graph database instance had 96 vCPUs, the max CPU
consumption did not exceed 30% of the total CPU capacity during the load of incremental data.
• The degree of parallelism of 10, i.e., updating the Neo4j Graph database via 10 concurrent connections,
achieved the optimal performance for this test.
Neo4j Graph operations per second:

| | Create Node | Attributes Created per New Node | Create Relationships | Update Node | Attributes Changed per Update |
|---|-------------|--------------------------------|----------------------|-------------|-------------------------------|
| Target = 650 Business Transactions per second | 3,250 | 117,000 | 5,850 | 3,900 | 7,150 |
| Achieved = 282 Business Transactions per second | 1,410 | 50,760 | 2,538 | 1,692 | 3,102 |
19. Neo4j POC – Graph Update Test (contd..)
Performance of Neo4j Graph Insert and Update operations at a glance
This chart shows the consistency of performance of the Neo4j Graph database insert and update operations for most Business Transactions in relation to the corresponding footprint of incremental data packed in those Business Transactions.
20. Neo4j POC – Graph Update Test – Conclusion
In summary, the Neo4j Graph Update test for this POC achieved a throughput of 282 Business Transactions per second compared
to the target throughput of 650 Business Transactions per second.
However, in my opinion, Neo4j did reasonably well considering that it was somewhat unfair to Neo4j to impose the
following key constraints –
1) Neo4j Graph data model was kept the same as the data model of the legacy Oracle database.
• The legacy Oracle data model needed a lot of improvement to operate optimally by itself. So, really, Neo4j can't be blamed.
• Normally, transition from a RDBMS data model to a Graph data model involves quite a bit of optimization to best realize
the benefit of a Graph database. Due to the criteria set for this POC, no data model optimization was done in this POC.
2) All test data for the selected use case was stored in a single Neo4j Graph instance.
• In contrast, the legacy Oracle database stored the large volume (over 20TB) of data in hundreds of partitions at the file
system level, which would be a key factor for achieving high throughput of write transactions against any database.
• On the other hand, the current architecture of Neo4j does not offer the ability for a single Neo4j Graph instance to
store data in partitions at the file system level. This creates a significant limitation for a single Neo4j Graph instance in
achieving high throughput of write transactions against a large data volume.
• However, it is possible to partition a large dataset among multiple Neo4j Graph database instances instead of
a single Neo4j Graph database instance, and then aggregate the resultant datasets from the Neo4j queries, run on those
multiple Neo4j Graph instances, at the application level to meet the business requirements. Undoubtedly, this would
require additional work and infrastructure footprint. [Note:- Neo4j v4.x has introduced a similar data partitioning feature
via multiple Neo4j Graph database instances, but it is not quite there yet in terms of offering all types of out-of-the-box
aggregate / analytical functions that can aggregate data across multiple Neo4j Graph database instances.]
21. Neo4j Graph POC – Final Thoughts
So, what’s the final verdict?
In my observation, it is certainly possible to use a single Neo4j Graph database instance to build a low-latency read-only data
access layer for fast data access by various types of consumers such as APIs, real-time reporting (operational, management, ad-
hoc), analytics, dashboards, etc.
In terms of the performance of concurrent complex queries against large data volume, this POC demonstrated that Neo4j
certainly passed with flying colors.
Hydrating a single Neo4j Graph database instance with bulk incremental data from various sources in near-real time is also
possible, by –
• establishing an optimized data model, especially when transitioning from the legacy RDBMS systems. Keep only those
data attributes in the Neo4j Graph database that are frequently accessed by the consumers of this fast data access layer.
• evaluating the maximum consumption throughput capacity of the single Neo4j Graph database instance as applicable for the
selected use cases. Use those metrics as key considerations for sizing the Neo4j environment (i.e., max
volume of data to store in the Neo4j database instance, CPU and memory, etc.)
• determining optimal patterns and frequencies for loading incremental data from the source systems. Evaluate the usage
patterns of the incremental data, and prioritize the load sequences of the associated nodes / relationships / attributes.
For example, if an incremental dataset contains updates of 50 attributes, and, only 10 of those attributes are accessed
by the consumers in near-real time while the remaining 40 attributes are accessed from the nightly batch/report jobs,
then those 10 attributes may be prioritized for the real-time load, and, a lazy load of the remaining 40 attributes may be
implemented.
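The prioritized-load idea in the last bullet amounts to partitioning each incremental update into a real-time payload and a lazy payload. `HOT_ATTRS` and the attribute names below are illustrative placeholders, not the actual POC attributes:

```python
# Hypothetical: the attributes consumers need in near-real time.
HOT_ATTRS = {"status", "balance", "last_txn_dt"}

def split_for_load(update):
    """Partition one incremental update into a real-time payload (loaded
    immediately) and a lazy payload (loaded later, e.g., before batch jobs)."""
    hot = {k: v for k, v in update.items() if k in HOT_ATTRS}
    cold = {k: v for k, v in update.items() if k not in HOT_ATTRS}
    return hot, cold

hot, cold = split_for_load({"status": "OPEN", "balance": 120.5, "branch_cd": 7})
print(sorted(hot), sorted(cold))  # ['balance', 'status'] ['branch_cd']
```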
With that, goodbye for now and take care!
23. Appendix A – Oracle SQL Query
Parameters:- {PARAM1}, {PARAM2}
WITH
t1_list AS (SELECT c1 FROM Customer WHERE c1 IN (....) )
, p_dt AS (SELECT TRUNC(NVL(MAX(CAST(a.dt1 AS DATE)),
SYSDATE)) AS c_dt FROM tmp_dt a)
, mgr_agent AS
(SELECT * FROM (SELECT c.c1, c.c2, trim(c.c3) || nvl2(c.c4, ' ' ||
trim(c.c4) || ' ', ' ') || trim(c.c5) as cust_pers_alias1 FROM Customer c
JOIN cust_pers cp ON ( c.k1 = cp.k1 AND cp.type_cd = 2 AND cp.eff_dt
<= trunc(sysdate) AND (trunc(sysdate) < cp.exp_dt OR
cp.exp_dt is null) ) JOIN Personnel p ON (cp.k1 = c.k1) )
WHERE c.c1 in ("{PARAM1}") AND (:agnt = 'ALL' OR :agnt = agnt_nme)
AND (:mgr = 'ALL' OR :mgr = mgr_nme) )
, cal_rec AS
(SELECT cl.col1, ..., ..., ..., ...,..., colN FROM Calendar cl
WHERE "{PARAM2}" BETWEEN cl.dt1 and dt2 )
, hsr AS
( SELECT bel.k2, ma.c1, ma.c2, ma.c3, ma.c4, bel.c1 AS alias1,
bel.c2 AS alias2, bel.c3 AS alias3, bel.c4 AS alias4, ..., bel.c10 ,
CASE bel.cd1 WHEN 1 THEN 0 + CASE WHEN bel.cd3 IN
(n1, n2, n3, n4, n5, n6) THEN 0 ELSE 4 END
+ CASE WHEN bel.cd8 IN (2,3) THEN 1 WHEN bel.cd8 = 0 THEN 2
WHEN bel.cd8 = 6 THEN 3 ELSE 4 END WHEN 3 THEN 9 END AS
sort_order, ABS(bel.amt1), lt.seq_no1 AS expr_pri,
CASE bel.cd4 WHEN 1 THEN '...' WHEN 3 THEN '...' END AS
cd4_type, lf.cd3 AS someType, CASE bel.cd4 WHEN 1 THEN CASE
WHEN bel.cd7 IN (n1, n2, n3, n4, n5, n6) THEN '...' ELSE '....' END
WHEN 3 THEN CASE WHEN ABS(bel.amt8) <= 1 THEN '.....' WHEN
ABS(bel.amt8) <= 100 THEN '.....' WHEN ABS(bel.amt8) <= 1000
THEN '.....' ELSE '.....' END END AS cat_2
FROM mgr_agent ma JOIN Customer c ON (ma.c2 = c.c2) JOIN
Exception bel ON (sa.c2 = bel.c2) JOIN Account a ON
(a.c1 = bel.c1) JOIN cal_rec cr ON (cr.id1 = bel.id1)
JOIN AcctMap am ON (am.id1 = a.id1) JOIN AcctAttrib lf ON
(lf.id2 = am.id2) JOIN TxnType3 tt ON (bel.id4 = tt.id4)
WHERE bel.cd9 = 6 AND bel.cd7 = 1 AND (bel.cd4 = 1 OR bel.cd4 = 3)
AND ( (:sType = 'ALL') OR (lf.s_cd IN ( SELECT lst1.s_cd FROM
lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0 ) ) )
AND ( (:rType = 'ALL') OR (a.r_cd IN ( SELECT lrt1.r_cd FROM
lkup2 lrt1 WHERE lrt1.r_desc = :sType AND rownum > 0 ) ) )
AND bel.id3 = bel.id5 )
, lost_txns AS
(SELECT a.id1, ma.c1, ma.c2, ma.c3, ma.c4, a.id2, ma.c7 AS alias1,
a.c2 AS alias2, NULL AS alias3, 7 AS alias4, '.....' AS alias5,
CASE WHEN (dt1 > dt11 AND dt1 <= r_e_dt) THEN CASE
WHEN (a.dt1 >= cr.dt2 ) THEN '1...' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd = 33 ) THEN '2....' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd = 32 ) THEN '3....' WHEN (at_rcl.dt1 >=
cr.dt4 AND at_rcl.t_cd IN (22, 26) ) THEN '4....' ELSE
'5....' END ELSE NULL END AS alias6, NULL AS alias7, NULL AS
alias8, asmry.id3 AS alias9, 10 AS alias10, NULL AS alias11,
NULL AS excp_pri, '.....' AS pr_type, lxn.sType_cd AS sType,
lxn.rType_cd AS rType, CASE WHEN lxn.sType_cd IN (2,3)
THEN '11...' ELSE '12....' END AS catg_1
FROM mgr_agent ma
CROSS JOIN p_dt pd JOIN LostTxns lxn ON (ma.id1 = lxn.id4)
JOIN AcctSmry asmry ON (lxn.id1 = lcrps.id1) JOIN cal_rec cr
ON (asmry.id1 = cr.id1 AND pd.c_dt > cr.dt3 AND pd.c_dt <=
cr.dt4) JOIN Account a ON (lxn.id1 = a.id1 AND a.cd10 = 1)
LEFT OUTER JOIN TxnType1 at_rcl ON (lxn.id1 = at_rcl.id2
AND at_rcl.cd3 IN (m1, m2, m3, m4) AND at_rcl.cd6 = 14
AND at_rcl.dt2 BETWEEN cr.st_dt AND cr.e_dt )
WHERE lxn.ind2 = 'N' AND asmry.ind2 = 'N' AND lxn.cd9 <> 2 AND
( (:sType = 'ALL') OR (lxn.sType_cd IN (SELECT lst1.s_cd FROM
lkup1 lst1 WHERE lst1.s_desc = :sType AND rownum > 0 ) ) )
AND ((:rType = 'ALL') OR (a.r_cd IN ( SELECT lrt1.r_cd FROM
lkup2 lrt1 WHERE lrt1.r_desc = :sType AND rownum > 0 ) ) ) )
, all_ex AS
(SELECT a.* FROM hsr a
UNION ALL
SELECT b.*, aliasN as catg_2 FROM lost_txns b )
/* Main query */
SELECT col1, col2, col3,..., substr(catg_1, 3) AS catg_1,
substr(catg_2, 2) AS catg_2, count(*) AS num_accts,
count(distinct ae2.id1) as num_distinct_accts,
substr(catg_1, 1, 2) as catg_1_order,
substr(catg_2, 1, 1) as catg_2_order
FROM (SELECT ae.*,
ROW_NUMBER() OVER (PARTITION BY ae.id1
ORDER BY ae.excp_pri DESC NULLS LAST) as rn
FROM all_ex ae) ae2
WHERE ae2.rn = 1 AND
(:eType = 'ALL' OR :eType = ae2.pr_type)
GROUP BY col1, col2, col3, col4, col5,
substr(catg_1, 1, 2), substr(catg_2, 1, 1);
NOTE:- This SQL query was actually 6 pages long. I have
shortened it to fit here by omitting a lot of column names
and textual values (re: ‘…’ items), replacing some of the
actual column names with fictitious names, etc. Same
applies for the Neo4j Cypher query in Appendix – B.
24. Appendix B – Neo4j Cypher Query
CALL apoc.cypher.run("
MATCH (ps:Personnel)-[:Reln_A2]->(c:Customer)-[:Reln_W]->(lxn:LostTxns)<-[:Reln_O]-(a:Account)-[:Reln_B1]-> (asmry:AcctSmry)-[:Reln_C2]->(cs:Calendar)
WHERE ('ALL' IN $mgr_list_in OR c.c_no IN $mgr_list_in) AND ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in) AND ('ALL' = $agnt_in OR CASE WHEN ps.c4 IS NULL OR ps.c4 = '' THEN trim(ps.c3) + ' '
+ trim(ps.c5) ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5) END = $agnt_in) AND (cs.i_e_dt <= $dt_in AND $dt_in < cs.e_dt)
OPTIONAL MATCH (a)-[:Reln_E]->(at:TxnType1)
WHERE (cs.s_dt <= at.dt1 AND at.dt1 < cs.e_dt)
RETURN a.id1 AS id1, -1 AS excp_pri, c.col1 AS col1, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme, a.id2 AS id2, c.c_no, 0 AS pr_type, CASE WHEN (a.dt3 >= cs.s_dt) THEN {srt: 1, catg_label: '....'}
WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 33) THEN {srt: 2, catg_label: '....'} WHEN (at.dt1 >= cs.s_dt AND at.t_cd = 32) THEN {srt: 3, catg_label: '....'}
WHEN (at.dt1 >= cs.s_dt AND at.t_cd IN [22,26]) THEN {srt: 4, catg_label: '....'} ELSE {srt: 5, catg_label: '....'} END AS catg_1, {srt: 1, catg_label: ''} AS catg_2, a.r_cd AS r_cd
UNION ALL
MATCH cs_bel_lt_paths=(cs:Calendar)<-[:Reln_G]-(bel:Exception)-[:Reln_D1]->(lt:TxnType3)
WHERE (cs.s_dt <= $dt_in AND $dt_in < cs.e_dt) AND bel.id4 = bel.id7
WITH lt.id1 AS id1, cs_bel_lt_paths
ORDER BY lt.s_no DESC
WITH id1, COLLECT(cs_bel_lt_paths)[..1] AS ltst_cs_bel_lt_paths
UNWIND ltst_cs_bel_lt_paths AS p
WITH id1, nodes(p) AS ns
WITH id1, ns[0] AS cs, ns[1] as bel, ns[2] as lt
MATCH (lt)<-[:Reln_T]-(a:Account)<-[:Reln_J]-(am:AcctMap)-[:Reln_K]->(lf:AcctAttrib), (am)<-[:Reln_U]-(c:Customer)<-[:Reln_A2]-(ps:Personnel)
WHERE ('ALL' = $mgr_in OR ps.mgr_name = $mgr_in) AND ('ALL' = $agnt_in OR
CASE WHEN ps.c4 IS NULL OR ps.c4 = '' THEN trim(ps.c3) + ' ' + trim(ps.c5) ELSE trim(ps.c3) + ' ' + trim(ps.c4) + ' ' + trim(ps.c5) END = $agnt_in)
RETURN lt.id1 AS id1, lt.s_no AS excp_pri, c.nme AS nme, ps.c7 AS agnt_nme, ps.c8 AS mgr_nme, a.id2 AS id2, c.c_no, bel.l_cd AS pr_type, CASE bel.l_cd WHEN 1 THEN CASE WHEN bel.cd3 IN
[n1, n2, n3, n4, n5, n6] THEN {srt: 21, catg_label: '....'} ELSE {srt: 22, catg_label: '....'} END WHEN 3 THEN CASE WHEN ABS(bel.amt3) <= 1 THEN {srt: 31, catg_label: '....'} WHEN ABS(bel.amt3) <= 100
THEN {srt: 32, catg_label: '....'} WHEN ABS(bel.amt3) <= 1000 THEN {srt: 33, catg_label: '....'} ELSE {srt: 34, catg_label: '....'} END END AS catg_1, CASE bel.l_cd WHEN 1 THEN CASE WHEN bel.s_cd IN [2, 3]
THEN {srt: 1, catg_label: '....'} ELSE {srt: 2, catg_label: '....'} END END AS catg_2, a.r_cd AS r_cd
", {dt_in:date($dt), mgr_list_in:$mgr_list, mgr_in: $mgr, agnt_in: $agnt}) yield value
WITH value AS v
ORDER BY v.id1, v.excp_pri DESC
WITH v.id1 AS id1, COLLECT(v)[..1] AS v0
UNWIND v0 as row
RETURN row.mgr_nme AS mgr_nme, row.agnt_nme AS agnt_nme, row.c_no AS c_no, row.c_nme AS c_nme, row.excp_pri, row.pr_typ AS excp_type, row.catg_1.catg_label AS catg_1_label,
row.catg_2.catg_label AS catg_2_label, COUNT(*) AS num_accts, COUNT(DISTINCT row.id2) AS num_dist_accts