SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks for EDM, Our Unified Data Platform
The document discusses the evolution and architecture of a standardized data management and ingestion platform led by Arun Manivannan, highlighting the complexities involved in managing vast amounts of data across various applications and regulatory environments. It emphasizes the transition from legacy systems to a unified ingestion framework, detailing key processes such as data validation, cleansing, and change data capture. The aim is to enhance processing efficiency, manage data flows effectively, and eventually open source the developed framework.
Data in SCBcontext
Variety of applications - Web, Batch, Mainframes
1 Hundreds of applications (~180 data lake source apps)
2
Variety of consumption patterns
3
Multi-regulatory guided data storage
4
5
50+ countries
4.
Data in SCBcontext
Variety of applications - Web, Batch, Mainframes
1 Hundreds of applications (~180 data lake source apps)
2
Variety of consumption patterns
3
Multi-regulatory guided data storage
4
5
50+ countries
5.
History
2012
Enterprise Data
Management onTeradata
Now
Unified ingestion platform
Until a year ago
Legacy Ingestion
framework
<2012
Traditional group-wise Data
Warehouses
2014
EDM on Hadoop
Essential pre-processing
RECORDRECEIVE CLEANSEVALIDATE CONSTRUCT
1
2
3
4
5
Column and Row count validation
Embedded new line removal and special character replacement
Datatype validation
Data transformation
Value defaulting
Generation 2 Awesomeness
RECORDRECEIVEVALIDATE CONSTRUCT
Faster development cycle for new applications.
Easier to reason with the flow with NiFi’s visual
flow representation.
1
Consistent tooling for preprocessing, ops data
management, error reporting and archival
2
Significantly faster processing through Spark
3
ORC performs well for most of our consumption
patterns - supporting both predicate and projection
push down.
4
5
Security and managed concurrency
CLEANSE
Change Data
Capture
14.
Generation 2 Awesomeness
RECORDRECEIVEVALIDATE CONSTRUCT
Faster development cycle for new applications.
Easier to reason with the flow with NiFi’s visual
flow representation.
1
Consistent tooling for preprocessing, ops data
management, error reporting and archival
2
Significantly faster processing through Spark
3
ORC performs well for most of our consumption
patterns - supporting both predicate and projection
push down.
4
5
Security and managed concurrency
CLEANSE
Change Data
Capture
RECORDRECEIVE VALIDATE CONSTRUCT
Typesof Data
RECORDRECEIVE VALIDATE CONSTRUCT
» Master (eg. Customer data)
» Transaction (eg. Banking transactions)
» Transaction data as Master (eg. editable
transactions)
TYPES OF DATA
CLEANSE
18.
Types of Data
RECORDRECEIVEVALIDATE CONSTRUCT
Data accumulated
until Yesterday
Today’s changes
Today’s EOD
snapshot data
Monday’s snapshot
data
Tuesday’s snapshot
data
Wednesday’s
snapshot data
Master Transactional
History Current
CLEANSE
19.
Frequency
RECORDRECEIVE VALIDATE CONSTRUCT
»Master (eg. Customer data)
» Transaction (eg. Banking transactions)
» Transaction data as Master (eg. editable
transactions)
» Output of Change Data Capture systems
(Delimited)
» Plain delimited
» Fixed width
» Avro-JSON messages
» Spreadsheets
DATA FORMATS
TYPES OF DATA
» Streaming
» Hourly incremental
» Daily
» Weekly
FREQUENCY
CLEANSE
20.
Data accumulated
until Tuesday
Today’schanges
Wednesday’s
snapshot data
Frequency - Hourly Incremental
hourly changesToday’s changes
Today’s changes
Today’s changesWednesday’s hourly
changes
hourly changes
Wednesday’s hourly
changes
Wednesday’s
snapshot data
RECORDRECEIVE VALIDATE CONSTRUCTCLEANSE
21.
Data formats
RECORDRECEIVE VALIDATECONSTRUCT
» Master (eg. Customer data)
» Transaction (eg. Banking transactions)
» Transaction data as Master (eg. editable
transactions)
» Output of Change Data Capture systems
(Delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
DATA FORMATS
TYPES OF DATA
» Streaming
» Hourly incremental
» Daily
» Weekly
FREQUENCY
CLEANSE
22.
Data formats -CDC Delimited
TIMESTAMP, TRANSACTION_ID, OPERATION_TYPE, USER_ID,<DATAFIELDVALUE1>,<DATAFIELDVALUE2>….
2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890
CDC Columns Application Columns
Could be one of I, B, A, D
23.
CDC Delimited -Snapshot building
RECORDRECEIVE VALIDATE CONSTRUCT
Data accumulated
until Yesterday
Today’s changes
Snapshot building
History Snapshot
D records moves to History I or latest A moves to Snapshot
CLEANSE
24.
Data formats andFrequency
RECORDRECEIVE VALIDATE CONSTRUCT
» Master (eg. Customer data)
» Transaction (eg. Banking transactions)
» Transaction data as Master (eg. editable
transactions)
DATA FORMATS
TYPES OF DATA
» Streaming
» Hourly incremental
» Daily
» Weekly
FREQUENCY
CLEANSE
» Output of Change Data Capture systems
(Delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
25.
Data formats -Delimited with Header/Trailer
H3
TRANS_ID_1|USER_ID1|100.00
TRANS_ID_2|USER_ID2|3.00
TRANS_ID_3|USER_ID1|20.00
T123.00
» Header and trailer has meta information about the data in the file
» Varieties of header and trailer formats
Row count of the file
Sum or a rolling checksum of column(s) in that file
* Header and trailer information is used during reconciliation
27.
The final pieceto the puzzle
RECORDRECEIVE VALIDATE CONSTRUCT
» Master (eg. Customer data)
» Transaction (eg. Banking transactions)
» Transaction data as Master (eg. editable
transactions)
DATA FORMATS
» Streaming
» Hourly incremental
» Daily
» Weekly
FREQUENCY
Among others already discussed,
» Schema evolution
» Cascading re-runs
» full-dump override
PROCESSING
» TYPES OF DATA
CLEANSE
» Output of Change Data Capture systems
(Delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
To summarise
Our codeis (still) manageable and extensible
1
The volume, the variety and the interesting combinations of
data presented us some very interesting problems to solve.
2
And… we intend to open source the framework for the
general audience.
3
4
With HDF and HDP, we were able to abstract away the cross
cutting concerns such as security and concurrency.