Standardized Data Management
Data Ingestion Platform
Arun Manivannan
Senior Data Engineer
Our Data
History
Our Dataflow
Discuss some interesting problems
Questions
So, what do we do now?
Data in SCB context
1. Variety of applications - Web, Batch, Mainframes
2. Hundreds of applications (~180 data lake source apps)
3. Variety of consumption patterns
4. Multi-regulatory guided data storage
5. 50+ countries
History
<2012: Traditional group-wise Data Warehouses
2012: Enterprise Data Management on Teradata
2014: EDM on Hadoop
Until a year ago: Legacy Ingestion framework
Now: Unified ingestion platform
What is our view of an Ingestion Framework?
[Pipeline diagram: RECEIVE → CLEANSE → VALIDATE → RECORD → CONSTRUCT, grouped into PREPROCESSING and PROCESSING stages, with SECURITY AND LINEAGE TRACKING spanning the whole flow]
Cleanse
[pipeline diagram with the Cleanse stage highlighted]

Validate
[pipeline diagram with the Validate stage highlighted]
Essential pre-processing
1. Column and Row count validation
2. Embedded new line removal and special character replacement
3. Datatype validation
4. Data transformation
5. Value defaulting
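A minimal sketch of what steps 1, 2, 3 and 5 might look like over already-split delimited rows. All names and rules here (the `|` replacement, the `parsers`/`defaults` mappings) are illustrative assumptions, not the framework's actual code; step 4, data transformation, is application-specific and omitted.

```python
import re

def preprocess(rows, expected_cols, parsers, defaults):
    """Run the essential pre-processing steps over parsed delimited rows.

    parsers maps column index -> a type-validating callable (e.g. int, float);
    defaults maps column index -> value used when a field arrives empty.
    Returns (clean_rows, rejected_rows).
    """
    clean, rejected = [], []
    for row in rows:
        # 1. Column count validation
        if len(row) != expected_cols:
            rejected.append(row)
            continue
        fixed, ok = [], True
        for i, field in enumerate(row):
            # 2. Embedded newline removal and special character replacement
            field = re.sub(r"[\r\n]+", " ", field).replace("|", " ")
            # 5. Value defaulting for empty fields
            if field == "" and i in defaults:
                field = defaults[i]
            # 3. Datatype validation
            try:
                parsers.get(i, str)(field)
            except ValueError:
                ok = False
                break
            fixed.append(field)
        (clean if ok else rejected).append(fixed if ok else row)
    return clean, rejected
```

Row count validation against the file-level expected total would sit alongside this, driven by the header/trailer metadata discussed later.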
Record
[pipeline diagram with the Record stage highlighted]

Construct
[pipeline diagram with the Construct stage highlighted]
Backing technologies
Change Data Capture
Generation 2 Awesomeness
1. Faster development cycle for new applications. Easier to reason about the flow with NiFi’s visual flow representation.
2. Consistent tooling for preprocessing, ops data management, error reporting and archival
3. Significantly faster processing through Spark
4. ORC performs well for most of our consumption patterns, supporting both predicate and projection push down.
5. Security and managed concurrency
Extending NiFi via Custom Processors
Good Problems
Don’t bring me anything but trouble.
Good news weakens me.
- Charles Kettering
Types of Data
» Master (e.g. Customer data)
» Transaction (e.g. Banking transactions)
» Transaction data as Master (e.g. editable transactions)
Types of Data
[Diagram - Master: data accumulated until Yesterday + Today’s changes → Today’s EOD snapshot data. Transactional: Monday’s, Tuesday’s and Wednesday’s snapshot data, partitioned into History and Current]
Frequency
» Streaming
» Hourly incremental
» Daily
» Weekly
Frequency - Hourly Incremental
[Diagram: hourly change batches accumulate into Today’s changes; data accumulated until Tuesday + Wednesday’s hourly changes roll up into Wednesday’s snapshot data]
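The hourly roll-up above can be sketched with plain dicts, assuming keyed change tuples of `(operation, key, value)`; the operation codes and helper names here are illustrative, not the framework's API:

```python
def apply_changes(snapshot, changes):
    """Apply one batch of keyed changes to a snapshot held as a dict."""
    for op, key, value in changes:
        if op == "D":          # delete
            snapshot.pop(key, None)
        else:                  # insert or update
            snapshot[key] = value
    return snapshot

def end_of_day_snapshot(prev_snapshot, hourly_batches):
    """Roll the day's hourly change batches into yesterday's snapshot."""
    snapshot = dict(prev_snapshot)  # keep yesterday's snapshot intact
    for batch in hourly_batches:
        apply_changes(snapshot, batch)
    return snapshot
```

Applying each hourly batch in arrival order is what makes "latest change wins" fall out for free.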
Data formats
» Output of Change Data Capture systems (Delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
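Of the formats listed, fixed width is the one with no delimiter to split on; a minimal sketch of slicing such a record (the widths and field values are made up for illustration):

```python
def parse_fixed_width(line, widths):
    """Slice a fixed-width record into fields, stripping the padding."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields
```

For example, `parse_fixed_width("TRANS_01  USER_1    100.00", [10, 10, 6])` yields the three trimmed fields.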
Data formats - CDC Delimited
TIMESTAMP, TRANSACTION_ID, OPERATION_TYPE, USER_ID, <DATAFIELDVALUE1>, <DATAFIELDVALUE2>, …
2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890
» The first four columns are CDC columns; the remaining columns are application columns.
» OPERATION_TYPE could be one of I, B, A, D.
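Splitting such a line into its CDC metadata and application payload is straightforward; a sketch (the lower-case key names are my own, not the feed's):

```python
# The four leading CDC columns, in feed order
CDC_COLS = ("timestamp", "transaction_id", "operation_type", "user_id")

def parse_cdc_line(line, delim="|"):
    """Split a CDC-delimited line into CDC metadata and application fields."""
    fields = line.split(delim)
    meta = dict(zip(CDC_COLS, fields[:len(CDC_COLS)]))
    return meta, fields[len(CDC_COLS):]
```

Running it on the sample row gives `operation_type == "B"` and the two application fields as the payload.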
CDC Delimited - Snapshot building
[Diagram: data accumulated until Yesterday + Today’s changes → Snapshot building → History and Snapshot]
» D records move to History; an I, or the latest A, moves to Snapshot
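The snapshot-building rule above can be sketched in a few lines, assuming changes arrive as `(operation_type, key, record)` tuples in order; the handling of B rows (simply skipped) is my assumption for this sketch:

```python
def build_snapshot(snapshot, history, changes):
    """Fold one day's CDC changes into the (snapshot, history) pair."""
    for op, key, record in changes:
        if op == "D":
            snapshot.pop(key, None)   # drop the deleted key from Snapshot...
            history.append(record)    # ...D records move to History
        elif op in ("I", "A"):
            snapshot[key] = record    # an I, or the latest A, wins in Snapshot
        # B (before-image) rows are ignored in this sketch
    return snapshot, history
```

Processing in arrival order is what makes "latest A" win without any extra bookkeeping.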
Data formats and Frequency
[Recap: the Types of Data, Data formats and Frequency panels from the previous slides]
Data formats - Delimited with Header/Trailer
H3
TRANS_ID_1|USER_ID1|100.00
TRANS_ID_2|USER_ID2|3.00
TRANS_ID_3|USER_ID1|20.00
T123.00
» The header and trailer carry meta information about the data in the file: the row count of the file, and a sum or rolling checksum of column(s) in that file
» Varieties of header and trailer formats exist
* Header and trailer information is used during reconciliation
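A minimal reconciliation sketch for a file in the shape shown above, assuming an `H<rowcount>` first line and a `T<sum>` last line (real header/trailer formats vary, as noted):

```python
def reconcile(lines, amount_col=2, delim="|"):
    """Validate a file against its header row count and trailer checksum.

    Assumes an 'H<rowcount>' first line and a 'T<sum>' last line; the
    trailer sum is taken over amount_col of the data rows.
    """
    header, body, trailer = lines[0], lines[1:-1], lines[-1]
    rows_ok = len(body) == int(header[1:])
    total = sum(float(row.split(delim)[amount_col]) for row in body)
    sum_ok = abs(total - float(trailer[1:])) < 1e-9
    return rows_ok and sum_ok
```

On the sample file this passes: 3 data rows, and 100.00 + 3.00 + 20.00 = 123.00 matches the trailer.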
The final piece to the puzzle
Among others already discussed:
» Schema evolution
» Cascading re-runs
» Full-dump override
More awesomeness
1. Metadata management UI
2. Schema evolution & retrofitting historic data
3. Reruns, Cascading reruns, full-dump override
4. One-touch deployment
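One way to picture "schema evolution & retrofitting historic data": project old records onto the evolved schema, defaulting columns they never had. This is a sketch under the common assumption that added columns are nullable or defaulted, not the framework's actual mechanism:

```python
def retrofit(records, evolved_schema):
    """Project historic records onto an evolved schema.

    evolved_schema maps column name -> default value; columns the old
    records never had are filled with the default, so historic data
    stays queryable under the new schema.
    """
    return [
        {col: record.get(col, default) for col, default in evolved_schema.items()}
        for record in records
    ]
```

Dropped columns fall away and added columns are defaulted in one pass, which is what makes reruns over old partitions possible after a schema change.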
To summarise
1. Our code is (still) manageable and extensible.
2. The volume, the variety and the interesting combinations of data presented us with some very interesting problems to solve.
3. With HDF and HDP, we were able to abstract away cross-cutting concerns such as security and concurrency.
4. And… we intend to open source the framework for the general audience.
THANK YOU!
Questions?
@arunma

SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Framework for EDM, Our Unified Data Platform
