©2018 Impetus Technologies, Inc. All rights reserved.
You are prohibited from making a copy or modification of, or from redistributing,
rebroadcasting, or re-encoding of this content without the prior written consent of
Impetus Technologies.
This presentation may include images from other products and services. These
images are used for illustrative purposes only. Unless explicitly stated, there is
no implied endorsement or sponsorship of these products by Impetus Technologies.
All copyrights and trademarks are property of their respective owners.
Planning your Next-Gen Change Data Capture (CDC) Architecture
December 19, 2018
Agenda
What is CDC?
Various methods for CDC in the enterprise data warehouse
Key considerations for implementing a next-gen CDC architecture
Demo
Q&A
About Impetus
We exist to create powerful and intelligent enterprises through deep
data awareness, data integration, and data analytics.
About Impetus
Many of North America’s most respected and well-known brands trust
us as their strategic big data and analytics partner.
End-to-End Big Data Solutions
Transformation: legacy EDW to big data/cloud
Unification: data processing, preparation, and access
Analytics: real-time, machine learning, and AI
Self-service: BI on big data/cloud
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
Continuous Ingestion in the Data Lake
Capturing streaming data changes
Database migration to cloud
Data preparation for analytics and ML jobs
We still have a legacy system
Our Speakers Today
SAURABH DUTTA
Technical Product Manager
SAMEER BHIDE
Senior Solutions Architect
What is Change Data Capture (CDC)?
CDC is the process of capturing changes made at the data source and
applying them throughout the enterprise.
Let’s Take a Closer Look
A WebApp writes to a source database, and those changes must be carried over to a target database.
Create: the WebApp inserts Customer { Telephone: "111" } into the source database. A change data capture event propagates the new record, and the target database now also holds Customer { Telephone: "111" }.
Update: the WebApp changes the telephone number, so the source database holds Customer { Telephone: "222" }. Another change data capture event propagates the update, and the target database reflects Customer { Telephone: "222" }.
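To make the events above concrete, here is a minimal sketch of how such a change event is often represented; the field names (op, before, after, ts_ms) are illustrative assumptions for this sketch, not any specific vendor's format.

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class CdcEvent:
    """A simplified change event; real CDC tools add table, schema, and log-position metadata."""
    op: str                 # "c" = create, "u" = update, "d" = delete
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)
    ts_ms: int = field(default_factory=lambda: int(time.time() * 1000))

# The create and update shown in the walkthrough above:
create_event = CdcEvent(op="c", before=None, after={"Telephone": "111"})
update_event = CdcEvent(op="u", before={"Telephone": "111"}, after={"Telephone": "222"})
```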
What Does CDC Mean for the Enterprise?
Sources: RDBMS, data warehouse, files, legacy systems
Processing: replicate, filter, and transform in-memory, in batch or real-time incremental mode
Targets: RDBMS, data warehouse, Hadoop, streaming
Modern CDC Applications
Data lake: continuous ingestion and pipeline automation
Streaming: data changes to Kafka, Kinesis, or other queues
Cloud: data workload migration
Business applications: data preparation for analytics and ML jobs
Legacy system: data delivery and query offload
Sources and targets span files, legacy RDBMS, data warehouses, data lakes, streaming, and cloud.
Methods of Change Data Capture
Database triggers, date modification stamping, and log-based CDC
Database Triggers
Uses shadow tables populated by triggers on the source tables
Challenges
• Introduces overhead on every write
• Increases load on the source database to retrieve changes
• Can lose intermediate changes
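As a rough sketch of the trigger approach (SQLite is used here only for brevity; the customer table, shadow table, and trigger names are made up), every insert, update, and delete is copied into a shadow table by the triggers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, telephone TEXT);

-- Shadow table that records every change plus the operation type
CREATE TABLE customer_shadow (
    id INTEGER, telephone TEXT, op TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_shadow (id, telephone, op) VALUES (NEW.id, NEW.telephone, 'I');
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_shadow (id, telephone, op) VALUES (NEW.id, NEW.telephone, 'U');
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_shadow (id, telephone, op) VALUES (OLD.id, OLD.telephone, 'D');
END;
""")

# Every write now also fires a trigger -- this is the overhead mentioned above.
conn.execute("INSERT INTO customer VALUES (1, '111')")
conn.execute("UPDATE customer SET telephone = '222' WHERE id = 1")
print(conn.execute("SELECT * FROM customer_shadow").fetchall())
```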
Date Modification Stamping
Transactional applications keep track of metadata in every row
• Tracks when the row was created and last modified
• Enables filtering on the DATE_MODIFIED column
Challenges
• There is no DATE_MODIFIED value for a deleted row
• DATE_MODIFIED is often maintained by triggers, which adds overhead
• Extracting changes by scanning on DATE_MODIFIED is resource intensive
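A minimal sketch of extraction filtered on DATE_MODIFIED, again with made-up table and column names; note that deleted rows simply never appear in the result:

```python
import sqlite3

def extract_changes(conn: sqlite3.Connection, last_run: str):
    """Pull rows created or modified since the previous extraction watermark."""
    query = """
        SELECT id, telephone, date_modified
        FROM customer
        WHERE date_modified > ?
        ORDER BY date_modified
    """
    return conn.execute(query, (last_run,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, telephone TEXT, date_modified TEXT)")
conn.execute("INSERT INTO customer VALUES (1, '222', '2018-12-19T10:00:00Z')")

# Rows deleted since the last run are invisible here -- there is nothing left to filter on.
print(extract_changes(conn, last_run="2018-12-18T00:00:00Z"))
```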
Log-Based CDC
Uses the database transaction logs
Challenges
• Interpreting the transaction log format
• Vendors provide no direct interface to the transaction log
• Agents and interfaces change with new database versions
• Supplemental logging increases the volume of logged data
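Because of these challenges, log-based changes are usually consumed downstream from a message queue rather than by reading the log directly. A sketch using kafka-python, assuming the CDC tool publishes JSON events with op/before/after fields to a topic named cdc.customer (both the topic and the message shape are assumptions):

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic fed by a log-based CDC/replication agent
consumer = KafkaConsumer(
    "cdc.customer",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # Assumed event shape: {"op": "c|u|d", "before": {...}, "after": {...}}
    if event.get("op") == "d":
        print("delete", event.get("before"))
    else:
        print(event.get("op"), event.get("after"))
```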
Change Data Capture Implementation Steps
1. Enable CDC for the database
2. Prepare the table for CDC
3. Create a table to handle CDC states
4. Define the target
5. Run the initial load
6. Apply incremental updates
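The table that handles CDC state typically just records a per-table watermark (the last processed log position or timestamp). A minimal sketch, with assumed table and column names:

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect("cdc_state.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cdc_state (
        table_name TEXT PRIMARY KEY,
        last_position TEXT,          -- e.g. log sequence number or timestamp
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def get_watermark(table_name: str) -> Optional[str]:
    row = conn.execute(
        "SELECT last_position FROM cdc_state WHERE table_name = ?", (table_name,)
    ).fetchone()
    return row[0] if row else None

def save_watermark(table_name: str, position: str) -> None:
    conn.execute(
        "INSERT INTO cdc_state (table_name, last_position) VALUES (?, ?) "
        "ON CONFLICT(table_name) DO UPDATE SET last_position = excluded.last_position",
        (table_name, position),
    )
    conn.commit()

save_watermark("customer", "2018-12-19T10:00:00Z")
print(get_watermark("customer"))
```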
Next-Gen Architecture Considerations
Ease of use: pre-packaged operators, extensibility, and a modern user experience
Real-time: change data capture and streaming of live updates, optimized for high performance
Hybrid: multiple vendors, on-premise and cloud, spanning databases, data warehouses, and data lakes
Value Proposition of CDC
Incremental update efficiency
Source/production impact
Time to value
TCO
Scale and flexibility
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)? Poll results:
• Continuous ingestion in the data lake: 46%
• Capturing streaming data changes: 58%
• Database migration to cloud: 38%
• Data preparation for analytics and ML jobs: 35%
• We still have a legacy system: 46%
ETL, Real-time Stream Processing and Machine Learning Platform
+ A Visual IDE for Apache Spark
CDC with StreamAnalytix
Turnkey adapters for CDC vendors
ETL and data wrangling visual operators
Elastic compute
Reconcile, transform, and enrich
Sources: CDC streams, unstructured data streams, file stores, structured data stores
Targets: structured data stores, message queues, Hadoop/Hive, cloud storage and DW
CDC Capabilities in StreamAnalytix
• Integration with CDC providers
• LogMiner integration
• Turnkey reconciliation feature for Hadoop offload
• Large set of visual operators for ETL, analytics, and stream processing
• Zero-code approach to ETL design
• Built-in NFR support
StreamAnalytix CDC Solution Design
StreamAnalytix Workflow
A complete CDC solution has three parts, each modelled as a StreamAnalytix pipeline:
1. Data ingestion and staging: stream CDC data from Attunity Replicate (via Kafka) or LogMiner for multiple tables, and store the raw data in HDFS
2. Data de-normalization: join transactional data with data at rest, and store the de-normalized data on HDFS
3. Incremental updates in Hive: merge previously processed transactional data with new incremental updates
Pipeline #1: Data Ingestion and Staging (Streaming)
Data ingestion via Attunity ‘channel’: reads the data that Attunity delivers to Kafka, configured to read data feeds and metadata from separate topics
Data enrichment: enriches incoming data with metadata and an event timestamp
HDFS: stores the CDC data in an HDFS landing area using the out-of-the-box HDFS emitter; files are rotated based on time and size
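For readers who want to see the equivalent logic outside the product, the following is a sketch in Spark Structured Streaming; the topic name, HDFS paths, and enrichment columns are assumptions, and this is not StreamAnalytix's internal implementation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("cdc-ingest-staging").getOrCreate()

# Read raw CDC messages published by the replication tool into Kafka (topic name assumed)
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "cdc.customer")
    .load()
)

# Enrich with Kafka metadata and an ingest timestamp before landing the data
enriched = (
    raw.selectExpr("CAST(value AS STRING) AS payload", "topic", "partition", "offset")
    .withColumn("ingest_ts", current_timestamp())
)

# Land raw data in HDFS; file rotation is approximated here by micro-batch triggers
query = (
    enriched.writeStream.format("parquet")
    .option("path", "hdfs:///data/cdc/landing/customer")
    .option("checkpointLocation", "hdfs:///data/cdc/checkpoints/customer")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```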
Pipeline #2: Data De-normalization (Batch)
HDFS data channel: ingests incremental data from previous runs of the staging location and reads reference data (data at rest) from a fixed HDFS location
Join: performs an outer join to merge the incremental and static data
HDFS: stores the de-normalized data to an HDFS directory
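A comparable batch job could be sketched in PySpark as follows; the paths, join key (customer_id), and choice of full outer join are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-denormalize").getOrCreate()

# Incremental data written by the staging pipeline, and reference data at rest (paths assumed)
incremental = spark.read.parquet("hdfs:///data/cdc/landing/customer")
reference = spark.read.parquet("hdfs:///data/reference/customer_master")

# Outer join so that new keys from the increment and untouched keys from the
# reference data both survive in the merged output
denormalized = incremental.join(reference, on="customer_id", how="full_outer")

# Store the de-normalized output where the Hive reconciliation pipeline expects it
denormalized.write.mode("overwrite").parquet("hdfs:///data/cdc/denormalized/customer")
```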
Pipeline #3: Incremental Updates in Hive (Batch)
Load step: runs a Hive SQL query to load a managed table from the HDFS incremental data generated by Pipeline #2
Reconciliation step: a Hive “merge into” statement performs insert, update, and delete operations based on the operation flag in the incremental data
Clean-up step: runs a drop table command on the managed table to remove processed data and avoid repeated processing
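The three steps could be expressed as HiveQL submitted over a HiveServer2 connection (PyHive is used here as one possible client; table names, columns, and the op flag are assumptions). Note that Hive MERGE requires transactional (ACID) tables:

```python
from pyhive import hive  # pip install pyhive

conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# 1. Load a managed staging table from the incremental HDFS data produced by Pipeline #2
cursor.execute("""
    LOAD DATA INPATH 'hdfs:///data/cdc/denormalized/customer'
    OVERWRITE INTO TABLE customer_staging
""")

# 2. Reconcile: MERGE INTO applies inserts, updates, and deletes based on the op flag
cursor.execute("""
    MERGE INTO customer_master AS t
    USING customer_staging AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET telephone = s.telephone
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.telephone)
""")

# 3. Clean up the staging table so the same increment is not processed twice
cursor.execute("DROP TABLE customer_staging")
```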
Workflow: Oozie Coordinator Job
An Oozie orchestration flow created in the StreamAnalytix web studio
orchestrates Pipeline #2 and Pipeline #3 as a single Oozie flow that can be
scheduled as a coordinator job.
Demo
Data channels
• Attunity Replicate and LogMiner
Data processing pipeline walkthrough
• Data filters and enrichment
• Analytics and data processing operators
• Data stores
Summary
Do more with your data acquisition flows
• Acquire and process data in real-time
• Enrich data from data marts
• Publish processed data as it arrives
• Multiple parallel processing paths (read once, process multiple times)
Move away from fragmented processes
• Unify data analytics and data processing/ETL flows
Conclusion
The right next-gen CDC solution can make data ready for analytics as it
arrives, in near real-time.
CDC-based data integration is far more complex than a full export and
import of your database.
A unified platform simplifies and reduces the complexity of
operationalizing CDC flows.
LIVE Q&A
For a free trial download or cloud access visit www.StreamAnalytix.com
For any questions, contact us at inquiry@streamanalytix.com

Editor's Notes

  • #3 Timing: intro 5 min, poll 2 min; Saurabh: background 2 min, goals 3 min, steps 3 min, methods 3 min, architectural considerations 2 min; Sameer: CDC with SAX 3 min, deployment and key benefits 3 min, demo 10 min, beyond CDC 5 min; Q&A 10 min.
  • #10 CDC minimizes the resources required for ETL (extract, transform, load) processes because it only deals with data changes. The goal of CDC is to ensure data synchronicity.
  • #11 https://www.youtube.com/watch?v=v_hQyUZzLsA
  • #12 https://www.youtube.com/watch?v=v_hQyUZzLsA
  • #13 https://www.youtube.com/watch?v=v_hQyUZzLsA
  • #14 https://www.youtube.com/watch?v=v_hQyUZzLsA
  • #15 https://www.youtube.com/watch?v=v_hQyUZzLsA
  • #16 https://www.youtube.com/watch?v=1WrfgBx3hiQ
  • #18 Reference: https://www.slideshare.net/jimdeppen/change-data-capture-13718162 https://www.hvr-software.com/blog/change-data-capture/
  • #22 https://blog.exsilio.com/all/how-to-use-change-data-capture/
  • #23 https://www.youtube.com/watch?v=1WrfgBx3hiQ
  • #24 https://www.youtube.com/watch?v=1WrfgBx3hiQ
  • #40-#42 DataGenerator: generates a dummy record on pipeline start. LoadStagingInHive: runs a Hive SQL query to load the staging managed table. MergeStagingAndMasterData: runs the Hive “merge into” statement. DropStagingHiveData: runs a drop table command in Hive to drop the managed table loaded in the second step.
  • #47 Questions