Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - StreamAnalytix Webinar
The document presents an overview of Change Data Capture (CDC) architecture, detailing its definition, various implementation methods, and use cases in enterprise environments. It highlights the transition from legacy systems to big data integrations, outlines challenges, and describes the architectural considerations for efficient CDC. Additionally, it emphasizes the role of tools like StreamAnalytix in simplifying CDC workflows for real-time analytics and data processing.
Agenda
What is CDC?
Various methods for CDC in the enterprise data warehouse
Key considerations for implementing a next-gen CDC architecture
Demo
Q&A
4.
About Impetus
We exist to create powerful and intelligent enterprises through deep
data awareness, data integration, and data analytics.
5.
About Impetus
Many of North America’s most respected and well-known brands trust
us as their strategic big data and analytics partner.
6.
End-to-End Big Data Solutions
• Transformation: Legacy EDW to big data/cloud
• Unification: Data processing, preparation, and access
• Analytics: Real-time, machine learning, and AI
• Self-service: BI on big data/cloud
7.
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
• Continuous ingestion in the data lake
• Capturing streaming data changes
• Database migration to cloud
• Data preparation for analytics and ML jobs
• We still have a legacy system
What Does CDC Mean for the Enterprise?
[Diagram: the legacy pattern batch-replicates, filters, and transforms data from RDBMS, data warehouse, and file sources into an RDBMS or data warehouse; the modern pattern captures changes in real time and streams them incrementally, in-memory, into Hadoop.]
16.
Modern CDC Applications
• Data lake: Continuous ingestion and pipeline automation
• Streaming: Data changes to Kafka, Kinesis, or other queues
• Cloud: Data workload migration
• Business applications: Data preparation for analytics and ML jobs
• Legacy systems: Data delivery and query offload
17.
Methods of Change Data Capture
• Database triggers (see the sketch below)
• Data modification stamping
• Log-based CDC
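A minimal sketch of the trigger-based method, assuming a hypothetical ORDERS table and Oracle PL/SQL syntax (not from the original slides): each insert, update, or delete copies a row into a shadow change table that downstream ETL can poll.

-- Hypothetical trigger-based CDC on an ORDERS table (Oracle PL/SQL syntax).
-- Every change writes a row into a shadow change table.
CREATE TABLE orders_changes (
  order_id    NUMBER,
  operation   CHAR(1),                         -- 'I', 'U', or 'D'
  changed_at  TIMESTAMP DEFAULT SYSTIMESTAMP
);

CREATE OR REPLACE TRIGGER trg_orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO orders_changes (order_id, operation) VALUES (:NEW.order_id, 'I');
  ELSIF UPDATING THEN
    INSERT INTO orders_changes (order_id, operation) VALUES (:NEW.order_id, 'U');
  ELSE
    INSERT INTO orders_changes (order_id, operation) VALUES (:OLD.order_id, 'D');
  END IF;
END;
/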
Date Modification Stamping
Transactional applications keep track of metadata in every row
• Tracks when the row was created and last modified
• Enables filtering on the DATE_MODIFIED column (see the sketch below)
Challenges
• There is no DATE_MODIFIED value for a deleted row
• DATE_MODIFIED is often maintained by triggers, which adds overhead
• Extraction is resource intensive
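A minimal sketch of the resulting incremental extract, assuming a hypothetical ORDERS table and a saved high-water mark from the previous run:

-- Hypothetical incremental extract: pull only rows changed since the last run.
SELECT *
FROM   orders
WHERE  date_modified > :last_extract_time;   -- high-water mark from the previous run
-- Rows deleted since the last run never appear in this result set,
-- which is the first challenge listed above.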
20.
Log-Based CDC
Uses transactional logs (see the LogMiner sketch below)
Challenges
• Interpreting transaction logs
• Vendors provide no direct interface to the transaction log
• Agents and interfaces change with new database versions
• Supplemental logging increases the volume of data
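As a hedged illustration of log-based capture, a sketch of an Oracle LogMiner session (schema name and time window are placeholders; CDC tools such as those mentioned later in the demo automate this):

-- Supplemental logging must be enabled so the redo log carries enough row data
-- (this is the "increases volume of data" challenge above).
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Start a LogMiner session over the last hour of redo, using the online catalog.
BEGIN
  DBMS_LOGMNR.START_LOGMNR(
    STARTTIME => SYSDATE - 1/24,
    ENDTIME   => SYSDATE,
    OPTIONS   => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.CONTINUOUS_MINE);
END;
/

-- Each row is one captured change; SQL_REDO can be replayed downstream.
SELECT seg_owner, table_name, operation, sql_redo
FROM   v$logmnr_contents
WHERE  seg_owner = 'SALES';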
Next-Gen Architecture Considerations
• Ease of use: pre-packaged operators, extensibility, modern user experience
• Real-time: change data capture, stream live updates, optimized for high performance
• Hybrid: multiple vendors, on-premise and cloud, databases, data warehouses, and data lakes
23.
Value Proposition of CDC
• Incremental update efficiency
• Source/production impact
• Time to value
• TCO
• Scale and flexibility
24.
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
• Continuous ingestion in the data lake: 46%
• Capturing streaming data changes: 58%
• Database migration to cloud: 38%
• Data preparation for analytics and ML jobs: 35%
• We still have a legacy system: 46%
25.
ETL, Real-time Stream Processing, and Machine Learning Platform
+ A Visual IDE for Apache Spark
26.
CDC with StreamAnalytix
• Turnkey adapters for CDC vendors
• ETL and data wrangling visual operators
• Elastic compute
[Diagram: CDC streams, unstructured data streams, file stores, and structured data stores are transformed, enriched, and reconciled, then delivered to structured data stores, message queues, Hadoop/Hive, and cloud storage and data warehouses.]
CDC Capabilities in StreamAnalytix
Turnkey reconciliation feature for Hadoop offload
30.
CDC Capabilities in StreamAnalytix
• Large set of visual operators for ETL, analytics, and stream processing
• Zero-code approach to ETL design
• Built-in NFR (non-functional requirements) support
31.
StreamAnalytix CDC Solution Design
StreamAnalytix Workflow
A complete CDC solution has three parts, each modelled as a StreamAnalytix pipeline:
1. Data ingestion and staging: stream data from Attunity Replicate (via Kafka) or LogMiner for multiple tables, and store the raw data in HDFS
2. Data de-normalization: join transactional data with data at rest and store the de-normalized data on HDFS
3. Incremental updates in Hive: merge previously processed transactional data with new incremental updates
32.
Pipeline #1: Data Ingestion and Staging (Streaming)
Data ingestion via Attunity 'channel': reads the data that Attunity Replicate delivers to Kafka; configured to read data feeds and metadata from separate topics
Data enrichment: enriches incoming data with metadata and an event timestamp
HDFS: stores CDC data in an HDFS landing area using the out-of-the-box (OOB) HDFS emitter; files are rotated based on time and size
35.
Pipeline #2: Data De-normalization (Batch)
HDFS data channel: ingests incremental data from previous runs of the staging location and reads reference data (data at rest) from a fixed HDFS location
Performs an outer join to merge incremental and static data (sketched below)
Stores the de-normalized data to an HDFS directory
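An illustrative Hive SQL equivalent of this batch join (table, column, and path names are hypothetical; in StreamAnalytix the same logic is built with visual operators, and the exact join type depends on the data model):

-- Join incremental transactional rows with reference data at rest and
-- write the flattened result to an HDFS directory for Pipeline #3.
INSERT OVERWRITE DIRECTORY '/data/cdc/denormalized'
SELECT i.order_id,
       i.op,                -- 'I' / 'U' / 'D' operation flag carried from the CDC feed
       i.amount,
       r.customer_name,
       r.customer_city
FROM   orders_incremental i
LEFT OUTER JOIN customers_ref r
  ON   i.customer_id = r.customer_id;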
39.
Pipeline #3: Incremental Updates in Hive (Batch)
Load step: runs a Hive SQL query to load a managed table from the HDFS incremental data generated by Pipeline #2
Reconciliation step: a Hive "merge into" SQL statement performs insert, update, and delete operations based on the operation flag in the incremental data (sketched below)
Clean-up step: runs a drop table command on the managed table to clean up processed data and avoid repeated processing
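A hedged Hive SQL sketch of these three steps (assumes a transactional/ACID target table; table and column names are illustrative, not the webinar's actual schema):

-- Load step: expose the de-normalized Pipeline #2 output as a managed staging table.
CREATE TABLE customer_staging AS
SELECT * FROM customer_incremental_ext;   -- external table over the Pipeline #2 HDFS directory

-- Reconciliation step: apply inserts, updates, and deletes based on the operation flag.
MERGE INTO customer_master t
USING customer_staging s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('I', 'U') THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT VALUES (s.customer_id, s.name, s.city);

-- Clean-up step: drop the managed staging table so the same increment is not processed twice.
DROP TABLE IF EXISTS customer_staging;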
42.
Workflow: Oozie Coordinator Job
An Oozie orchestration flow, created using the StreamAnalytix web studio, combines Pipeline #2 and Pipeline #3 into a single Oozie flow that can be scheduled as shown here.
43.
Demo
Data channels
• Attunity Replicate and LogMiner
Data processing pipeline walkthrough
• Data filters and enrichment
• Analytics and data processing operators
• Data stores
44.
Summary
Do more with your data acquisition flows
• Acquire and process data in real-time
• Enrich data from data marts
• Publish processed data as it arrives
• Multiple parallel processing paths (read once, process multiple times)
Move away from fragmented processes
• Unify data analytics and data processing/ETL flows
45.
Conclusion
The right next-gen CDC solution can make data ready for analytics as it arrives, in near real-time
CDC-based data integration is far more complex than the full export
and import of your database
A unified platform simplifies and reduces the complexities of
operationalizing CDC flows
46.
LIVE Q&A
For a free trial download or cloud access, visit www.StreamAnalytix.com
For any questions, contact us at inquiry@streamanalytix.com
Editor's Notes
#3 intro - 5 min
Poll - 2 mins
Saurabh
Background 2 min
Goals 3 min
Steps 3 min
Methods 3 min
Architectural considerations 2 min
Sameer
CDC with SAX 3 min
Deployment
Key Benefits 3 min
Demo - 10 min
Beyond CDC - 5 min
Q&A - 10 mins
#10 CDC minimizes the resources required for ETL (extract, transform, load) processes because it only deals with data changes. The goal of CDC is to ensure data synchronicity.
#42
DataGenerator: Generates a dummy record on start of the pipeline.
LoadStagingInHive: Runs a Hive SQL query to load a managed staging table from the HDFS incremental data generated by Pipeline #2.
MergeStagingAndMasterData: Runs a Hive "merge into" SQL statement that applies inserts, updates, and deletes based on the operation flag in the incremental data.
DropStagingHiveData: Runs a drop table command in Hive to drop the managed table loaded in the second step.