Planning your Next-Gen Change Data Capture (CDC) Architecture in 2019 - StreamAnalytix Webinar
The document presents an overview of Change Data Capture (CDC) architecture, detailing its definition, various implementation methods, and use cases in enterprise environments. It highlights the transition from legacy systems to big data integrations, outlines challenges, and describes the architectural considerations for efficient CDC. Additionally, it emphasizes the role of tools like StreamAnalytix in simplifying CDC workflows for real-time analytics and data processing.
Agenda
What is CDC?
Various methods for CDC in the enterprise data warehouse
Key considerations for implementing a next-gen CDC architecture
Demo
Q&A
4.
About Impetus
We exist to create powerful and intelligent enterprises through deep
data awareness, data integration, and data analytics.
5.
About Impetus
Many of North America’s most respected and well-known brands trust
us as their strategic big data and analytics partner.
6.
End-to-End Big Data Solutions
• Transformation: Legacy EDW to big data/cloud
• Unification: Data processing, preparation, and access
• Analytics: Real-time, machine learning, and AI
• Self-service: BI on big data/cloud
7.
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
• Continuous ingestion in the data lake
• Capturing streaming data changes
• Database migration to cloud
• Data preparation for analytics and ML jobs
• We still have a legacy system
What Does CDC Mean for the Enterprise?
[Diagram: the legacy pattern batch-replicates, filters, and transforms data from RDBMS, data warehouse, and file sources into an RDBMS or data warehouse; the modern pattern captures changes in real time and streams them incrementally, in-memory, into Hadoop.]
16.
Modern CDC Applications
• Data lake: Continuous ingestion and pipeline automation
• Streaming: Data changes to Kafka, Kinesis, or other queues
• Cloud: Data workload migration
• Business applications: Data preparation for analytics and ML jobs
• Legacy systems: Data delivery and query offload
17.
Methods of Change Data Capture
• Database triggers (see the sketch below)
• Data modification stamping
• Log-based CDC
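A minimal sketch of the trigger-based method, assuming a hypothetical ORDERS table and Oracle PL/SQL syntax (not from the original slides): each insert, update, or delete copies a row into a shadow change table that downstream ETL can poll.

-- Hypothetical trigger-based CDC on an ORDERS table (Oracle PL/SQL syntax).
-- Every change writes a row into a shadow change table.
CREATE TABLE orders_changes (
  order_id    NUMBER,
  operation   CHAR(1),                         -- 'I', 'U', or 'D'
  changed_at  TIMESTAMP DEFAULT SYSTIMESTAMP
);

CREATE OR REPLACE TRIGGER trg_orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  IF INSERTING THEN
    INSERT INTO orders_changes (order_id, operation) VALUES (:NEW.order_id, 'I');
  ELSIF UPDATING THEN
    INSERT INTO orders_changes (order_id, operation) VALUES (:NEW.order_id, 'U');
  ELSE
    INSERT INTO orders_changes (order_id, operation) VALUES (:OLD.order_id, 'D');
  END IF;
END;
/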
Date Modification Stamping
Transactional applications keep track of metadata in every row
• Tracks when the row was created and last modified
• Enables filtering on the DATE_MODIFIED column (see the sketch below)
Challenges
• There is no DATE_MODIFIED value for a deleted row
• DATE_MODIFIED is often maintained by triggers, which adds overhead
• Extraction is resource intensive
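A minimal sketch of the resulting incremental extract, assuming a hypothetical ORDERS table and a saved high-water mark from the previous run:

-- Hypothetical incremental extract: pull only rows changed since the last run.
SELECT *
FROM   orders
WHERE  date_modified > :last_extract_time;   -- high-water mark from the previous run
-- Rows deleted since the last run never appear in this result set,
-- which is the first challenge listed above.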
20.
Log-Based CDC
Uses transactional logs (see the LogMiner sketch below)
Challenges
• Interpreting transaction logs
• Vendors provide no direct interface to the transaction log
• Agents and interfaces change with new database versions
• Supplemental logging increases the volume of data
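As a hedged illustration of log-based capture, a sketch of an Oracle LogMiner session (schema name and time window are placeholders; CDC tools such as those mentioned later in the demo automate this):

-- Supplemental logging must be enabled so the redo log carries enough row data
-- (this is the "increases volume of data" challenge above).
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Start a LogMiner session over the last hour of redo, using the online catalog.
BEGIN
  DBMS_LOGMNR.START_LOGMNR(
    STARTTIME => SYSDATE - 1/24,
    ENDTIME   => SYSDATE,
    OPTIONS   => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG + DBMS_LOGMNR.CONTINUOUS_MINE);
END;
/

-- Each row is one captured change; SQL_REDO can be replayed downstream.
SELECT seg_owner, table_name, operation, sql_redo
FROM   v$logmnr_contents
WHERE  seg_owner = 'SALES';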
Next-Gen Architecture Considerations
• Ease of use: pre-packaged operators, extensibility, modern user experience
• Real-time: change data capture, stream live updates, optimized for high performance
• Hybrid: multiple vendors, on-premise and cloud, databases, data warehouses, and data lakes
23.
Value Proposition of CDC
• Incremental update efficiency
• Source/production impact
• Time to value
• TCO
• Scale and flexibility
24.
What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
• Continuous ingestion in the data lake: 46%
• Capturing streaming data changes: 58%
• Database migration to cloud: 38%
• Data preparation for analytics and ML jobs: 35%
• We still have a legacy system: 46%
25.
ETL, Real-time Stream Processing, and Machine Learning Platform
+ A Visual IDE for Apache Spark
26.
CDC with StreamAnalytix
• Turnkey adapters for CDC vendors
• ETL and data wrangling visual operators
• Elastic compute
[Diagram: CDC streams, unstructured data streams, file stores, and structured data stores are transformed, enriched, and reconciled, then delivered to structured data stores, message queues, Hadoop/Hive, and cloud storage and data warehouses.]
CDC Capabilities in StreamAnalytix
Turnkey reconciliation feature for Hadoop offload
30.
CDC Capabilities in StreamAnalytix
• Large set of visual operators for ETL, analytics, and stream processing
• Zero-code approach to ETL design
• Built-in NFR (non-functional requirements) support
31.
StreamAnalytix CDC Solution Design
StreamAnalytix Workflow
A complete CDC solution has three parts, each modelled as a StreamAnalytix pipeline:
1. Data ingestion and staging: stream data from Attunity Replicate (via Kafka) or LogMiner for multiple tables, and store the raw data in HDFS
2. Data de-normalization: join transactional data with data at rest and store the de-normalized data on HDFS
3. Incremental updates in Hive: merge previously processed transactional data with new incremental updates
32.
Pipeline #1: Data Ingestion and Staging (Streaming)
Data ingestion via Attunity 'channel': reads the data that Attunity Replicate delivers to Kafka; configured to read data feeds and metadata from separate topics
Data enrichment: enriches incoming data with metadata and an event timestamp
HDFS: stores CDC data in an HDFS landing area using the out-of-the-box (OOB) HDFS emitter; files are rotated based on time and size
35.
Pipeline #2: Data De-normalization (Batch)
HDFS data channel: ingests incremental data from previous runs of the staging location and reads reference data (data at rest) from a fixed HDFS location
Performs an outer join to merge incremental and static data (sketched below)
Stores the de-normalized data to an HDFS directory
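An illustrative Hive SQL equivalent of this batch join (table, column, and path names are hypothetical; in StreamAnalytix the same logic is built with visual operators, and the exact join type depends on the data model):

-- Join incremental transactional rows with reference data at rest and
-- write the flattened result to an HDFS directory for Pipeline #3.
INSERT OVERWRITE DIRECTORY '/data/cdc/denormalized'
SELECT i.order_id,
       i.op,                -- 'I' / 'U' / 'D' operation flag carried from the CDC feed
       i.amount,
       r.customer_name,
       r.customer_city
FROM   orders_incremental i
LEFT OUTER JOIN customers_ref r
  ON   i.customer_id = r.customer_id;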
39.
Pipeline #3: Incremental Updates in Hive (Batch)
Load step: runs a Hive SQL query to load a managed table from the HDFS incremental data generated by Pipeline #2
Reconciliation step: a Hive "merge into" SQL statement performs insert, update, and delete operations based on the operation flag in the incremental data (sketched below)
Clean-up step: runs a drop table command on the managed table to clean up processed data and avoid repeated processing
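A hedged Hive SQL sketch of these three steps (assumes a transactional/ACID target table; table and column names are illustrative, not the webinar's actual schema):

-- Load step: expose the de-normalized Pipeline #2 output as a managed staging table.
CREATE TABLE customer_staging AS
SELECT * FROM customer_incremental_ext;   -- external table over the Pipeline #2 HDFS directory

-- Reconciliation step: apply inserts, updates, and deletes based on the operation flag.
MERGE INTO customer_master t
USING customer_staging s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('I', 'U') THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT VALUES (s.customer_id, s.name, s.city);

-- Clean-up step: drop the managed staging table so the same increment is not processed twice.
DROP TABLE IF EXISTS customer_staging;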
42.
Workflow: Oozie Coordinator Job
An Oozie orchestration flow, created using the StreamAnalytix web studio, combines Pipeline #2 and Pipeline #3 into a single Oozie flow that can be scheduled as shown here.
43.
Demo
Data channels
• Attunity Replicate and LogMiner
Data processing pipeline walkthrough
• Data filters and enrichment
• Analytics and data processing operators
• Data stores
44.
Summary
Do more with your data acquisition flows
• Acquire and process data in real-time
• Enrich data from data marts
• Publish processed data as it arrives
• Multiple parallel processing paths (read once, process multiple times)
Move away from fragmented processes
• Unify data analytics and data processing/ETL flows
45.
Conclusion
The right next-gen CDC solution can make data ready for analytics as it arrives, in near real-time
CDC-based data integration is far more complex than the full export
and import of your database
A unified platform simplifies and reduces the complexities of
operationalizing CDC flows
46.
LIVE Q&A
For a free trial download or cloud access, visit www.StreamAnalytix.com
For any questions, contact us at inquiry@streamanalytix.com
Editor's Notes
#3 intro - 5 min
Poll - 2 mins
Saurabh
Background 2 min
Goals 3 min
Steps 3 min
Methods 3 min
Architectural considerations 2 min
Sameer
CDC with SAX 3 min
Deployment
Key Benefits 3 min
Demo - 10 min
Beyond CDC - 5 min
Q&A - 10 mins
#10 CDC minimizes the resources required for ETL (extract, transform, load) processes because it only deals with data changes. The goal of CDC is to ensure data synchronicity.
#42
DataGenerator: Generates a dummy record on start of the pipeline.
LoadStagingInHive: Runs a Hive SQL query to load a managed staging table from the HDFS incremental data generated by Pipeline #2.
MergeStagingAndMasterData: Runs a Hive "merge into" SQL statement that applies inserts, updates, and deletes based on the operation flag in the incremental data.
DropStagingHiveData: Runs a drop table command in Hive to drop the managed table loaded in the second step.