Moments after you move data into your Hadoop cluster or target database, new transactions on source systems make that data incomplete, and analyses done on that data inaccurate.
However, there are several strategies for keeping data in sync between data platforms.
View this webcast on-demand to learn about the advantages and disadvantages of various change data capture strategies, as well as:
• The latest improvements in Syncsort DMX and DMX-h
• The new DMX Change Data Capture software
• How Syncsort can help you keep your data analytics current and accurate
1. What’s New and Performance Tips
Paige Roberts, Big Data Product Marketing Manager
Ashwin Ramachandran, Big Data Product Manager
2. Agenda
What’s New and Coming Soon in Big Data
• What’s New in DMX/DMX-h version 9.5
• New Product: DMX Change Data Capture – Now GA in version 9.5!
• DataFunnel GUI – Now in beta!
• Lineage
• Big Data Quality
• DMX CDC and MIMIX Share
Strategies for Change Data Capture
• Advantages and Disadvantages of Various Strategies
– Versions, Dates
– Triggers
– Snapshot
– Log
How to Do Change Data Capture with Syncsort Software
• Snapshot-Based CDC with DMX/DMX-h
• Log-Based CDC with DMX Change Data Capture
Where to Find More Info on CDC
Syncsort Confidential and Proprietary - do not copy or distribute
3. WHAT’S NEW IN DMX/DMX-H
4. Combine batch and streaming data sources
Single Interface for Streaming & Batch
Spark 2!
Easy development in GUI. No need to write Scala, C or Java code.
Now supports cluster mode!
Simplify Streaming Data Integration
5. Progress Monitoring
Track the progress of DMX/DMX-h jobs as they’re running!
Settable time intervals
See exactly how fast jobs are running
Know how much memory and CPU jobs use at any point
Know when there’s a problem, even in the middle of long-running jobs
C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .
Timestamp: 2017-10-06 00:19:09
Status: RUNNING for 00:01:28
User: aramachandran
Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test
Memory: 32MB
CPU: 12%
/MVS/WWCDMX/AZR.VSM (Source): 7689557 records [1689372 records/sec], 246065824 bytes [5405992 bytes/sec]
Vsam_out.dat (Target): 7685704 records [1687590 records/sec], 245942528 bytes [54002880 bytes/sec]
C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .
Timestamp: 2017-10-06 00:19:11
Status: RUNNING for 00:01:30
User: aramachandran
Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test
Memory: 32MB
CPU: 12%
/MVS/WWCDMX/AZR.VSM (Source): 10718776 records [1514609 records/sec], 343000832 bytes [48467504 bytes/sec]
Vsam_out.dat (Target): 10716748 records [1515522 records/sec], 342935936 bytes [48496704 bytes/sec]
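The rates in the second sample above appear consistent with a per-interval calculation: the difference in counts between the two samples divided by the 2-second interval. The following Python sketch reproduces that arithmetic. The function name `interval_rate` is hypothetical, and the truncating division is an inference from these two samples, not documented dmsmonitor behavior.

```python
def interval_rate(prev_count: int, curr_count: int, interval_seconds: int) -> int:
    """Return items processed per second between two samples, truncated to an integer."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_count - prev_count) // interval_seconds


# Record counts copied from the two samples above (00:19:09 and 00:19:11).
source_rate = interval_rate(7689557, 10718776, 2)
target_rate = interval_rate(7685704, 10716748, 2)
print(source_rate, target_rate)
```

Computing the rate per interval rather than over the whole run is what lets a monitor show a slowdown in the middle of a long-running job.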
6. Access and Integration of Mainframe Data … We’re Simply the Best
Save MIPS by processing mainframe data on Hadoop
Read and write Mainframe record formats
– Fixed record length, variable record length, and variable record length with block descriptor
– Handle complex array structures like ODOs, even nested
– Interpret complex copybooks automatically
Write files to local or remote open systems via FTP, SFTP, Connect:Direct or HDFS
– Connect to external mainframe metadata like copybooks right on the mainframe with Connect:Direct
Store an unmodified archive copy for compliance and lineage tracking
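To illustrate what reading a fixed-length mainframe record format involves, here is a minimal Python sketch that slices a byte stream into fixed-length records and decodes each from EBCDIC. It is not how DMX does it. The record length, the sample data, and the choice of code page `cp037` are all assumptions, and the sketch handles only text fields, not packed decimals, ODO arrays, or copybook-driven layouts.

```python
from typing import Iterator

RECORD_LENGTH = 8  # hypothetical fixed record length in bytes


def read_fixed_records(data: bytes, record_length: int = RECORD_LENGTH) -> Iterator[str]:
    """Yield each fixed-length EBCDIC record decoded to a Python string."""
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of the record length")
    for offset in range(0, len(data), record_length):
        # cp037 is one common EBCDIC code page; real files may use another.
        yield data[offset:offset + record_length].decode("cp037")


# Build sample EBCDIC bytes by encoding text, so the example is self-contained.
sample = "ACCT0001ACCT0002".encode("cp037")
records = list(read_fixed_records(sample))
print(records)
```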
7. Hive Enhancements
Improvements to Hive support
JDBC connectivity
Support for partitioned tables: ORC, Parquet, AVRO, HDFS
Support for Truncate and Insert
Automatic creation of Hive and other Hcat supported tables
Direct distributed processing of Hive
Update of Hive statistics
Use Hive tables for lookups
8. Keybreak Processing Made Easy
• Running Totals
• Counters
• Group Numbering
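For readers unfamiliar with the term, keybreak processing means doing something each time the value of a sort key changes. The Python sketch below shows the three operations listed above on a small, already sorted input. The function name and sample rows are hypothetical, and this is a generic illustration rather than DMX syntax.

```python
from itertools import groupby
from operator import itemgetter


def keybreak(rows):
    """Annotate (key, amount) rows, sorted by key, with group number, counter and running total."""
    out = []
    grouped = groupby(rows, key=itemgetter(0))
    for group_number, (key, group) in enumerate(grouped, start=1):
        running_total = 0  # reset at every key break
        for counter, (_, amount) in enumerate(group, start=1):
            running_total += amount
            out.append((key, amount, group_number, counter, running_total))
    return out


rows = [("east", 10), ("east", 5), ("west", 7), ("west", 3), ("west", 1)]
result = keybreak(rows)
for row in result:
    print(row)
```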
10. Get Your Database Data into Hadoop at the Press of a Button
• Funnel hundreds of tables at once into your data lake
‒ Extract, map and move whole DB schemas in one invocation
‒ Extract from Oracle, DB2/z, MS SQL Server, Teradata, Netezza and Redshift
‒ To SQL Server, Postgres, Hive, HDFS, Redshift and Amazon S3
‒ Automatically create target Hive and HCat tables
• Process multiple funnels in parallel on edge node or data nodes
‒ Order data flows by dependencies
‒ Leverage DMX-h high performance data processing engine
• Extract only the data you want
‒ Data type filtering
‒ Table, record or column exclusion / inclusion
• In-flight transformations and cleansing
• User specified access methods: Native, ODBC or JDBC
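Ordering data flows by dependencies, as mentioned above, is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to show the idea. The funnel names and their dependencies are invented for illustration, and nothing here reflects how DataFunnel implements ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical funnels: each key maps to the set of funnels it depends on.
dependencies = {
    "orders": {"customers", "products"},
    "order_items": {"orders"},
    "customers": set(),
    "products": set(),
}

# static_order() yields every funnel after all of the funnels it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Funnels with no dependency between them, such as `customers` and `products` here, are the ones that could run in parallel.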
DMX DataFunnel™
Move thousands of tables in days, not weeks!
11. New User Experience for DataFunnel
12. New UI Wizard Flow Creation
16. First, we configure DMX to access and ingest data from a JSON source.
Second, DMX ingests data from a mainframe in EBCDIC format.
Third, DMX ingests data from an XML source.
DMX then merges these files into one consistent format.
At the same stage, DMX produces two exports:
• one simple text/csv output
• a first write to a Hive database.
DMX then invokes TSS to perform the Data Quality processing.
Comments: All of these source files have different field structures too.
17. Trillium Quality for Big Data
Easily Create Data Quality Workflows Without MapReduce or Spark Coding
Intelligent Execution enables deployment to Hadoop MapReduce and Spark
Verify and enrich global postal addresses using global postal reference sources
Enrich data from external, third-party sources to create comprehensive, unified records, enabling 360-degree views of the customer and other key business entities
Identify records that belong to the same domain (i.e., household or business)
Parse data values to their correct fields and standardize for better matching
Match like records and eliminate duplicates
18. DMX CHANGE DATA CAPTURE
19. Keep Mainframe Data in Sync with Hadoop in Real-Time
Keeps Hadoop data in sync with mainframe changes in real-time
• without overloading networks
• without incurring a high MIPS cost
• without affecting source database performance
• without coding or tuning
Dependable – Reliable transfer of data even during loss of mainframe connection or Hadoop cluster failure. Continue from failure point.
Fast – Both Hive data and table statistics updated in real-time. Does fast update and insert, even on Hive tables that don’t natively support it.
Flexible – Works with all Hive tables, including those backed by text, ORC, Parquet or Avro.
[Diagram labels: DB2, DMX Change Data Capture]
20. MIMIX Share Replicates Data in Real Time
Transforms and enhances data during replication
Minimizes bandwidth usage with LAN/WAN friendly replication
Ensures data integrity with conflict resolution and collision monitoring
Enables tracking and auditing of transactions for compliance
[Diagram labels: Source Database, Target Database, Real-Time Replication with Transformation, Change Data Capture (CDC), Conflict Resolution, Collision Monitoring, Tracking and Auditing]
21. STRATEGIES FOR CHANGE DATA CAPTURE
22. Why Do Change Data Capture?
Change Data Capture (CDC) is the process that ensures that changes made over time in one dataset are automatically transferred to another dataset.
Common data management scenarios where CDC is important:
Enterprise Data Warehouse (EDW)
Business Intelligence (BI)
EDW and/or Mainframe Optimization
Master Data Management
Data Quality
23. Different CDC Strategies
Timestamps or Version Numbers
Table Triggers
Snapshot or Table Comparison
Log Scraping
24. Advantages and Disadvantages of Timestamp or Version-Based CDC
Advantages
Simple
Nearly every database can query with a where clause.
Disadvantages
Must be built into database
Bloats database size
Query requires considerable compute resources in source database
Not always reliable
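As a minimal sketch of the timestamp approach, the Python example below uses an in-memory SQLite database and a where clause on a modification timestamp. The table, the column `last_modified`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ann", "2017-10-01 09:00:00"),
        (2, "Bob", "2017-10-05 14:30:00"),
        (3, "Cy", "2017-10-06 00:19:09"),
    ],
)


def changed_since(connection, high_water_mark):
    """Return rows modified after the last successful extract."""
    cursor = connection.execute(
        "SELECT id, name, last_modified FROM customers "
        "WHERE last_modified > ? ORDER BY last_modified",
        (high_water_mark,),
    )
    return cursor.fetchall()


changes = changed_since(conn, "2017-10-02 00:00:00")
print(changes)
```

One likely reason the approach is not always reliable: a row that is physically deleted leaves nothing behind for the where clause to find, and any application that updates a row without touching the timestamp column goes unnoticed.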
25. Advantages and Disadvantages of Trigger-Based CDC
Advantages
Very reliable and detailed
Changes can be captured almost as fast as they are made: real-time CDC.
Disadvantages
Significant drag on database resources, both compute and storage.
Requires that the database have the capability.
Negative impact on performance of applications that depend on the source database.
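A minimal sketch of trigger-based capture, again using Python with an in-memory SQLite database: three triggers copy every insert, update and delete into a change table. The table names and the single-letter operation codes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE accounts_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, balance INTEGER
);
CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance) VALUES ('I', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance) VALUES ('U', NEW.id, NEW.balance);
END;
CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_changes (op, id, balance) VALUES ('D', OLD.id, OLD.balance);
END;
""")

# Ordinary application writes; the triggers record each one as a side effect.
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 80 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, balance FROM accounts_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Every application write now performs a second write inside the same database, which is where the compute, storage and performance costs listed above come from.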
26. Advantages and Disadvantages of Snapshot-Based CDC
Advantages
Relatively easy to implement with good ETL software.
Requires no specialized knowledge of the source database.
Very dependable and accurate.
Disadvantages
Requires repeatedly moving all data in monitored tables. May impact target or staging system resources and network bandwidth.
Moving lots of data can be slow and may not meet SLAs.
Joining, comparing, and finding changes also takes time, making the process even slower.
Not a complete record of intermediate changes between snapshot captures.
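The comparison at the heart of snapshot-based CDC can be sketched in a few lines of Python. Snapshots are modeled here as dictionaries keyed by primary key. That representation, the function name and the sample rows are assumptions for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key; return (inserted, updated, deleted)."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


previous = {1: ("Ann", "NY"), 2: ("Bob", "LA"), 3: ("Cy", "SF")}
current = {1: ("Ann", "NY"), 2: ("Bob", "TX"), 4: ("Di", "DC")}

inserted, updated, deleted = diff_snapshots(previous, current)
print(inserted, updated, deleted)
```

The sketch also shows the last disadvantage: if row 2 had changed twice between the two snapshots, only the final value would appear in `updated`.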
27. Advantages and Disadvantages of Log-Based CDC
Advantages
Very reliable and detailed.
Virtually no impact on database or application performance.
Changes captured in real-time.
No database bloat.
Disadvantages
Every RDBMS has a different log format, often not documented.
Log formats often change between RDBMS versions.
Log files are frequently archived by the database. CDC software must read them before they’re archived, or be able to read the archived logs.
Requires specialized CDC software. Cannot be easily accomplished with ETL software.
28. TWO WAYS SYNCSORT DOES CDC
29. How Change Data Capture in DMX/DMX-h Works – Snapshot-based CDC
1. Capture: DMX or DMX-h pulls all data from tables that are being monitored for change. The Syncsort high-performance engine joins the new data with the previous snapshot and finds the data changes.
2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.
3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.
[Diagram labels: Source Database, Edge Node or Server, Staged Data, Snapshot]
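The Process step, splitting one delta stream into parallel load streams, could be sketched as follows in Python. The slides do not say how DMX-h partitions the stream. Hashing the record key with CRC32 is an assumption chosen here because it keeps all changes for one key in the same stream and in their original order.

```python
import zlib


def split_stream(changes, stream_count):
    """Distribute (key, payload) changes across stream_count lists by key hash."""
    streams = [[] for _ in range(stream_count)]
    for key, payload in changes:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(str(key).encode("utf-8")) % stream_count
        streams[index].append((key, payload))
    return streams


# Hypothetical delta records: (primary key, new value).
changes = [(1, "a"), (2, "b"), (1, "c"), (3, "d"), (2, "e")]
streams = split_stream(changes, 3)
for number, stream in enumerate(streams):
    print(number, stream)
```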
30. How DMX Change Data Capture Works – Log-based CDC
1. Capture: The DMX CDC engine scrapes the DB2 logs and stores only the delta, the data that has changed, and flags it as Updated, Deleted or Inserted. Virtually no MIPS usage.
2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.
3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.
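The Apply step, replaying flagged delta records against a target, can be sketched in Python as below. The target is modeled as a dictionary keyed by primary key and each change as a `(flag, key, row)` tuple. Both are assumptions for illustration and say nothing about how DMX-h writes to Hive.

```python
def apply_changes(target, changes):
    """Apply (flag, key, row) change records to target in order and return it."""
    for flag, key, row in changes:
        if flag in ("Inserted", "Updated"):
            target[key] = row
        elif flag == "Deleted":
            target.pop(key, None)  # ignore deletes for keys that are already gone
        else:
            raise ValueError(f"unknown change flag: {flag}")
    return target


target = {1: "Ann", 2: "Bob"}
changes = [("Updated", 2, "Robert"), ("Inserted", 3, "Cy"), ("Deleted", 1, None)]
apply_changes(target, changes)
print(target)
```

Because the changes are applied in log order, the target ends in the same state as the source even when one key is changed several times.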
31. What Next?
Find out more about DMX Change Data Capture
http://www.syncsort.com/en/Products/BigData/DMX-Change-Data-Capture
Contact Syncsort sales to get the latest info: http://www.syncsort.com/en/ContactSales