Join Appfluent and Syncsort to learn why Hadoop means more data savings and less data warehouse spend. In this presentation, you'll discover how to easily offload storage and processing to Hadoop and save millions.
Santosh Chitakki, Vice President, Products at Appfluent
Steve Totman, Director of Strategy at Syncsort
Presentation + Demo
The Data Warehouse Vision: A Single Version of The Truth
The Data Warehouse Reality:
• A small sample of structured data, only somewhat available
• Takes months to make any changes/additions
• Costs millions every year
ELT Processing Is Driving Exponential Database Costs
The True Cost of ELT
Manual coding/scripting costs
Ongoing manual tuning costs
Higher storage costs
Hurts query performance
Hinders business agility
And What if…?
• The batch window is delayed or needs to be re-run?
• Demands increase, causing more overlap between queries & the batch window?
• A critical business requirement results in longer/heavier queries?
Dormant Data Makes the Problem Even Worse
Transformations (ELT) of unused data
Storage capacity for dormant data
Majority of data in data warehouse is unused/dormant
ETL/ELT processes for unused data unnecessarily consuming CPU capacity
Dormant data consuming unnecessary storage capacity
Eliminate batch loads that aren't needed
Load and store unused data in Hadoop for active archival
The Impact of ELT & Dormant Data
• Slow response times
• With 40–60% of capacity used for ELT, fewer resources and less storage are available for end users
• Only the freshest data is stored "on-line"
• Historical data is archived (retention as low as 3 months)
• Granularity is lost across Hot / Warm / Cold / Dead tiers
• 6 months (on average) to add a new data source/column & generate a new report
• The best resources work on SQL tuning, not new SQL
• Data volume growth absorbs all resources just to keep existing analyses running and performing
• Exploration of data becomes a wish-list item
Offloading The Data Warehouse to Hadoop
[Diagram: ETL/ELT and Analytic Query & Reporting workloads moving off the data warehouse into Hadoop]
20% of ETL Jobs Can Consume up to 80% of Resources
ETL is “T” intensive
– Sort, Join, Merge, Aggregate, Partition
Mappings start simple
– Performance demands add complexity
– business logic gets “distributed”
– Impossible to govern
– Prohibitively expensive to maintain
High impact: start with the greatest pain – focus on the 20% of jobs consuming the most resources. (A toy sketch of this "T"-heavy join-and-aggregate pattern follows below.)
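To make the "T" concrete, here is a minimal Python sketch of the Join + Aggregate pattern named above; the table names, columns, and values are invented purely for illustration, not taken from any customer workload:

```python
from collections import defaultdict

# Toy data standing in for warehouse tables (hypothetical names/columns).
sales = [(1, "A", 100.0), (1, "B", 40.0), (2, "A", 75.0)]  # (customer_id, product, amount)
customers = {1: "EMEA", 2: "AMER"}                          # customer_id -> region

# Join each sale to its customer's region, then aggregate revenue by region.
# Run as ELT inside the warehouse, this is the work that burns CPU capacity;
# offloaded, the same logic runs as a batch job in Hadoop instead.
revenue_by_region = defaultdict(float)
for customer_id, _product, amount in sales:
    revenue_by_region[customers.get(customer_id, "UNKNOWN")] += amount

print(dict(revenue_by_region))  # {'EMEA': 140.0, 'AMER': 75.0}
```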
Transform the economics of data
Cost of managing 1TB of data: $15,000 – $80,000
But there’s more…
Scalability for longer data retention
Appfluent transforms the economics of Big Data and Hadoop.
We are the only company that can completely analyze how
data is used to reduce costs and optimize performance.
Uncover the Who, What, When,
Where, and How of your data
For 40 years we have been helping companies solve their big data
issues…even before they knew the name Big Data!
• Speed leader in Big Data processing
– Fastest sort technology in the world
• Powering 50% of mainframes' sort workloads
• First-to-market, fully integrated approach to Hadoop ETL
• A history of innovation
– 25+ issued & pending patents
• Large global customer base
– 15,000+ deployments in 68 countries
Our customers are achieving the impossible, every day!
Syncsort DMX-h – Enabling the Enterprise Data Hub
Blazing Performance. Iron Security. Disruptive Economics.
• Access – One tool to access all your data, even mainframe
• Offload – Migrate complex ELT workloads to Hadoop without coding
• Accelerate – Seamlessly optimize new & existing batch workloads in Hadoop
• Smarter Architecture – ETL engine runs natively within MapReduce
• Smarter Productivity – Use Case Accelerators for common ETL tasks
• Smarter Security – Enterprise-grade security
How to Offload Workload & Data
Workload: identify costly transformations → rewrite the transformations in DMX-h → run the costliest transformations in Hadoop.
Data: identify dormant data → move dormant-data ELT to Hadoop → store and manage the dormant data in Hadoop.
Along the way, identify performance opportunities – and repeat regularly for maximum savings.
• Identify expensive transformations such as ELT to offload to Hadoop (a ranking sketch follows this list).
• Identify unused tables to find the pointless transformations loading them; move them to Hadoop or eliminate them.
• Identify unused historical data (by the date functions used) and move its loading & data to Hadoop.
• Discover costly end-user activity and re-direct those workloads to Hadoop.
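As a rough illustration of the first step, the sketch below ranks ELT targets by CPU consumed, assuming a query log exported as CSV with hypothetical columns statement_type, target_table, and cpu_seconds (Appfluent captures this usage data automatically; adapt the format to whatever query log your warehouse produces):

```python
import csv
from collections import defaultdict

# Accumulate CPU burned by transformation statements, per target table.
cpu_by_target = defaultdict(float)
with open("query_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        # ELT shows up as INSERT/SELECT, UPDATE, MERGE, CREATE TABLE AS, etc.
        if row["statement_type"].upper() in {"INSERT", "UPDATE", "MERGE", "CTAS"}:
            cpu_by_target[row["target_table"]] += float(row["cpu_seconds"])

# Rank targets by CPU: the top of this list is the 20% of jobs that
# typically consume up to 80% of resources -- the offload candidates.
for table, cpu in sorted(cpu_by_target.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{table}: {cpu:,.0f} CPU-seconds of ELT")
```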
Costly End-User Activity
Find resource-consuming end-user workloads and offload the relevant data sets and activity to Hadoop.
Example: identify SAS data extracts (i.e., SAS queries with no WHERE clause) – a rough detection sketch follows below.
In this case, the SAS data extracts identified were consuming 300 hours of server time.
Identify the data sets associated with those extracts, replicate the identified data in Hadoop, and offload the associated SAS workloads.
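A minimal sketch of that detection step, assuming you already have the captured SQL text; the single-regex normalization is a deliberate simplification of real SQL parsing:

```python
import re

def is_full_extract(sql: str) -> bool:
    """Flag SELECT statements with no WHERE clause (full-table extracts)."""
    normalized = re.sub(r"\s+", " ", sql).strip().upper()
    return normalized.startswith("SELECT") and " WHERE " not in normalized

# Hypothetical captured statements -- in practice these come from the
# warehouse's query log or from Appfluent's workload capture.
captured = [
    "SELECT * FROM dw.transactions",  # full extract: offload candidate
    "SELECT id, amt FROM dw.transactions WHERE dt >= CURRENT_DATE - 7",
]
for sql in captured:
    if is_full_extract(sql):
        print("Offload candidate:", sql)
```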
Identify expensive transformations such as ELT to offload to Hadoop.
In this example, one ELT process was consuming 65% of CPU time and 66% of I/O. Drill into the process to identify the expensive transformations to offload.
Identify unused tables to move to Hadoop, and offload the batch loads for that unused data into Hadoop (a set-difference sketch follows below).
Example findings: 87% of tables unused; the largest unused table held 2 billion records; unused columns within tables.
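A minimal sketch of the dormant-table check, assuming two hypothetical exports: all_tables.txt (one catalog table name per line) and query_log.csv with a referenced_table column. In practice, Appfluent derives this from the live workload rather than from file exports:

```python
import csv

# Catalog inventory: every table defined in the warehouse.
with open("all_tables.txt") as f:
    all_tables = {line.strip().lower() for line in f if line.strip()}

# Usage: every table actually referenced by queries over the review window.
with open("query_log.csv", newline="") as f:
    used_tables = {row["referenced_table"].lower() for row in csv.DictReader(f)}

# Anything defined but never queried is a candidate for archival in Hadoop --
# along with the batch loads that keep refreshing it.
dormant = sorted(all_tables - used_tables)
print(f"{len(dormant)} of {len(all_tables)} tables unused; archive candidates:")
for table in dormant[:20]:
    print(" ", table)
```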
2. Access & Move Virtually Any Data
One Tool to Quickly and Securely Move All Your
Data, Big or Small. No Coding, No Scripting
Connect to Any Source & Target
Extract & Load to/from Hadoop
• Extract data & load it into the cluster natively from within Hadoop, or execute "off-cluster" on an ETL server
• Load data warehouses directly from Hadoop – no need for temporary landing areas
PLUS… Mainframe Connectivity
• Directly read mainframe data
• Parse & translate
• Load into HDFS
Pre-process & Compress
• Cleanse, validate, and partition for parallel processing
• Compress for storage savings (a decode-and-compress sketch follows below)
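As a rough stand-in for what DMX-h does natively (including COBOL copybook parsing), the sketch below translates fixed-width EBCDIC records to UTF-8 and compresses them ahead of an HDFS load; the cp037 code page, 80-byte record length, and file names are assumptions for illustration:

```python
import codecs
import gzip

RECORD_LENGTH = 80  # assumed fixed-width mainframe record size

with open("mainframe_extract.dat", "rb") as src, \
        gzip.open("extract.txt.gz", "wt", encoding="utf-8") as dst:
    while (record := src.read(RECORD_LENGTH)):
        # cp037 is the US/Canada EBCDIC code page in Python's codecs module.
        dst.write(codecs.decode(record, "cp037").rstrip() + "\n")

# Then land the compressed file in the cluster, e.g.:
#   hdfs dfs -put extract.txt.gz /data/landing/
```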
3. Offload Heavy Transformations to Hadoop
Easily Replicate & Optimize Existing Workloads in Hadoop.
No Coding. No Scripting.
Develop MapReduce ETL processes without writing code (for contrast, a hand-coded Hadoop Streaming equivalent is sketched below)
Leverage existing ETL skills
Develop and test locally in Windows; deploy in Hadoop
Use Case Accelerators to fast-track development
File-based metadata: create once, reuse many times!
Development accelerators for CDC and other common data flows
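For contrast with the no-coding approach above, here is what even a simple aggregate-by-key looks like when hand-coded as a Hadoop Streaming job in Python; the tab-separated layout and column positions are assumptions:

```python
#!/usr/bin/env python3
# mapper.py -- emit (customer_id, amount) from tab-separated input records.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    customer_id, amount = fields[0], fields[2]  # assumed column layout
    print(f"{customer_id}\t{amount}")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum amounts per key (Hadoop sorts mapper output by key first).
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

Submitted with the standard streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/sales -output /data/revenue. Every additional join or validation multiplies this hand-written code, which is the maintenance burden the graphical approach avoids.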
Appfluent Offload Success
Large Financial Organization
• IBM DB2 Enterprise Data Warehouse (EDW) growing too quickly
• DB2 EDW upgrade/expansion too expensive
• Found the fully burdened cost per terabyte of Hadoop to be 5x lower than DB2
• Created business program called ‘Data Warehouse Modernization’
• Deployed Cloudera to extend EDW capacity
• Used Appfluent to find migration candidates to move to Hadoop
• Capped DB2 EDW at 200TB capacity and have not expanded it since
• Saved $MM that would have been spent on additional DB2
• Positioned to handle faster rates of data growth in the future
Offloading the EDW at Leading Financial Organization
[Chart: elapsed time in minutes]
• Offload ELT processing from Teradata into CDH using DMX-h
• Implement a flexible architecture for staging and change data capture
• Ability to pull data directly from the mainframe
• No coding; easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows
Impact on Loans Application Project:
• Cut development time from 12 man-weeks to 4
• Reduced complexity: from 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated the need for Java user-defined functions
Three Quick Takeaways
1. ELT and dormant data are driving data
warehouse cost and capacity constraints
2. Offloading heavy transformations and "cold" data to Hadoop provides fast savings at a fraction of data warehouse costs
3. Follow these 3 steps:
– Identify dormant data and pinpoint heavy ELT workloads; focus on the top 20%
– Access and move data to Hadoop
– Deploy the new workloads in Hadoop
The Data Warehouse Vision: A Single Version of The Truth
Sign up for a Data Warehouse Offload assessment!
Our experts will help you:
Collect critical information about your EDW environment
Identify migration candidates & determine feasibility
Develop an offload plan & establish business case