Offload the Data Warehouse in the Age of Hadoop


Join Appfluent and Syncsort to learn why Hadoop means more data savings and less data warehouse. In this presentation, you’ll discover how to easily offload storage and processing to Hadoop to save millions.

Published in: Technology


1. Presentation + Demo
Santosh Chitakki, Vice President, Products at Appfluent (schitakki@appfluent.com)
Steve Totman, Director of Strategy at Syncsort (@steventotman, stotman@syncsort.com)
2. The Data Warehouse Vision: A Single Version of The Truth
(Diagram: sources such as Oracle, files, XML, ERP, mainframe, and real-time feeds flow through ETL into the Enterprise Data Warehouse, which feeds the Data Marts.)
3. The Data Warehouse Reality
• A small sample of structured data, somewhat available
• Takes months to make any changes/additions
• Costs millions every year
(Diagram: the same architecture, now strained by ELT, new reports, dead data, missed SLAs, new columns, and granular history.)
4. ELT Processing Is Driving Exponential Database Costs
The true cost of ELT (transformations competing with queries and analytics for the same machine):
• Manual coding/scripting costs
• Ongoing manual tuning costs
• Higher storage costs
• Hurts query performance
• Hinders business agility
And what if…
• The batch window is delayed or needs to be re-run?
• Demands increase, causing more overlap between queries and the batch window?
• A critical business requirement results in longer/heavier queries?
5. Dormant Data Makes the Problem Even Worse
Data splits into hot, warm, and cold tiers, yet all of it drives ELT and storage cost:
• The majority of data in the data warehouse is unused/dormant
• ETL/ELT processes for unused data unnecessarily consume CPU capacity
• Dormant data consumes unnecessary storage capacity
• Eliminate batch loads that are not needed
• Load and store unused data in Hadoop for active archival
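The hot/warm/cold triage above can be sketched in a few lines. This is a minimal illustration, not Appfluent's actual method; the table names, access dates, and the 30/180-day thresholds are all assumptions.

```python
# Sketch: bucket tables into hot/warm/cold by days since last access.
# Table names, dates, and thresholds are illustrative assumptions.
from datetime import date

def temperature(last_access: date, today: date) -> str:
    days = (today - last_access).days
    if days <= 30:
        return "hot"
    if days <= 180:
        return "warm"
    return "cold"  # candidate for active archival in Hadoop

tables = {
    "orders_current": date(2014, 5, 1),
    "orders_2009": date(2012, 1, 15),
    "customer_dim": date(2014, 4, 20),
}
today = date(2014, 5, 10)
buckets = {name: temperature(d, today) for name, d in tables.items()}
print(buckets)  # "cold" tables are the offload candidates
```

In practice the last-access date would come from the warehouse's query log rather than a hand-built dictionary.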
6. The Impact of ELT & Dormant Data
Missing SLAs
• Slow response times
• With 40-60% of capacity used for ELT, fewer resources and less storage are available for end-user reports
Data retention windows
• Only the freshest data is stored online
• Historical data is archived (after as little as 3 months)
• Granularity is lost across the hot/warm/cold/dead tiers
Lack of agility
• 6 months (on average) to add a new data source or column and generate a new report
• The best resources are spent on SQL tuning, not new SQL creation
Constant upgrades
• Data volume growth absorbs all resources just to keep existing analysis running and perform upgrades
• Exploration of data becomes a wish-list item
7. Offloading the Data Warehouse to Hadoop
(Diagram: Before, data sources feed ETL into the data warehouse, which also runs ETL/ELT and serves analytic query & reporting. After, the heavy ETL/ELT work is offloaded so the data warehouse serves business intelligence and analytic query & reporting.)
8. 20% of ETL Jobs Can Consume up to 80% of Resources
• ETL is "T" intensive: sort, join, merge, aggregate, partition
• Mappings start simple: performance demands add complexity, and business logic gets "distributed"
• "Spaghetti" architecture: impossible to govern, prohibitively expensive to maintain
• High impact: start with the greatest pain and focus on the 20%
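The 20/80 claim above is easy to check against a job inventory: rank jobs by resource use and measure the top quintile's share. A minimal sketch; the job names and CPU figures are made up for illustration.

```python
# Sketch: rank ETL jobs by CPU seconds and measure the share consumed
# by the top 20%. All job names and numbers are invented.
jobs = {
    "load_orders": 9000, "build_agg_daily": 7500, "merge_customers": 600,
    "sort_events": 450, "load_dims": 300, "copy_refdata": 150,
    "export_feed": 100, "misc_1": 80, "misc_2": 70, "misc_3": 50,
}
ranked = sorted(jobs.items(), key=lambda kv: kv[1], reverse=True)
top_n = max(1, len(ranked) // 5)  # top 20% of jobs by count
top_share = sum(cpu for _, cpu in ranked[:top_n]) / sum(jobs.values())
print(f"Top {top_n} job(s) consume {top_share:.0%} of CPU")
```

With a skewed distribution like this one, the top two of ten jobs account for roughly 90% of the CPU, which is why the slides recommend focusing on that 20% first.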
9. The Opportunity
Transform the economics of data. Cost of managing 1 TB of data:
• EDW: $15,000 – $80,000
• Hadoop: $2,000 – $6,000
But there's more: scalability for longer data retention, performance SLAs, and business agility.
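The per-TB ranges above translate directly into a savings estimate. A back-of-the-envelope sketch using the midpoints of the quoted ranges; the 100 TB offload size is an assumption, not a figure from the deck.

```python
# Sketch: estimated savings from moving data off the EDW, using the
# per-TB cost ranges quoted on this slide (midpoints).
edw_cost_per_tb = (15_000 + 80_000) / 2     # $47,500 midpoint
hadoop_cost_per_tb = (2_000 + 6_000) / 2    # $4,000 midpoint
offload_tb = 100                            # assumed volume of dormant data
savings = offload_tb * (edw_cost_per_tb - hadoop_cost_per_tb)
print(f"Moving {offload_tb} TB saves roughly ${savings:,.0f}")
```

Even at the low ends of both ranges, moving tens of terabytes yields seven-figure savings, which is the economic argument the rest of the deck builds on.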
10. Why Appfluent?
Appfluent transforms the economics of Big Data and Hadoop. We are the only company that can completely analyze how data is used to reduce costs and optimize performance.
11. Appfluent Visibility
Uncover the who, what, when, where, and how of your data.
12. Why Syncsort?
For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!
• Speed leader in Big Data processing: the fastest sort technology in the market
• Powering 50% of mainframes' sort
• First-to-market, fully integrated approach to Hadoop ETL
• A history of innovation: 25+ issued & pending patents
• Large global customer base: 15,000+ deployments in 68 countries
Our customers are achieving the impossible, every day!
13. Syncsort DMX-h: Enabling the Enterprise Data Hub
Blazing performance. Iron security. Disruptive economics.
• Access: one tool to access all your data, even mainframe
• Offload: migrate complex ELT workloads to Hadoop without coding
• Accelerate: seamlessly optimize new & existing batch workloads in Hadoop
Plus:
• Smarter architecture: the ETL engine runs natively within MapReduce
• Smarter productivity: Use Case Accelerators for common ETL tasks
• Smarter security: enterprise-grade security
14. How to Offload Workload & Data
1. Identify costly transformations and dormant data
2. Rewrite transformations in DMX-h, identify performance opportunities, and move dormant-data ELT to Hadoop
3. Run the costliest transformations and store and manage dormant data in Hadoop
4. Repeat regularly for maximum results
15. Step 1: Identify
• Expensive transformations: identify expensive transformations such as ELT to offload to Hadoop
• Unused data: identify unused tables to find the useless transformations loading them; move them to Hadoop or purge
• Cold historical data: identify unused historical data (by the date functions used) and move both the loading and the data to Hadoop
• Costly end-user activity: discover costly end-user activity and redirect those workloads to Hadoop
16. Costly End-User Activity
Find resource-consuming end-user workloads and offload the data sets and activity to Hadoop.
Example: identify SAS data extracts (i.e., SAS queries with no WHERE clause).
• SAS data extracts identified consuming 300 hours of server time
• Identify the data sets associated with those extracts
• Replicate the identified data in Hadoop and offload the associated SAS workload
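The SAS-extract example above (queries with no WHERE clause) can be approximated with a simple log scan. A rough sketch only, not Appfluent Visibility's actual analysis: the regex check is naive and the log lines are invented.

```python
# Sketch: flag full-table extract queries (SELECTs with no WHERE clause)
# in a query log. Log lines are invented; real SQL needs a real parser.
import re

def is_full_extract(sql: str) -> bool:
    s = sql.strip().lower()
    return s.startswith("select") and not re.search(r"\bwhere\b", s)

log = [
    "SELECT * FROM transactions",                                      # full extract
    "SELECT cust_id, amt FROM transactions WHERE dt >= '2014-01-01'",  # filtered
    "SELECT * FROM customer_dim",                                      # full extract
]
extracts = [q for q in log if is_full_extract(q)]
print(len(extracts), "full-table extracts found")
```

Each flagged query points at a data set worth replicating in Hadoop so the extract workload can follow it there.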
17. Expensive Transformations
Identify expensive transformations such as ELT to offload to Hadoop.
• Example: an ELT process consuming 65% of CPU time and 66% of I/O
• Drill into the process to identify the expensive transformations to offload
18. Unused Data
Identify unused tables to move to Hadoop, and offload the batch loads for that unused data into Hadoop.
• 87% of tables unused
• Largest unused table: 2 billion records
• Unused columns within tables
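Finding unused tables like the 87% above amounts to diffing the catalog against the tables actually referenced in the query log. A minimal sketch with an invented catalog and log; a real implementation would read the warehouse's query history instead of string-matching.

```python
# Sketch: diff the table catalog against tables referenced in the query
# log to find unused tables. Catalog and log contents are invented.
import re

catalog = {"orders", "orders_2009", "customer_dim", "clickstream_raw"}
query_log = [
    "SELECT * FROM orders o JOIN customer_dim c ON o.cust_id = c.cust_id",
    "SELECT count(*) FROM orders",
]
referenced = set()
for q in query_log:
    referenced |= {t for t in catalog if re.search(r"\b%s\b" % re.escape(t), q)}
unused = catalog - referenced
print(sorted(unused))  # candidates to move to Hadoop or purge
```

The batch jobs that load the unused tables are the "useless transformations" slide 15 says to hunt down alongside the tables themselves.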
19. Step 2: Access & Move Virtually Any Data
One tool to quickly and securely move all your data, big or small. No coding, no scripting.
Connect to any source & target:
• RDBMS, mainframe, files, cloud, appliances, XML
Extract & load to/from Hadoop:
• Extract data and load it into the cluster natively from Hadoop, or execute "off-cluster" on an ETL server
• Load data warehouses directly from Hadoop; no need for temporary landing areas
Plus, mainframe connectivity:
• Directly read mainframe data, parse & translate it, and load it into HDFS
Pre-process & compress:
• Cleanse, validate, and partition for parallel loading
• Compress for storage savings
20. Step 3: Offload Heavy Transformations to Hadoop
Easily replicate & optimize existing workloads in Hadoop. No coding. No scripting.
• Develop MapReduce ETL processes without writing code
• Leverage existing ETL skills
• Develop and test locally in Windows; deploy in Hadoop
• Use Case Accelerators to fast-track development (sort, join + aggregate, copy, merge)
• File-based metadata: create once, reuse many times
• Development accelerators for CDC and other common data flows
21. Demo
22. Appfluent Offload Success: Large Financial Organization
Situation
• IBM DB2 Enterprise Data Warehouse (EDW) growing too quickly
• DB2 EDW upgrade/expansion too expensive
• Found the cost per terabyte of Hadoop to be 5x less than DB2 (fully burdened)
Solution
• Created a business program called "Data Warehouse Modernization"
• Deployed Cloudera to extend EDW capacity
• Used Appfluent to find migration candidates to move to Hadoop
Benefits
• Capped the DB2 EDW at 200 TB and has not expanded it since
• Saved millions of dollars that would have been spent on additional DB2
• Positioned to handle faster rates of data growth in the future
23. Offloading the EDW at a Leading Financial Organization
• Offload ELT processing from Teradata into CDH using DMX-h
• Implement a flexible architecture for staging and change data capture
• Ability to pull data directly from the mainframe
• No coding; easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows
(Chart: elapsed time of 360 min in HiveQL vs. 15 min in DMX-h; development effort of 12 man-weeks vs. 4.)
Impact on the loans application project:
• Cut development time to a third (12 man-weeks down to 4)
• Reduced complexity: from 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated the need for Java user-defined functions
• 24x faster!
24. Three Quick Takeaways
1. ELT and dormant data are driving data warehouse cost and capacity constraints
2. Offloading heavy transformations and "cold" data to Hadoop provides fast savings at minimum risk
3. Follow these three steps:
   a. Identify dormant data and pinpoint heavy ELT workloads; focus on the top 20%
   b. Access and move data to Hadoop
   c. Deploy new workloads in Hadoop
25. The Data Warehouse Vision: A Single Version of The Truth
(Diagram repeated from slide 2: sources feed ETL into the Enterprise Data Warehouse and the Data Marts.)
26. Next Steps
Sign up for a Data Warehouse Offload assessment: http://bit.ly/DW-assessment
Our experts will help you:
• Collect critical information about your EDW environment
• Identify migration candidates & determine feasibility
• Develop an offload plan & establish a business case
