Offload the Data Warehouse in the Age of Hadoop


Join Appfluent and Syncsort to learn why Hadoop means more data savings and less data warehouse. In this presentation, you’ll discover how to easily offload storage and processing to Hadoop to save millions.

Published in Technology
Transcript

  • 1. Santosh Chitakki, Vice President, Products at Appfluent (schitakki@appfluent.com); Steve Totman, Director of Strategy at Syncsort (@steventotman, stotman@syncsort.com). Presentation + Demo
  • 2. The Data Warehouse Vision: A Single Version of the Truth. [Diagram: sources — Oracle, files, XML, ERP, mainframe, real-time feeds — flow through ETL into an Enterprise Data Warehouse, which feeds downstream Data Marts.]
  • 3. The Data Warehouse Reality: a small sample of structured data, somewhat available; takes months to make any changes or additions; costs millions every year. [Diagram: the same architecture as slide 2, now strained by ELT jobs, new report requests, dead data, missed SLAs, new-column requests, and demands for granular history.]
  • 4. ELT Processing Is Driving Exponential Database Costs. The true cost of ELT: manual coding/scripting costs, ongoing manual tuning costs, higher storage costs, degraded query performance, and reduced business agility. [Chart: queries (analytics) competing with transformations (ELT) for database capacity.] And what if the batch window is delayed or needs to be re-run? What if demand increases, causing more overlap between queries and the batch window? What if a critical business requirement results in longer, heavier queries?
  • 5. Dormant Data Makes the Problem Even Worse. [Chart: hot/warm/cold data tiers, with ELT transformations and storage capacity consumed by unused data.] The majority of data in the data warehouse is unused or dormant; ETL/ELT processes for unused data unnecessarily consume CPU capacity, and dormant data ties up storage. The remedies: eliminate batch loads that are no longer needed, and load and store unused data in Hadoop for active archival.
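The hot/warm/cold tiering described above can be sketched in a few lines. This is an illustrative sketch only, not Appfluent's method: the table names, last-access dates, and the 30/180-day thresholds are all assumptions, and in practice the access dates would come from the warehouse's query log.

```python
from datetime import date

# Hypothetical "today" for the audit window.
TODAY = date(2014, 1, 1)

def classify(last_accessed, today=TODAY):
    """Bucket a table as hot/warm/cold by days since last access."""
    age = (today - last_accessed).days
    if age <= 30:
        return "hot"
    if age <= 180:
        return "warm"
    return "cold"  # candidate for active archival in Hadoop

# Hypothetical table -> last-accessed map (would come from the query log).
tables = {
    "daily_sales":   date(2013, 12, 28),
    "q2_campaign":   date(2013, 9, 15),
    "legacy_orders": date(2011, 4, 2),
}
buckets = {name: classify(d) for name, d in tables.items()}
```

Tables landing in the "cold" bucket are the offload candidates; their batch loads can be eliminated or redirected to Hadoop.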
  • 6. The Impact of ELT & Dormant Data. Missed SLAs: slow response times; with 40-60% of capacity consumed by ELT, fewer resources and less storage are available for end-user reports. Data retention windows: only the freshest data is stored online; historical data is archived (sometimes after as little as 3 months) and granularity is lost across hot/warm/cold/dead tiers. Lack of agility: 6 months on average to add a new data source or column and generate a new report, with the best resources spent on SQL tuning rather than new SQL creation. Constant upgrades: data volume growth absorbs all resources just to keep existing analysis running and perform upgrades, leaving data exploration a wish-list item.
  • 7. Offloading the Data Warehouse to Hadoop. [Diagram, before: data sources feed ETL and ELT into the data warehouse, which also serves analytic query and reporting. After: ETL/ELT is offloaded to Hadoop, leaving the data warehouse focused on analytic query and reporting.]
  • 8. 20% of ETL Jobs Can Consume up to 80% of Resources. ETL is "T" intensive: sort, join, merge, aggregate, partition. Mappings start simple, but performance demands add complexity and business logic gets "distributed." The result is a "spaghetti" architecture that is impossible to govern and prohibitively expensive to maintain. For high impact, start with the greatest pain: focus on the 20%.
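The 20/80 claim above is a Pareto observation: rank ETL jobs by resource consumption and a small head of the list usually dominates the load. A minimal sketch, with hypothetical job names and CPU-second figures standing in for real scheduler-log data:

```python
# Hypothetical per-job CPU-seconds from an ETL scheduler log.
jobs = {
    "load_fact_sales": 9200, "agg_daily_rev": 7400, "merge_cdc": 610,
    "sort_dim_cust": 480, "copy_ref_codes": 130, "load_xml_feed": 90,
    "partition_logs": 55, "validate_addr": 30, "load_fx_rates": 10,
    "purge_temp": 5,
}

# Rank jobs by cost, take the top 20%, and measure their share of the load.
ranked = sorted(jobs.items(), key=lambda kv: kv[1], reverse=True)
top20 = ranked[: max(1, len(ranked) // 5)]
share = sum(cost for _, cost in top20) / sum(jobs.values())
```

With these illustrative numbers, two of the ten jobs account for over 90% of CPU time: those are the first candidates to offload.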
  • 9. The Opportunity: Transform the Economics of Data. Cost of managing 1 TB of data: $15,000 – $80,000 in the EDW versus $2,000 – $6,000 in Hadoop. But there's more: scalability for longer data retention, performance SLAs, and business agility.
  • 10. Why Appfluent? Appfluent transforms the economics of Big Data and Hadoop. We are the only company that can completely analyze how data is used to reduce costs and optimize performance.
  • 11. Appfluent Visibility: Uncover the Who, What, When, Where, and How of your data.
  • 12. Why Syncsort? For 40 years we have been helping companies solve their big data issues, even before they knew the name Big Data! Speed leader in Big Data processing: the fastest sort technology in the market, powering 50% of mainframe sorts. A history of innovation: first-to-market, fully integrated approach to Hadoop ETL, with 25+ issued and pending patents. Large global customer base: 15,000+ deployments in 68 countries. Our customers are achieving the impossible, every day! [Key partner logos shown.]
  • 13. Syncsort DMX-h: Enabling the Enterprise Data Hub. Blazing performance, iron security, disruptive economics. Access: one tool to access all your data, even mainframe. Offload: migrate complex ELT workloads to Hadoop without coding. Accelerate: seamlessly optimize new and existing batch workloads in Hadoop. Plus: smarter architecture (the ETL engine runs natively within MapReduce), smarter productivity (Use Case Accelerators for common ETL tasks), and smarter security (enterprise-grade security).
  • 14. How to Offload Workload & Data: (1) identify costly transformations and dormant data; (2) rewrite transformations in DMX-h, identify performance opportunities, and move dormant-data ELT to Hadoop; (3) run the costliest transformations in Hadoop and store and manage dormant data there; (4) repeat regularly for maximum results.
  • 15. 1. Identify Expensive Transformations. Identify expensive transformations such as ELT to offload to Hadoop. Identify unused tables to find the useless transformations loading them, then move them to Hadoop or purge them. Identify unused historical data (by the date functions used) and move both the loads and the data to Hadoop. Discover costly end-user activity and redirect those workloads to Hadoop.
  • 16. Costly End-User Activity. Find resource-consuming end-user workloads and offload the data sets and activity to Hadoop. Example: identify SAS data extracts (i.e., SAS queries with no WHERE clause). In this case, SAS data extracts were identified consuming 300 hours of server time; the data sets associated with those extracts were identified, then replicated in Hadoop, and the associated SAS workload was offloaded.
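The no-WHERE-clause heuristic mentioned above can be sketched as a simple query-log filter. The `is_full_extract` helper and the sample log are hypothetical; a production tool would parse SQL properly rather than rely on a regex:

```python
import re

def is_full_extract(sql):
    """Rough heuristic: flag SELECT statements with no WHERE clause
    as full-table extracts (the SAS-style pulls described above)."""
    s = sql.strip().lower()
    return s.startswith("select") and not re.search(r"\bwhere\b", s)

# Hypothetical query-log sample.
log = [
    "SELECT * FROM sales_history",
    "SELECT cust_id, amt FROM orders WHERE order_dt >= '2013-01-01'",
    "SELECT * FROM clickstream",
]
extracts = [q for q in log if is_full_extract(q)]
```

The tables named in the flagged queries are the data sets to replicate in Hadoop so the extract workload can follow them there.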
  • 17. Expensive Transformations. Identify expensive transformations such as ELT to offload to Hadoop. Example: an ELT process consuming 65% of CPU time and 66% of I/O; drill into the process to identify the expensive transformations to offload.
  • 18. Unused Data. Identify unused tables to move to Hadoop, and offload the batch loads for that unused data into Hadoop. Example findings: 87% of tables unused; the largest unused table holds 2 billion records; unused columns within tables.
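At its simplest, finding unused tables is a set difference between the catalog and the tables referenced in the query log over the audit window. A sketch with hypothetical table names:

```python
# Hypothetical inputs: all tables from the catalog, and tables that
# appeared in any query during the audit window (from the query log).
catalog = {"orders", "customers", "clicks_2009", "stg_tmp", "returns"}
referenced = {"orders", "customers"}

# Tables never touched in the window are offload/purge candidates.
unused = catalog - referenced
pct_unused = len(unused) / len(catalog)
```

The same idea extends to columns (catalog columns minus columns referenced in parsed SQL), which is how unused columns within otherwise-active tables are found.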
  • 19. 2. Access & Move Virtually Any Data. One tool to quickly and securely move all your data, big or small; no coding, no scripting. Connect to any source and target: RDBMS, mainframe, files, cloud, appliances, XML. Extract and load to/from Hadoop: extract data and load it into the cluster natively from Hadoop, or execute "off-cluster" on an ETL server; load data warehouses directly from Hadoop with no need for temporary landing areas. Plus mainframe connectivity: directly read mainframe data, parse and translate it, and load it into HDFS. Pre-process and compress: cleanse, validate, and partition for parallel loading, and compress for storage savings.
  • 20. 3. Offload Heavy Transformations to Hadoop. Easily replicate and optimize existing workloads in Hadoop; no coding, no scripting. Develop MapReduce ETL processes without writing code, leveraging existing ETL skills. Develop and test locally in Windows, then deploy in Hadoop. Use Case Accelerators fast-track development of common data flows such as sort, join, aggregate, copy, merge, and CDC. File-based metadata: create once, reuse many times!
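The sort/join/aggregate transformations named above map directly onto MapReduce's map, shuffle/sort, and reduce phases. Below is a minimal in-process sketch of a per-region revenue aggregation; it is illustrative only (the data and field names are invented), and a tool like DMX-h would generate the equivalent cluster job without hand-coding:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input rows: (region, revenue).
rows = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("north", 55.0)]

def mapper(row):
    """Map phase: emit (key, value) pairs."""
    region, revenue = row
    yield region, revenue

mapped = [kv for row in rows for kv in mapper(row)]
mapped.sort(key=itemgetter(0))  # stand-in for the shuffle/sort phase

def reducer(key, values):
    """Reduce phase: aggregate all values for one key."""
    return key, sum(values)

totals = dict(
    reducer(k, (v for _, v in grp))
    for k, grp in groupby(mapped, key=itemgetter(0))
)
```

On a real cluster the mappers and reducers run in parallel across nodes, with the framework handling the sort and the key-wise grouping between phases.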
  • 21. Demo
  • 22. Appfluent Offload Success: Large Financial Organization. Situation: the IBM DB2 Enterprise Data Warehouse (EDW) was growing too quickly; a DB2 EDW upgrade/expansion was too expensive; the cost per terabyte of Hadoop was found to be 5x less than DB2 (fully burdened). Solution: created a business program called 'Data Warehouse Modernization'; deployed Cloudera to extend EDW capacity; used Appfluent to find migration candidates to move to Hadoop. Benefits: capped the DB2 EDW at 200 TB and has not expanded it since; saved $MM that would have been spent on additional DB2; positioned to handle faster rates of data growth in the future.
  • 23. Offloading the EDW at a Leading Financial Organization. Offloaded ELT processing from Teradata into CDH using DMX-h; implemented a flexible architecture for staging and change data capture; gained the ability to pull data directly from the mainframe; no coding, so the result is easier to maintain and reuse; enabled developers with a broader set of skills to build complex ETL workflows. [Chart: elapsed time of 360 minutes for HiveQL vs. 15 minutes for DMX-h; development effort of 12 man-weeks for HiveQL vs. 4 man-weeks for DMX-h.] Impact on the loans application project: cut development time to one-third; reduced complexity, from 140 HiveQL scripts to 12 DMX-h graphical jobs; eliminated the need for Java user-defined functions; 24x faster!
  • 24. Three Quick Takeaways: (1) ELT and dormant data are driving data warehouse cost and capacity constraints. (2) Offloading heavy transformations and "cold" data to Hadoop provides fast savings at minimum risk. (3) Follow these three steps: (a) identify dormant data and pinpoint heavy ELT workloads, focusing on the top 20%; (b) access and move the data to Hadoop; (c) deploy the new workloads in Hadoop.
  • 25. The Data Warehouse Vision: A Single Version of the Truth. [Recap of the slide 2 diagram: sources flow through ETL into the enterprise data warehouse and out to data marts.]
  • 26. Next Steps: Sign up for a Data Warehouse Offload assessment! http://bit.ly/DW-assessment Our experts will help you collect critical information about your EDW environment, identify migration candidates and determine feasibility, and develop an offload plan and establish a business case.