Data Warehouse Optimization


Published in: Technology

  • In this session, we will explore using Hadoop to address questions and issues surrounding the cost of storage, the value of accessibility, and getting the maximum return on your IT investments and all of your data.
  • Tie workloads to data types
  • Data Warehouse Optimization

    1. Data Warehouse Optimization
    2. Finding Business Pains
       • Frequent or near-term EDW expansion/spend
       • Short time windows for data
       • SLA challenges with ELT
       • Reports/analytics that are “too big”
       • Compliance issues requiring long-term storage AND query
       • Resource restrictions/contention or disenfranchised/frustrated users
    3. Common Challenges with the Data Warehouse
       [Diagram labels: OLTP, Enterprise Applications, Extract, Transform, Load, Data Warehouse, Query, Business Intelligence]
       Numbered callouts:
       1. Slow data transformations, missed SLAs.
       2. Slow queries, poor QoS and missed opportunities.
       3. Wrong or incomplete, modified copies are made.
       4. Must archive. Archived data can’t provide value.
       5. Constant pressure to buy additional warehouse capacity, just to maintain current quality of service.
       NO room to expand use cases. NO room to innovate.
    4. An EDH Complements the Data Warehouse
       [Diagram labels: OLTP, Enterprise Applications, Extract, Load, Cloudera, Transform, Query, Data Warehouse, Business Intelligence]
       Numbered callouts:
       1. Data loaded when & where it’s needed.
       2. Empowered business analysts.
       3. Avoid “spreadmarts” across departments.
       4. Complete view of all your products, customers, etc.
       5. Cost-effective, infinitely scalable, production-ready enterprise data hub for all your data.
       All data. All users.
    5. Hadoop as a Data Warehouse???
    6. 2014 Gartner MQ for Data Warehouse DBMS
       “A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
    7. Thinking About Optimization
    8. Understanding Benefits for Your Organization
       • Help you assess your enterprise data warehouse ecosystem
       • Identify viable migration candidates and a target reference architecture
       • Develop a project plan to deliver the full scope of benefits
       • Understand the business case for making the investment
    9. Working With You Through the EDW Assessment Process
       • Information
         • Collect information about your EDW environment
       • Analysis
         • Identify migration candidates
         • Determine feasibility
       • Recommendations
         • Develop a migration plan
         • Establish a business case
    10. Identifying Sources and Workloads
    11. Key Hadoop Platform Requirements
        • High availability
        • Disaster recovery
        • Downtime-less upgrades
        • Auditability
        • Low-latency SQL & BI support
        • Deep SAS & R support
    12. Customers Agree: Cloudera Delivers
        Each entry lists Customer (Workload): Results.
        • Leading Payments Company (Analytics, ETL Processing, DR): largest fraud discovery in firm history; time to report collapsed from 2 days to 2 hours; saved $30M on DR
        • Global Money Center Bank (Data Processing (ELT)): avoided tens of millions in expansion purchases; 42% faster processing
        • Mobile Device Manufacturer (Data Processing (ELT)): offloaded 90% of data volume; keeps all data
        • Fortune 500 Retailer (Analytics): more insights by supporting more exploration of more extensive & granular data
        • Leading Financial Regulator (Data Processing (ELT) and DR): shrank EDW footprint by 4 PB; 20X performance boost
    13. Assessing Workloads and Data
        [Diagram: DATA WAREHOUSE. Workloads: Operational Business Intelligence, Analytics, Self-Service BI, Data Processing (ELT). Data: Staged Data, Operational Data, Archival Data]
        • Data Processing (ELT)
          • Staged data, to be processed
          • Temp tables, BLOB/CLOB types, etc.
        • Analytics / Machine Learning
          • Deep and broad data sets, within and beyond the warehouse
        • Self-Service BI (Ad-Hoc Query)
          • Operational data, actively used for BI
          • Archival data, inactively used for BI
    14. Offload Data Processing (ELT)
        What to Migrate
          • High-scale batch data processing
          • Implemented as SQL + scripting or ETL running on expensive HW infrastructure
          • Staging data stored across diverse temp tables
        Influencing Factors
          • High fraction of overall EDW utilization (25–80%)
          • Difficult to store and manage staging data in relational form
          • Limited user adoption risk to migrate
          • ETL tools to simplify migration
        Better in Cloudera
          • Over 2X the performance
          • 1/10th the cost
        Only Available with Cloudera Partners
          • Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
          • Low-latency SQL processing, ability to absorb short-cycle ELT
          • Broad support of leading data integration tools
    15. Offload Self-Service Business Intelligence
        Workload
          • Self-Service BI, Exploratory BI, Data Discovery
          • Uncertain business questions and uncertain data
        Migration Priority
          • Fastest growing workload for many warehouses
          • Comparable support for end user tools between Cloudera and DBMS products
        Better in Cloudera
          • Schema flexibility
          • End user self-service on full fidelity data
          • 1/10th the cost
        Only Available with Cloudera Partners
          • Open source parallel interactive SQL engine: Cloudera Impala
          • Integration and certification of every leading SSBI vendor
    16. Offload Analytics / Machine Learning
        Workload and Data
          • Training & scoring predictive models
          • Deep and broad data sets, within and beyond the warehouse
        Influencing Factors
          • Statisticians want unconstrained analysis; limited DW compute resources
          • Paying top dollar for warehouse data storage only to load into ML tools
          • Inability to analyze data beyond the warehouse
        Better in Cloudera
          • Greater user productivity (pre-packaged ML libraries, no more down-sampling)
          • Support for 3rd party ML tools
          • Greater flexibility (SQL + MR + SAS procs)
          • 1/10th the cost
        Only Available with Cloudera Partners
          • Ability to run SAS, R natively on the same cluster
          • Interactive search and SQL experience for data exploration
          • Built-in analytics libraries (Mahout, DataFu, ClouderaML)
          • Support from Cloudera’s Data Science team
    17. Sample Cloudera Tools for Assisting Migration
        • High-speed connector: moves data between the two systems
        • Data definition: tool for mapping EDW tables & datatypes to Hive tables & datatypes
        • Mainframe input / output format: supports direct feed of mainframe data into Cloudera
        • Result validation: verifies SQL applications in Cloudera produce the same results as the original applications
        • Support for SQL-H (planned): remote queries from EDW to Cloudera
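The result-validation idea on the "Sample Cloudera Tools for Assisting Migration" slide can be illustrated with a small sketch. This is not the Cloudera tool itself; it is a minimal Python example under stated assumptions: both systems return rows as tuples, row order does not matter, and floats are compared after rounding to an assumed precision (`FLOAT_DIGITS`). The function names are hypothetical.

```python
from collections import Counter

FLOAT_DIGITS = 6  # assumed precision; engines may differ in the last digits of a float


def normalize(row):
    """Round float columns so tiny engine-specific differences are ignored."""
    return tuple(round(v, FLOAT_DIGITS) if isinstance(v, float) else v for v in row)


def diff_results(edw_rows, hadoop_rows):
    """Compare two result sets as multisets (row order is ignored, duplicates count)."""
    edw = Counter(normalize(r) for r in edw_rows)
    hadoop = Counter(normalize(r) for r in hadoop_rows)
    return {
        "missing_in_hadoop": list((edw - hadoop).elements()),
        "extra_in_hadoop": list((hadoop - edw).elements()),
    }


def results_match(edw_rows, hadoop_rows):
    """True when both systems returned the same rows the same number of times."""
    d = diff_results(edw_rows, hadoop_rows)
    return not d["missing_in_hadoop"] and not d["extra_in_hadoop"]


if __name__ == "__main__":
    edw = [(1, "a", 10.0), (2, "b", 20.5)]
    hadoop = [(2, "b", 20.5000000001), (1, "a", 10.0)]
    print(results_match(edw, hadoop))
    print(diff_results(edw, [(1, "a", 10.0)]))
```

Comparing as multisets rather than sorted lists keeps duplicate rows significant while avoiding any dependence on how each engine orders its output.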
    18. Groundwork for Optimization
    19. Is Your Data Architecture Aligned to Your Use Case?
        Lay the Foundation for Data Migration and Ensure Success
        • Install and configure CDH and Cloudera Manager
        • Run standard and specialized performance tests
        • Recommend tuning, compression and decompression, and scheduler configurations
        • Document recommended cluster configuration
        • Train and certify Hadoop administrators
    20. How Quickly and Securely Can You Transition Your Data?
        Migrate Disparate Data Sources to Boost Performance
        • Collect low-efficiency data from various silos
        • Redeploy latent data from EDWs, RDBMSs, and Hadoop environment
        • Develop, test, and implement data processing jobs
        • Integrate Hadoop with relevant external systems
        • Document workload migration
    21. Is Your Operational Environment Ready for Handover?
        Maximize ROI by Rationalizing All Systems, Teams, and Workloads
        • Review current and future requirements
        • Review full ecosystem, all jobs, and regular processes
        • Review application architecture, ingestion pipeline, data schema, and data partitioning system
        • Review key management and monitoring processes and relevant production procedures
        • Recommend additional training to assure Hadoop expertise on management and operations teams
        • Document cluster configuration, solutions implementation, and production recommendations
    22. How Much Additional Value Can You Capture Long-Term?
        Ongoing Optimization Is Key to Deferring Additional Cost
        • Expand framework without expanding footprint
        • Rationalize beyond initial burn-in period
        • Evolve cluster to support additional use cases
        • Benchmark performance annually as a diagnostic
        • Balance business opportunity against technical risk
    23. Building the Optimization Plan
    24. Prioritizing Workloads and Data
        1. Current EDW Constraints
           • Focus on computation constraints
           • Focus on disk space constraints
        2. Workload Transferability
           • Similar or same SQL functionality
           • Similar or same tools support
           • Opportunity for performance gains
        3. User Communities
           • Group related workloads by user community
           • Migrate one community at a time
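The three criteria on the "Prioritizing Workloads and Data" slide (EDW constraints, transferability, user communities) could be combined into a simple ranking. The sketch below is illustrative only: the `Workload` fields, the 0-to-1 scales, and the scoring formula are all assumptions, not a method described in the deck.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    community: str          # user community that owns the workload
    cpu_share: float        # assumed: fraction of EDW compute consumed (0..1)
    disk_share: float       # assumed: fraction of EDW disk consumed (0..1)
    transferability: float  # assumed: 0..1 estimate of SQL/tool compatibility


def score(w):
    """Hypothetical score: constraint relief weighted by how easily it transfers."""
    return (w.cpu_share + w.disk_share) * w.transferability


def migration_order(workloads):
    """Group workloads by community, then rank communities by total score."""
    by_community = defaultdict(list)
    for w in workloads:
        by_community[w.community].append(w)
    ranked = sorted(
        by_community.items(),
        key=lambda item: sum(score(w) for w in item[1]),
        reverse=True,
    )
    # Within each community, migrate the highest-scoring workloads first
    return [(name, sorted(ws, key=score, reverse=True)) for name, ws in ranked]


if __name__ == "__main__":
    plan = migration_order([
        Workload("nightly_elt", "finance", 0.30, 0.10, 0.9),
        Workload("adhoc_reports", "marketing", 0.10, 0.05, 0.8),
        Workload("gl_staging", "finance", 0.05, 0.20, 0.5),
    ])
    for community, ws in plan:
        print(community, [w.name for w in ws])
```

Ranking whole communities rather than individual workloads follows the slide's advice to migrate one community at a time.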
    25. The Optimization Process
        • Profile
          • Analyze all of the workload in your data warehouse: queries, objects, user communities
        • Prioritize
          • Framework-driven methodology for ordering workloads
          • Balance financial opportunity with business risk
        • Migrate
          • Set up data ingest paths to Cloudera
          • Map EDW workload to Cloudera
        • Validate
          • Verify results
          • Evaluate performance differences & tune
          • Side-by-side “burn in” period
          • Cut-over
        Repeat annually to defer additional expansion.
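The Profile step of "The Optimization Process" (queries, objects, user communities) amounts to summarizing a warehouse query log. A minimal sketch follows, assuming a hypothetical log format in which each record carries a `community`, the `tables` it touched, and its `cpu_seconds`; real EDW query logs will use different field names.

```python
from collections import defaultdict


def profile(query_log):
    """Summarize query count, CPU time, and touched objects per user community."""
    summary = defaultdict(lambda: {"queries": 0, "cpu_seconds": 0.0, "objects": set()})
    for rec in query_log:
        entry = summary[rec["community"]]
        entry["queries"] += 1
        entry["cpu_seconds"] += rec["cpu_seconds"]
        entry["objects"].update(rec["tables"])
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical log records for illustration only
    log = [
        {"community": "finance", "tables": ["gl", "staging_gl"], "cpu_seconds": 12.5},
        {"community": "finance", "tables": ["gl"], "cpu_seconds": 7.5},
        {"community": "marketing", "tables": ["clicks"], "cpu_seconds": 3.0},
    ]
    for community, stats in profile(log).items():
        print(community, stats["queries"], stats["cpu_seconds"], sorted(stats["objects"]))
```

A per-community summary like this gives the Prioritize step something concrete to order, since it shows which groups drive the most load and which objects they depend on.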
    26. Sample EDW Rationalization Process
        [Timeline chart: four quarters (Initial, Second, Third, Fourth), months M1 to M12; rows grouped under People, Process, and Technology]
        Teams
          • Program Management: responsible for overall program success, resource assignment, project management, and risk mitigation
          • Cloudera Migration Teams: expert resources delivering initial project framework and advanced implementation releases
          • ${Customer} Migration Teams: customer staff resources, taking on increasing responsibility for release implementation over time
        Chart activities: Management & Risk Mitigation; Initial EDW Assessment; Architecture Oversight; Assessment and Stratification Process; Detailed Workload Analysis; Implement Reference Architecture; Establish Repeatable Migration Approach; Enhance SDLC, Release, and Configuration Management Processes
        Migration SDLC stages: Assignment/Kick-off, Execution, Testing, User Acceptance, Documentation, Sign-off
        Releases shown: Release 1 through Release 5, continuing to Release N
    27. Workload Classification
        • Cloudera Architecture: implementing Cloudera’s reference architecture(s) and building the environment to fit unique customer requirements
        • Data Ecosystem Integration: BI, ETL, and other applications that require integration with the big data platform, including the existing EDW
        • Data Processing: high-scale batch data processing, implemented as SQL + scripting or via ETL tools; staging data stored across diverse temp tables
        • Self-Service BI: exploratory BI, data discovery, uncertain business questions and uncertain data
        • Analytics: training & scoring predictive models; deep and broad data sets (within and beyond the warehouse)
        • Archival Processes: traditional archive storage and processes
    28. Workload Complexity
        • Basic
          • Leverages pre-existing architecture and integrations
          • Utilizes all off-the-shelf components
          • Repeatable solutions from existing training/documentation
        • Moderate
          • Requires minimal modifications to existing architecture, integrations, or other dependencies
          • Some expertise required for new design decisions
        • Advanced
          • Establishing new reference architectures
          • Several new design decisions involved
          • Unique skillsets required (e.g., machine learning)
    29. Sample Complexity vs. Time for Various Project Types
        [Chart: Complexity of Task (Low, Moderate, High) against Estimated Phase (1 to 4)]
        Project types plotted: Machine Learning Modeling; Graph Analytics Modeling; Hadoop cluster install/config; One-off ingest/ETL processes; Predictive Analytics Modeling; Production Certification; Hadoop storage schemas; Decision tree/forest/ensemble; Data Pipelining; Generic ingest/ETL processes
    30. Mapping Resources to Project Task Type
        [Chart: Complexity of Task (Low, Moderate, High) against Estimated Phase (1 to 4)]
        Roles plotted: Data Scientist; Senior Architect; Consultant Architect; Principal Architect
    31. Typical Big Data COE Program Roles
        Staff Centrally and Train to Scale
        • Management & Leadership: Big Data Visionary, Executive Sponsor, Program Manager
        • Technology & Ops: Developers, Admin, Data Warehouse Specialist, Architects
        • Business & Data: Lead Data Scientist, Lead Business Analyst, LOB Reps, Data Wranglers
    32. Benefits Summary
        1. Lower costs of data management, growth
        2. Improve quality of service
           • Meet critical data processing SLAs
           • Faster BI queries
        3. Extend existing warehouse capacity
           • Increase ROI from current investments
           • More operational data – volume and schemas
           • More business intelligence and analytics workloads
        4. Retain all data for analysis
        5. Deliver a foundation for innovation
           • Bring more applications to Hadoop data for low incremental cost
    33. The Experts Agree
    34. Questions?