Enterprise Data Warehouse Optimization: 7 Keys to Success

You have a legacy system that no longer meets the demands of your current data needs, and replacing it isn’t an option. But don’t panic: modernizing your traditional enterprise data warehouse is easier than you may think.

Published in: Technology

  1. © Hortonworks Inc. 2011–2017. All Rights Reserved
     Scott Gnau, CTO, Hortonworks (@Scott_Gnau)
     David Loshin, President, Knowledge Integrity (loshin@knowledge-integrity.com)
  2. Legacy Architectures Impede Performance
     • Data warehouse performance is no longer solely defined in terms of computation speed
     • Optimal performance reflects the ability to maximize value across a range of dimensions
     • The static design of legacy platforms has not kept pace with the growing desire for business intelligence and analytics
     [Diagram: EDW performance dimensions: capital costs, operations costs, scalability, analytic flexibility, time to value, data quality, data variety]
     © 2017 Knowledge Integrity, Inc, loshin@knowledge-integrity.com, (301) 754-6350
  3. Step 1: Leverage Horizontal Scalability
     • DW appliances require significant capital investment
       – System must be sized to meet anticipated needs
       – Allows for unused capacity at the beginning
       – Requires increased “step-up” investments at regular intervals
     • Hadoop finesses this challenge
       – Relies on commodity components
       – Start with what you need, grow with increased demand
       – Introduce newer hardware seamlessly
       – Exploit innovations to speed performance (e.g., Stinger.next, Low Latency Analytical Processing)
     [Diagram: four racks, each with a rack switch, a NameNode, and four DataNode & TaskTracker nodes]
  4. Step 2: Augment EDW Storage with Hive
     • The value of existing EDW investments can be extended using a hybrid architecture
     • Hive continues to evolve with innovative performance improvements:
       – In-memory caching and persistent query executors
       – Column-oriented distributed data organization
       – Improved security using Apache Ranger
       – SQL ACID Merge
     [Diagram: Hadoop cluster alongside the EDW]
  5. Step 3: Increase Data Flexibility
     • Conventional data warehouse architectures are organized using a dimensional model
       – Facts represent events
       – Dimensions characterize the facts
     • The dimensional model is suited to typical DW operations
       – Aggregation and rolled-up reporting
       – “Slice and dice”
     • However, this model forces all data into a predetermined schema (“schema-on-write”)
       – Introduces bias, creates constraints, and limits data flexibility
     • Alternative: schema-on-read
       – Data sets are captured in their source formats
       – Frees data consumers to apply their own organization
       – Allows logical structure to be layered on top of data in source format
       – Enables use of creative algorithms for analytics, text mining, and machine learning
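To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not how Hive or any Hadoop component implements it: the event records, field names, and the `read_with_schema` helper are all hypothetical. The raw records are kept in their source format, and each consumer layers its own structure on top at read time.

```python
import json

# Hypothetical raw events, kept exactly as the source system emitted them.
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "item": "A", "ms": 120}',
    '{"user": "u2", "action": "buy", "item": "A", "price": 9.5}',
    '{"user": "u1", "action": "buy", "item": "B", "price": 4.0}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto the caller's schema at read time.

    schema maps an output field name to a (source key, default value) pair.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {out: record.get(src, default) for out, (src, default) in schema.items()}


# Two consumers apply different structure to the same stored data.
SALES_SCHEMA = {"item": ("item", None), "revenue": ("price", 0.0)}
CLICK_SCHEMA = {"user": ("user", None), "action": ("action", None)}


def revenue_by_item(raw_lines):
    """Sales view: total revenue per item; records without a price count as 0."""
    totals = {}
    for row in read_with_schema(raw_lines, SALES_SCHEMA):
        totals[row["item"]] = totals.get(row["item"], 0.0) + row["revenue"]
    return totals


def actions_by_user(raw_lines):
    """Behavior view: ordered list of actions per user."""
    actions = {}
    for row in read_with_schema(raw_lines, CLICK_SCHEMA):
        actions.setdefault(row["user"], []).append(row["action"])
    return actions
```

Neither view required the stored records to change, which is the flexibility the slide contrasts with schema-on-write.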
  6. Step 4: Use Unstructured Data
     • Data warehouses are engineered around structured data
     • Many sources produce a growing volume of unstructured data
       – Apps running on Internet-connected devices generate text streams
       – Machine-generated unstructured content
       – Semi-structured sources
     • Applications that consume both structured and unstructured data provide fuller visibility into analytical results
     • Tools like Lucene, Solr, Mahout, and other text analytics libraries help to parse and tag unstructured text
     [Diagram: Ingest → Parse → Tag → Organize, supported by Lucene, Solr, and Mahout]
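The parse, tag, and organize stages can be sketched with nothing but the Python standard library. This is a toy stand-in, not the Lucene, Solr, or Mahout APIs: the tagging rules, sample documents, and function names below are hypothetical and chosen only to show the shape of the pipeline.

```python
import re
from collections import defaultdict

# Hypothetical tagging rules: tag name -> regular expression.
TAG_RULES = {
    "error": re.compile(r"\b(error|failed|timeout)\b", re.IGNORECASE),
    "payment": re.compile(r"\b(payment|invoice|refund)\b", re.IGNORECASE),
}


def parse(text):
    """Split raw text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag(text):
    """Return the sorted list of tags whose rule matches the text."""
    return sorted(name for name, rule in TAG_RULES.items() if rule.search(text))


def organize(documents):
    """Build an inverted index (token -> doc ids) and a tag index (tag -> doc ids)."""
    token_index = defaultdict(set)
    tag_index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in parse(text):
            token_index[token].add(doc_id)
        for name in tag(text):
            tag_index[name].add(doc_id)
    return token_index, tag_index


# Hypothetical ingested text snippets.
DOCS = {
    1: "Payment failed: gateway timeout",
    2: "User requested a refund for invoice 1042",
    3: "Device heartbeat OK",
}
token_index, tag_index = organize(DOCS)
```

Once text is tagged and indexed this way, the document ids can be joined against structured records, which is what lets an application consume both kinds of data together.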
  7. Step 5: Data Discovery
     [Diagram: Data Ingestion & Transformation]
     • Data imported into the data warehouse is homogenized and organized within predefined data models
     • This constrains downstream consumers
  8. Step 5: Data Discovery
     [Diagram: five parallel Data Discovery & Preparation stages]
     • Data discovery allows each user to configure the data for their specialized purposes
  9. Step 6: Offload ETL to Hadoop
     • 60-70% of the effort of data warehousing is attributed to extraction, transformation, and loading (ETL)
     • Hadoop is a natural platform for ETL processing:
       – ETL is inherently data parallel, enabling faster execution
       – Development time can be drastically reduced with a faster dev/test/debug cycle
       – Resources can be dynamically apportioned and released when ETL processing is completed, lowering costs
     • Apache Hive supports SQL ACID Merge, which handles inserts, updates, and deletes in a single pass
     • Allows for in-database transformations without the need for massive refreshes
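The single-pass behavior of a merge can be illustrated with an in-memory sketch. This is not Hive code and says nothing about how Hive executes a merge; the `merge` function, the `deleted` flag convention, and the sample rows are assumptions made for illustration. The comments name the standard SQL MERGE clause each branch corresponds to.

```python
def merge(target, changes, key="id"):
    """Apply inserts, updates, and deletes to target in one pass over changes.

    target:  dict mapping key value -> row (a dict)
    changes: iterable of rows; a row with "deleted": True removes its match
    Returns a count of each action taken.
    """
    counts = {"inserted": 0, "updated": 0, "deleted": 0}
    for change in changes:
        k = change[key]
        row = {f: v for f, v in change.items() if f != "deleted"}
        if k in target:
            if change.get("deleted"):
                del target[k]        # WHEN MATCHED AND <delete condition> THEN DELETE
                counts["deleted"] += 1
            else:
                target[k] = row      # WHEN MATCHED THEN UPDATE
                counts["updated"] += 1
        elif not change.get("deleted"):
            target[k] = row          # WHEN NOT MATCHED THEN INSERT
            counts["inserted"] += 1
    return counts


# Hypothetical target table and change set.
customers = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}}
changes = [
    {"id": 1, "name": "Anne"},    # update
    {"id": 2, "deleted": True},   # delete
    {"id": 3, "name": "Cy"},      # insert
]
result = merge(customers, changes)
```

Because each change row is examined once and routed to exactly one action, the target never needs a full reload, which is the point of the "without massive refreshes" bullet.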
  10. Step 7: Operational Data Governance
      • Delegating more responsibility to the consumer community poses a risk of inconsistent interpretation and use
      • Institute operational data governance to support versioning, lineage, and provenance
        – Metadata management
        – Data lineage
        – Archiving policies
        – Versioning policies
        – Data security and protection
      • Apache Atlas is an open source component of the Hadoop ecosystem that captures data definitions, hierarchical taxonomies, data elements and their relationships, and lineage
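Lineage tracking reduces to recording which data sets each process reads and writes, then walking that graph. The sketch below is a hypothetical in-memory model, not the Apache Atlas API; the `LineageGraph` class and the data set names are invented for illustration.

```python
from collections import defaultdict


class LineageGraph:
    """Records which inputs and process produced each data set."""

    def __init__(self):
        # output data set -> set of (input data set, process name)
        self._parents = defaultdict(set)

    def record(self, output, inputs, process):
        """Note that `process` derived `output` from each of `inputs`."""
        for source in inputs:
            self._parents[output].add((source, process))

    def upstream(self, dataset):
        """Return every data set that `dataset` depends on, directly or transitively."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for source, _process in self._parents.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Hypothetical pipeline: raw orders are deduplicated, then aggregated with a dimension.
lineage = LineageGraph()
lineage.record("clean_orders", ["raw_orders"], process="dedupe")
lineage.record("daily_sales", ["clean_orders", "dim_store"], process="aggregate")
```

Asking for the upstream set of a report answers the provenance question a consumer typically has: which sources would have to be re-examined if this number looks wrong.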
  11. Modernization: Evolving the Hybrid EDW
      • Conventional RDBMS-based data warehouses have served organizations well, but are being eclipsed by newer technologies
      • Scalable systems built on commodity components are rapidly being adopted for business intelligence and analytics applications
      • Optimize the EDW using an evolutionary approach to embracing Hadoop:
        – Expand the storage footprint
        – Increase computational power
        – Broaden the scope of application support
        – Lower costs
  12. Questions & Suggestions
      • www.knowledge-integrity.com
      • www.dataqualitybook.com
      • www.decisionworx.com
      • If you have questions, comments, or suggestions, please contact me: David Loshin, 301-754-6350, loshin@knowledge-integrity.com
  13. The Next Gen EDW is the Big Data Warehouse
      • In Forrester’s 2016 global survey, 59% of respondents stated that leveraging big data and analytics was a critical or high priority.
  14. Companies Are Looking to Big Data for EDW Optimization
      • 82% of 2550+ respondents are looking to Big Data for EDW Optimization rather than a straight replacement. (2016 Big Data Maturity Survey)
  15. Hortonworks Connected Data Platforms and Solutions
      • Hortonworks Solutions: Enterprise Data Warehouse Optimization; Cyber Security and Threat Management; Internet of Things and Streaming Analytics
      • Hortonworks Connection: Subscription Support; SmartSense; Premier Support; Educational Services; Professional Services; Community Connection
      • Cloud: Hortonworks Data Cloud; AWS; HDInsight
      • Data Center: Hortonworks Data Suite; HDF; HDP
  16. Drivers of a Modern BI Infrastructure
      • Deeper and Broader Data Sets
      • Complete Data ‘Provenance’
      • Leading Analytics and Tools
      • Integrate non-EDW data and EDW data
      • Total Cost of Ownership
  17. Open Source Transformational Impact to EDW
      • Unmatched Economics: support low cost data-center and cloud architectures for Enterprise Apache Hadoop
      • Eliminates Risk and Ensures Integration: prevents vendor lock-in and speeds ecosystem adoption of ODPi-compliant core
      [Chart: cost efficiency versus data variety; labels shown: EDW, RDBMS, proprietary, Hadoop, Hortonworks, open source]
  18. But why aren’t more companies running to this solution?
      • Risky
      • Hadoop requires a bunch of new skill sets
      • It’ll take a long time
      • There’s too much manual coding required
      • It’s hard to integrate with my BI tool stack
  19. Legacy EDW Solution
  20. Using Hadoop to Optimize the Data Warehouse
      • Augment EDW with Hive
      • Offload ETL to Hadoop
      • Data Governance
  21. Augment current EDW with Hive
      • Hive LLAP GA: interactive query in seconds, 10x faster join performance
      • Ease of Use and Adoption: SQL Standard ACID Merge
      • Enterprise Readiness: supports all TPC-DS queries
      • Streamlined Operations: Hive Views
  22. Hive 2 with LLAP: 26x Performance Boost at 1TB Scale
      [Chart: query time in seconds (lower is better) for Hive 1 / Tez and Hive 2 / LLAP, with speedup (x factor) per query]
      • Hive 2 with LLAP averages 26x faster than Hive 1
  23. Hive LLAP in HDP 2.6: Stable Performance with High Concurrency
      Concurrent Queries | Average Runtime
      5                  | 7.76s
      25                 | 36.24s
      100                | 102.89s
      • 5 to 25 concurrent queries: 5x queries, 4.6x runtime difference
      • 25 to 100 concurrent queries: 4x queries, 2.8x runtime difference
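The "runtime difference" figures on this slide follow directly from the three measurements. The short sketch below recomputes them; the measurements come from the slide, while the function and variable names are only illustrative.

```python
# (concurrent queries, average runtime in seconds), as listed on the slide.
MEASUREMENTS = [(5, 7.76), (25, 36.24), (100, 102.89)]


def scaling_steps(measurements):
    """Compare each concurrency level with the next one up."""
    steps = []
    for (q0, t0), (q1, t1) in zip(measurements, measurements[1:]):
        steps.append({
            "from": q0,
            "to": q1,
            "query_ratio": q1 / q0,     # how much the load grew
            "runtime_ratio": t1 / t0,   # how much the average runtime grew
        })
    return steps


steps = scaling_steps(MEASUREMENTS)
for step in steps:
    print(f'{step["from"]} -> {step["to"]} queries: '
          f'{step["query_ratio"]:.1f}x load, {step["runtime_ratio"]:.2f}x runtime')
```

In both steps the runtime ratio (about 4.67 and 2.84) is smaller than the load ratio (5 and 4), which is the sense in which the slide calls performance stable under high concurrency.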
  24. Offload ETL to Hadoop
      • The Problem:
        – EDWs can consume between 50% and 90% of resources just on ETL/ELT tasks.
        – These jobs interfere with more business-critical tasks like BI and advanced analytics.
      • The Solution:
        – Hive and HDP deliver ETL that scales to petabytes.
        – Economical scale-out processing on commodity servers.
      • The Result:
        – Better SLAs for mission-critical analytics.
        – Limit EDW expansion or retire old systems.
      [Diagram: ETL/ELT, data landing & deep archive, data mart, cube mart, end users and applications]
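ETL scales out because each partition of the input can be transformed independently of the others. The sketch below shows that pattern on one machine with a thread pool standing in for cluster nodes; it is an assumption-laden illustration, not Hive or HDP code, and the record fields and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Clean one record: normalize the name and convert cents to a decimal amount."""
    return {"name": record["name"].strip().title(), "amount": record["cents"] / 100}


def etl_partition(partition):
    """Transform every record in one partition, skipping rows with no amount."""
    return [transform(r) for r in partition if r.get("cents") is not None]


def run_etl(partitions, workers=4):
    """Process partitions independently, then concatenate results in partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(etl_partition, partitions)
        return [row for part in results for row in part]


# Hypothetical input, already split into two partitions.
PARTITIONS = [
    [{"name": " ann ", "cents": 1250}],
    [{"name": "BOB", "cents": None}, {"name": "cy", "cents": 300}],
]
loaded = run_etl(PARTITIONS)
```

No partition needs data from another, so adding workers (or, on a cluster, nodes) shortens the run without changing the logic.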
  25. Data Governance for EDW Optimization
      • Industry First: Dynamic Tag-based Security Policies
      • Ranger: manage access policies and audit logs
        – Policies: classification, prohibition, time, location
        – PDP, resource cache
      • Atlas: track metadata and lineage
        – Atlas Metastore: tags, assets, entities
        – Atlas client: subscribes to topic, gets metadata updates
      • Entities in the data lake: streams, pipelines, feeds, Hive tables, HDFS files, HBase tables
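The idea behind tag-based policies is that access rules attach to a classification rather than to individual tables or files, so re-tagging an asset changes who can read it without editing any policy. The sketch below illustrates only that idea; it is not the Ranger or Atlas API or their evaluation logic, and the asset names, tags, groups, and `is_allowed` function are hypothetical.

```python
# Hypothetical tag assignments, as a metadata catalog might record them.
ASSET_TAGS = {
    "hive:sales.customers.ssn": {"PII"},
    "hive:sales.orders.total": {"FINANCE"},
    "hdfs:/landing/clickstream": set(),
}

# Hypothetical policies keyed by tag rather than by resource.
TAG_POLICIES = {
    "PII": {"allow_groups": {"compliance"}},
    "FINANCE": {"allow_groups": {"finance", "compliance"}},
}


def is_allowed(asset, user_groups):
    """Allow access only if every tag on the asset admits one of the user's groups.

    Untagged assets are allowed in this sketch; a tag without a policy denies.
    """
    for tag in ASSET_TAGS.get(asset, set()):
        policy = TAG_POLICIES.get(tag)
        if policy is None or not (policy["allow_groups"] & set(user_groups)):
            return False
    return True
```

For example, adding the `PII` tag to `hdfs:/landing/clickstream` would immediately restrict it to the `compliance` group, with no change to `TAG_POLICIES`.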
  26. Use Case 1: Multi-Channel Behavioral Analysis
      • Industry: Mass Media
        – Largest broadcasting and cable company in the world by revenue
        – Multiple channels: cable (set-top box), wireless devices, streaming programming
        – 22 million+ subscribers (internet & video)
      • Results:
        – Scalability: 480B rows, 500 nodes
        – 60x query performance improvement
        – Insights: new information improves negotiations
        – Loyalty: outreach to customers viewing competitive streams; ▼ churn, ▲ revenue
      [Diagram: before/after architecture for a leading media company; components shown: channel feeds, Hortonworks HDP, Netezza data mart, AtScale Intelligence Server, Tableau + MS Excel (+ R)]
  27. Use Case 2: Campaign Paid-Search Effectiveness
      • Industry: Retail / eCommerce
        – Top US department store (by revenue)
        – Online sales $4B+ and growing (11%+ of total)
        – 800+ department stores nationwide
      • Results:
        – Scale: millions of paid keywords analyzed
        – Speed: eliminate extract step
        – Insight: operationalized closed-loop analysis → insight → decision → action
        – Impact: make and save $ millions with instant bid decisions over a 6-week season that drives 60% of annual revenue
      [Diagram: before/after architecture for a leading retailer; components shown: ad & paid keywords, Hortonworks HDP, Vertica data marts, AtScale Intelligence Server, Tableau + Excel (+ Cognos)]
  28. Use Case 3: Client and Patient Analysis
      • Industry: Managed Health Care
        – Member of Fortune 100
        – Health, life + other insurance products
        – ~52 million members; medical/dental/pharm
      • Results:
        – Scalable: BI directly on 264+ nodes of data
        – Time: eliminate data movement step
        – 62x query performance improvement
        – Speed: <2.2 second average query time
        – Insight: Tableau on Hadoop for 1000+
        – Security: access control by user; HIPAA
      [Diagram: before/after architecture for a leading managed healthcare provider; components shown: client / patient details, Hortonworks HDP, Netezza data mart, AtScale Intelligence Server, Tableau + MS Excel]
  29. Next Step:
      • Everyone will receive a free copy of the Forrester white paper titled “The Next-Generation EDW Is The Big Data Warehouse”
      • EDW Optimization with HDP
        – http://hortonworks.com/solutions/edw-optimization/
        – EDW Optimization 7 min video
  31. Thank You
