
Hadoop and Enterprise Data Warehouse


Published in: Technology, Business

  1. Hadoop and the Data Warehouse • Patrick Angeles
  2. About Me • Director of Field Engineering at Cloudera • Architect on several dozen Hadoop-based data solutions for Cloudera customers • Started with Hadoop in 2008 • First Hadoop system processed set-top box log data • Past life: Java EE / Database Architect; Web Data Mining; Cryptography / Public Key Infrastructure
  3. What is a Data Warehouse?
  4. [Quotation not captured in the transcript] Attributed to: The Oracle
  5. Database Architecture 1.0 (diagram labels: Products, Inventory, Customers, Sales, Orders, DB)
  6. Database Architecture 1.0 • Dead simple • Tables in 3rd normal form • Reports are SQL queries that join through entity relationships and aggregate: SELECT c.gender, p.product_name, sum(o.qty), sum(o.price) FROM order o, customer c, product p WHERE o.customer_id = […] AND o.product_id = […] AND […] = '2013-03-21' GROUP BY c.gender, p.product_name;
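
The report query on slide 6 lost its join operands in the transcript, so the following is only a sketch of that kind of query, not a reconstruction of the original. It uses Python's standard `sqlite3` module with a tiny in-memory schema. The column names `id` and `order_date`, the table name `orders` (`order` is a reserved word in SQL), and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def build_tx_db():
    """Create an in-memory database with a small 3NF schema (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
        -- "orders" rather than "order", which is a reserved word in SQL
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(id),
            product_id  INTEGER REFERENCES product(id),
            order_date  TEXT,
            qty         INTEGER,
            price       REAL
        );
        INSERT INTO customer VALUES (1, 'F'), (2, 'M');
        INSERT INTO product  VALUES (10, 'widget'), (11, 'gadget');
        INSERT INTO orders VALUES
            (100, 1, 10, '2013-03-21', 2, 20.0),
            (101, 1, 10, '2013-03-21', 1, 10.0),
            (102, 2, 11, '2013-03-21', 3, 45.0),
            (103, 2, 10, '2013-03-20', 5, 50.0);
    """)
    return conn


# Join through the entity relationships, filter on one day, then aggregate.
REPORT_SQL = """
    SELECT c.gender, p.product_name, SUM(o.qty), SUM(o.price)
    FROM orders o
    JOIN customer c ON o.customer_id = c.id
    JOIN product  p ON o.product_id  = p.id
    WHERE o.order_date = ?
    GROUP BY c.gender, p.product_name
    ORDER BY c.gender, p.product_name
"""


def daily_report(conn, day):
    """Run the report for one day; the day is passed as a bound parameter."""
    return conn.execute(REPORT_SQL, (day,)).fetchall()


if __name__ == "__main__":
    for row in daily_report(build_tx_db(), "2013-03-21"):
        print(row)
```

Every run of this report repeats both joins and the aggregation against the transactional tables, which is the cost the next slide sets out to reduce.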
  7. Database Architecture 1.0 • Report queries can become expensive and redundant • Build a layer of abstraction! • Materialize the data into something closer to query form • Create reporting tables: decide on the report's columns; which query criteria can be parameterized; periodicity of report generation • Denormalize and aggregate
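
One way to picture the reporting-table idea from slide 7 is a denormalized, pre-aggregated table that a scheduled job refreshes once per period, so that reports become simple filtered reads. The sketch below uses Python's standard `sqlite3` module; the table `daily_sales_report`, its columns, the daily refresh period, and the sample data are assumptions, not details taken from the slides.

```python
import sqlite3


def make_db():
    """Build normalized source tables plus an empty reporting table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE orders (customer_id INTEGER, product_id INTEGER,
                             order_date TEXT, qty INTEGER, price REAL);
        -- Reporting table: denormalized and aggregated, one row per
        -- (day, gender, product). The column choice is an assumption.
        CREATE TABLE daily_sales_report (
            order_date TEXT, gender TEXT, product_name TEXT,
            total_qty INTEGER, total_price REAL,
            PRIMARY KEY (order_date, gender, product_name)
        );
        INSERT INTO customer VALUES (1, 'F'), (2, 'M');
        INSERT INTO product  VALUES (10, 'widget'), (11, 'gadget');
        INSERT INTO orders VALUES
            (1, 10, '2013-03-21', 2, 20.0),
            (1, 10, '2013-03-21', 1, 10.0),
            (2, 11, '2013-03-21', 3, 45.0);
    """)
    return conn


def refresh_report(conn, day):
    """Rebuild the reporting rows for one day; safe to re-run for the same day."""
    with conn:  # one transaction: delete the old rows, insert the new ones
        conn.execute("DELETE FROM daily_sales_report WHERE order_date = ?", (day,))
        conn.execute("""
            INSERT INTO daily_sales_report
            SELECT o.order_date, c.gender, p.product_name,
                   SUM(o.qty), SUM(o.price)
            FROM orders o
            JOIN customer c ON o.customer_id = c.id
            JOIN product  p ON o.product_id  = p.id
            WHERE o.order_date = ?
            GROUP BY o.order_date, c.gender, p.product_name
        """, (day,))


def report(conn, day, gender):
    """Parameterized report: no joins or aggregation at read time."""
    return conn.execute(
        "SELECT product_name, total_qty, total_price FROM daily_sales_report "
        "WHERE order_date = ? AND gender = ? ORDER BY product_name",
        (day, gender),
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    refresh_report(conn, "2013-03-21")
    print(report(conn, "2013-03-21", "F"))
```

The expensive join runs once per refresh instead of once per report request; the trade-off is that the report is only as fresh as the last refresh.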
  8. Database Architecture 1.1 (diagram labels: Inventory, Customers, Sales, Orders, Products)
  9. Two Database Workloads • Transactional: record facts; write-optimized; random reads/writes; normalized schema • Analytic: reveal patterns; read-optimized; sequential reads; denormalized schema
  10. Analytical Database (2.0) (diagram labels: Customers, Inventory, Orders, Sales, Products)
  11. Analytical Database Architecture • Column-oriented storage: reduces I/O on multi-dimensional tables; improved compression; skip columns or row ranges • Massively parallel processing: query planner breaks up a task to be executed on multiple hosts • Shared-nothing architecture: cluster nodes have independent storage and memory • Slow writes, fast reads
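
The ideas on slide 11 can be illustrated with a toy model in plain Python. It is not how any particular analytical database is implemented: threads stand in for hosts, lists stand in for column files, and every name below is hypothetical. The sketch shows a column layout that lets a query read only the columns it needs, run-length encoding as one example of why columns compress well, and a planner-style fan-out over partitions that share nothing.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NAMES = ("gender", "product_name", "qty", "price")

# Row-major layout: each tuple is one complete record.
ROWS = [
    ("F", "widget", 2, 20.0),
    ("F", "widget", 1, 10.0),
    ("M", "gadget", 3, 45.0),
    ("M", "widget", 5, 50.0),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column (column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def partial_qty_by_product(partition):
    """Work for one node: touches only two columns of its own partition."""
    totals = Counter()
    for product, qty in zip(partition["product_name"], partition["qty"]):
        totals[product] += qty
    return totals


def mpp_qty_by_product(partitions):
    """Planner role: fan the task out to every partition, then merge results."""
    with ThreadPoolExecutor() as pool:  # threads stand in for separate hosts
        partials = list(pool.map(partial_qty_by_product, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


if __name__ == "__main__":
    columns = to_columns(ROWS, NAMES)
    print(run_length_encode(columns["gender"]))
    # Two "nodes", each holding its own slice of the data.
    nodes = [to_columns(ROWS[:2], NAMES), to_columns(ROWS[2:], NAMES)]
    print(mpp_qty_by_product(nodes))
```

The aggregate never reads the `gender` or `price` columns, and each partition is summed independently before the small partial results are merged.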
  12. Analytical Database (diagram labels: TX DB, Analytical DB)
  13. Data Transformation (diagram labels: TX DB, Analytical DB)
  14. Three Ways to Transform Data • Transform-Extract-Load: query from transactional tables into target schema • Extract-Load-Transform: load data into analytical database, transform, and write to target schema; no need for additional hardware • Extract-Transform-Load: read data from transactional database into a grid system, transform, then write to analytical database; least load on tx and analytical systems
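
The three orderings on slide 14 differ mainly in where the transformation runs. The following sketch makes that concrete with two in-memory SQLite databases standing in for the transactional and analytical systems, and plain Python standing in for the grid system. The table names (`orders`, `staging_orders`, `daily_qty`), the function names, and the sample rows are assumptions for illustration only.

```python
import sqlite3

AGGREGATE = "SELECT order_date, product_name, SUM(qty) FROM {table} GROUP BY order_date, product_name"


def make_tx_db():
    """Stand-in for the transactional database."""
    tx = sqlite3.connect(":memory:")
    tx.executescript("""
        CREATE TABLE orders (order_date TEXT, product_name TEXT, qty INTEGER);
        INSERT INTO orders VALUES
            ('2013-03-21', 'widget', 2),
            ('2013-03-21', 'widget', 1),
            ('2013-03-21', 'gadget', 3);
    """)
    return tx


def make_analytical_db():
    """Stand-in for the analytical database, holding the target schema."""
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE daily_qty (order_date TEXT, product_name TEXT, total_qty INTEGER)")
    return dw


def tel(tx, dw):
    """Transform-Extract-Load: the transactional DB does the aggregation."""
    rows = tx.execute(AGGREGATE.format(table="orders")).fetchall()
    with dw:
        dw.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)", rows)


def elt(tx, dw):
    """Extract-Load-Transform: load raw rows, then transform inside the analytical DB."""
    raw = tx.execute("SELECT order_date, product_name, qty FROM orders").fetchall()
    with dw:
        dw.execute("CREATE TABLE IF NOT EXISTS staging_orders "
                   "(order_date TEXT, product_name TEXT, qty INTEGER)")
        dw.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw)
        dw.execute("INSERT INTO daily_qty " + AGGREGATE.format(table="staging_orders"))


def etl(tx, dw):
    """Extract-Transform-Load: aggregate in a middle tier, outside both databases."""
    raw = tx.execute("SELECT order_date, product_name, qty FROM orders").fetchall()
    totals = {}
    for day, product, qty in raw:  # the transform step runs in Python here
        totals[(day, product)] = totals.get((day, product), 0) + qty
    with dw:
        dw.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)",
                       [(day, product, qty) for (day, product), qty in totals.items()])


def loaded(dw):
    """Read back the target table in a stable order."""
    return dw.execute("SELECT order_date, product_name, total_qty "
                      "FROM daily_qty ORDER BY product_name").fetchall()


if __name__ == "__main__":
    for pipeline in (tel, elt, etl):
        dw = make_analytical_db()
        pipeline(make_tx_db(), dw)
        print(pipeline.__name__, loaded(dw))
```

All three pipelines load the same rows into `daily_qty`; what changes is which system spends CPU on the `GROUP BY`, which matches the slide's point about load on the transactional and analytical systems.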
  15. Business Intelligence Tools (diagram labels: TX DB, Analytical DB, BI)
  16. Business Intelligence Tools • Can provide canned reports, dashboards, or interactive visualizations • Typically leverage common standards (SQL, JDBC/ODBC) to access data • Require low-latency response times from the database (sub-second or sub-minute, depending on the query)
  17. Observations • Separate transactional from analytical workloads • Use appropriate database implementation according to the workload: ‘traditional’ row-major store for transactional; MPP column-store for analytic • Consider a BI tool so you’re not stuck writing reports for analysts who don’t know SQL • Consider an ETL tool so you’re not stuck writing transformations for analysts who don’t know SQL
  18. Welcome to the Enterprise
  19. Basic Data Warehouse Architecture (diagram labels: TX DB, DW, BI)
  20. Data Marts (diagram labels: TX DB, DW, Sales, Mktg, Prch, BI)
  21. Multiple Data Sources (diagram labels: TX DB, Files, other, DW, Sales, Mktg, Prch, BI)
  22. Operational Data Store (diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI)
  23. Where’s Hadoop?
  24. No Hadoop (diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI)
  25. Adjacent System (diagram labels: TX DB, Files, other, ODS, DW, Sales, Mktg, Prch, BI)
  26. ETL Engine (diagram labels: TX DB, Files, other, DW, Sales, Mktg, Prch, BI)
  27. Tiered Data Warehouse (diagram labels: TX DB, Files, other, Sales, Mktg, Prch, BI)
  28. Analytical Query Engine (diagram labels: TX DB, Files, other, BI)
  29. Simple Database Architecture (diagram labels: Products, Inventory, Customers, Sales, Orders, DB)
  30. The future? (diagram labels: Products, Inventory, Customers, Sales, Orders)
  31. San Francisco, June 13, 2013