Hadoop and the Data Warehouse
    Patrick Angeles




1
About Me

    •   Director of Field Engineering at Cloudera
         •   Architect on several dozen Hadoop-based data solutions
             for Cloudera customers
    •   Started with Hadoop in 2008
         •   First Hadoop system processed set-top box log data
    •   Past life
         •   Java EE / Database Architect
         •   Web Data Mining
         •   Cryptography / Public Key Infrastructure



2
What is a Data Warehouse?




3
— The Oracle



4
Database Architecture 1.0




       (Diagram: Products, Customers, Orders on the left; DB in the
       center; Inventory, Sales on the right)




5
Database Architecture 1.0

     •   Dead simple
     •   Tables in 3rd normal form
     •   Reports are SQL queries that join through entity
         relationships and aggregate

                  SELECT   c.gender, p.product_name,
                           SUM(o.qty), SUM(o.price)
                  FROM     orders o   -- ORDER is a reserved word
                  JOIN     customer c ON o.customer_id = c.id
                  JOIN     product p  ON o.product_id = p.id
                  WHERE    o.day = '2013-03-21'
                  GROUP BY c.gender, p.product_name;
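
A minimal sketch of the report query above, run from Python against an in-memory SQLite database. The column types and sample rows are assumptions made up for illustration; the slide only names the columns the query touches. The order table is called `orders` here because `order` is a reserved word in SQL.

```python
import sqlite3


def build_tx_db():
    """Create a tiny normalized schema with made-up sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
        CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                               customer_id INTEGER REFERENCES customer(id),
                               product_id  INTEGER REFERENCES product(id),
                               day TEXT, qty INTEGER, price REAL);
        INSERT INTO customer VALUES (1, 'F'), (2, 'M');
        INSERT INTO product  VALUES (1, 'widget'), (2, 'gadget');
        INSERT INTO orders VALUES
            (1, 1, 1, '2013-03-21', 2, 10.0),
            (2, 2, 1, '2013-03-21', 1,  5.0),
            (3, 1, 1, '2013-03-21', 3, 15.0),
            (4, 2, 2, '2013-03-20', 1,  7.5);
    """)
    return db


def daily_report(db, day):
    """Join through the entity relationships and aggregate for one day."""
    return db.execute("""
        SELECT   c.gender, p.product_name, SUM(o.qty), SUM(o.price)
        FROM     orders o
        JOIN     customer c ON o.customer_id = c.id
        JOIN     product p  ON o.product_id = p.id
        WHERE    o.day = ?
        GROUP BY c.gender, p.product_name
        ORDER BY c.gender, p.product_name
    """, (day,)).fetchall()


report = daily_report(build_tx_db(), "2013-03-21")
for row in report:
    print(row)
```

Every run of the report repeats both joins and the aggregation against the live transactional tables.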


6
Database Architecture 1.0

     •   Report queries can become expensive, redundant
     •   Build a layer of abstraction!
     •   Materialize the data to something closer to query
         form.
     •   Create reporting tables
          •   Decide on the report’s columns
          •   Decide which query criteria can be parameterized
          •   Periodicity of report generation
          •   Denormalize and aggregate
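
One way to picture a reporting table, as a sketch only: the joins and aggregation are run once and the result is stored, so each report becomes a single-table lookup parameterized by day. The table name `daily_sales_report`, the schema, and the sample rows are hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER,
                           product_id INTEGER, day TEXT,
                           qty INTEGER, price REAL);
    INSERT INTO customer VALUES (1, 'F'), (2, 'M');
    INSERT INTO product  VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO orders VALUES
        (1, 1, 1, '2013-03-21', 2, 10.0),
        (2, 2, 1, '2013-03-21', 1,  5.0),
        (3, 1, 1, '2013-03-21', 3, 15.0),
        (4, 2, 2, '2013-03-20', 1,  7.5);

    -- Reporting table: denormalized (no joins at read time) and
    -- pre-aggregated per day. It would be rebuilt on whatever
    -- schedule the report needs.
    CREATE TABLE daily_sales_report AS
    SELECT   o.day, c.gender, p.product_name,
             SUM(o.qty) AS total_qty, SUM(o.price) AS total_price
    FROM     orders o
    JOIN     customer c ON o.customer_id = c.id
    JOIN     product p  ON o.product_id = p.id
    GROUP BY o.day, c.gender, p.product_name;
""")


def report_for(day):
    """The day is the parameterized query criterion; no joins are needed."""
    return db.execute("""
        SELECT gender, product_name, total_qty, total_price
        FROM   daily_sales_report
        WHERE  day = ?
        ORDER  BY gender, product_name
    """, (day,)).fetchall()


print(report_for("2013-03-21"))
```

The trade-off is freshness: the reporting table only reflects orders that existed when it was last rebuilt.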

7
Database Architecture 1.1




           (Diagram: Products, Customers, Orders, Inventory, Sales)




8
Two Database Workloads

           Transactional     Analytic
              Record facts   Reveal patterns

          Write-optimized    Read-optimized

      Random reads/writes    Sequential reads

       Normalized schema     Denormalized schema



9
Analytical Database (2.0)




          (Diagram: Products, Customers, Orders, Inventory, Sales)




10
Analytical Database Architecture

      •   Column oriented storage
           •   Reduces I/O on multi-dimensional tables
           •   Improved compression
           •   Skip columns or row ranges
      •   Massively Parallel Processing
           •   Query planner breaks up a task to be executed on
               multiple hosts
      •   Shared-nothing Architecture
           •   Cluster nodes have independent storage and memory
      •   Slow writes, fast reads
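
The ideas on this slide can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how any particular analytical database is implemented: the sample rows are made up, run-length encoding stands in for column compression in general, and a loop over list slices stands in for independent cluster nodes.

```python
from collections import Counter

# Hypothetical sales facts (day, product, qty), stored row by row and
# sorted by day.
ROWS = [
    ("2013-03-20", "gadget", 1),
    ("2013-03-21", "widget", 2),
    ("2013-03-21", "widget", 1),
    ("2013-03-21", "widget", 3),
]


def to_columns(rows):
    """Pivot row-major tuples into one list per column."""
    day, product, qty = (list(col) for col in zip(*rows))
    return {"day": day, "product": product, "qty": qty}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def partial_qty_by_product(shard):
    """Work done independently by one node on its own shard of rows."""
    totals = Counter()
    for _day, product, qty in shard:
        totals[product] += qty
    return totals


def mpp_qty_by_product(rows, nodes=2):
    """Split rows across nodes, aggregate each shard, merge the partials."""
    shards = [rows[i::nodes] for i in range(nodes)]
    merged = Counter()
    for shard in shards:
        merged.update(partial_qty_by_product(shard))
    return dict(merged)


columns = to_columns(ROWS)
total_qty = sum(columns["qty"])               # reads only the qty column
day_runs = run_length_encode(columns["day"])  # sorted column compresses well
qty_by_product = mpp_qty_by_product(ROWS)

print(total_qty, day_runs, qty_by_product)
```

Summing `qty` never touches the `day` or `product` values, which is the I/O saving the first bullet refers to. Writes are the slow side of the trade: appending one row means updating every column list.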

11
Analytical Database




                    (Diagram: TX DB, Analytical DB)




12
Data Transformation




                   (Diagram: TX DB, Analytical DB)




13
Three Ways to Transform Data

      •   Transform Extract Load
           •   Query from transactional tables into target schema
      •   Extract Load Transform
           •   Load data into analytical database, transform and write
               to target schema
           •   No need for additional hardware
      •   Extract Transform Load
           •   Read data from transactional database into a grid
               system, transform, then write to analytical database
           •   Least load on tx and analytical systems
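
The three orderings differ mainly in where the transformation runs. The sketch below makes that concrete with two in-memory SQLite databases standing in for the transactional and analytical systems, and plain Python standing in for the grid. The table names (`orders`, `staging_orders`, `daily_qty`) and the sample rows are hypothetical.

```python
import sqlite3
from collections import Counter

AGGREGATE = "SELECT day, product, SUM(qty) FROM {} GROUP BY day, product"


def make_tx():
    """Stand-in transactional database with made-up order rows."""
    tx = sqlite3.connect(":memory:")
    tx.execute("CREATE TABLE orders (day TEXT, product TEXT, qty INTEGER)")
    tx.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        ("2013-03-20", "gadget", 1),
        ("2013-03-21", "widget", 2),
        ("2013-03-21", "widget", 3),
    ])
    return tx


def make_dw():
    """Stand-in analytical database holding only the target schema."""
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE daily_qty (day TEXT, product TEXT, total_qty INTEGER)")
    return dw


def tel(tx, dw):
    """Transform Extract Load: the transactional DB runs the aggregation."""
    rows = tx.execute(AGGREGATE.format("orders")).fetchall()
    dw.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)", rows)


def elt(tx, dw):
    """Extract Load Transform: copy raw rows, aggregate in the analytical DB."""
    dw.execute(
        "CREATE TABLE staging_orders (day TEXT, product TEXT, qty INTEGER)")
    raw = tx.execute("SELECT day, product, qty FROM orders").fetchall()
    dw.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw)
    dw.execute("INSERT INTO daily_qty " + AGGREGATE.format("staging_orders"))


def etl(tx, dw):
    """Extract Transform Load: aggregate outside both databases."""
    totals = Counter()
    for day, product, qty in tx.execute("SELECT day, product, qty FROM orders"):
        totals[(day, product)] += qty
    dw.executemany("INSERT INTO daily_qty VALUES (?, ?, ?)",
                   [(d, p, q) for (d, p), q in totals.items()])


def run(strategy):
    """Apply one strategy to fresh databases and return the target table."""
    tx, dw = make_tx(), make_dw()
    strategy(tx, dw)
    return dw.execute(
        "SELECT * FROM daily_qty ORDER BY day, product").fetchall()


results = {name: run(fn) for name, fn in
           [("TEL", tel), ("ELT", elt), ("ETL", etl)]}
print(results)
```

All three produce the same target table; what changes is which system pays for the `GROUP BY`.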

14
Business Intelligence Tools




             (Diagram: TX DB, Analytical DB, BI)




15
Business Intelligence Tools

      • Can provide canned reports, dashboards, or
        interactive visualizations
      • Typically leverage common standards (SQL,
        JDBC/ODBC) to access data
      • Require low-latency responses from the database
        (sub-second or sub-minute, depending on the query)




16
Observations

      • Separate transactional from analytical workloads
      • Use appropriate database implementation
        according to the workload
          •   ‘Traditional’ row-major store for transactional
          •   MPP column-store for analytic
      • Consider a BI tool so you’re not stuck writing
        reports for analysts who don’t know SQL
      • Consider an ETL tool so you’re not stuck writing
        transformations for analysts who don’t know SQL


17
Welcome to the Enterprise




18
Basic Data Warehouse Architecture




             (Diagram: TX DB, DW, BI)




19
Data Marts


           (Diagram: TX DB; DW; data marts Sales, Mktg, Prch; BI)




20
Multiple Data Sources

          (Diagram: sources TX DB, Files, other; DW; data marts
          Sales, Mktg, Prch; BI)




21
Operational Data Store

       (Diagram: sources TX DB, Files, other; ODS; DW; data marts
       Sales, Mktg, Prch; BI)




22
Where’s Hadoop?




23
No Hadoop

      (Diagram: sources TX DB, Files, other; ODS; DW; data marts
      Sales, Mktg, Prch; BI)




24
Adjacent System

       (Diagram: sources TX DB, Files, other; DW, with ODS drawn
       below it; data marts Sales, Mktg, Prch; BI)




25
ETL Engine

       (Diagram: sources TX DB, Files, other; DW; data marts
       Sales, Mktg, Prch; BI)




26
Tiered Data Warehouse

             (Diagram: sources TX DB, Files, other; data marts
             Sales, Mktg, Prch; BI)




27
Analytical Query Engine

               (Diagram: sources TX DB, Files, other; BI)




28
Simple Database Architecture




        (Diagram: Products, Customers, Orders on the left; DB in the
        center; Inventory, Sales on the right)




29
The future?




        (Diagram: Products, Customers, Orders on the left;
        Inventory, Sales on the right)




30
http://www.hbasecon.com/
            San Francisco
            June 13, 2013




31
32


Editor's Notes

  • #3 Architected scores of Hadoop-based data solutions
  • #6 Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  • #9 Turns out separating the transactional vs reporting database brings other benefits
  • #11 I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  • #13 2 other major components that haven’t been mentioned
  • #14 I don’t need up-to-the-minute reports. Copy data to reporting DB. Now workloads don’t conflict. I can now have a different reporting schema. Faster queries. Now I have to worry about transforming data. I can now use different technology.
  • #15 Not a trivial thing: there’s an X-billion-dollar market segment dedicated to making this easier (Informatica, Pervasive, Ab Initio, Pentaho). Speaking of making things easier…
  • #16 Two things this allows you to do: use different underlying architectures for each database
  • #17 Not a trivial thing: there’s an X-billion-dollar market segment dedicated to making this easier (Informatica, Pervasive, Ab Initio, Pentaho). Speaking of making things easier…
  • #20 Two things this allows you to do: use different underlying architectures for each database
  • #21 Data marts designed for specific department needs. Kimball?
  • #22 Two things this allows you to do: use different underlying architectures for each database
  • #23 Ralph Kimball: The Data Warehouse Toolkit. Bill Inmon: Building the Data Warehouse.
  • #26 Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  • #27 Challenge with normal grid-based ETL is you have to load data from source systems. Hadoop’s cost-efficient storage allows enterprises to store source data in Hadoop, thereby replacing the ETL grid. You could also forgo the ODS if there is one in the architecture. Option to enrich data that is published to the DW by running analytics not available to traditional DW/BI stack. E.g., clustering, classification, statistical
  • #28 Store long-term data. Transform and load to data marts.
  • #29 Store long-term data. BI tools can readily query data in Hadoop using Impala.
  • #30 Doesn’t scale. Limited storage. Concurrent writes / queries. What if I want different reports?
  • #31 Support for insert/update semantics? HBase with typed columns.