How Real Time Data
Requirements Change the Data
Warehouse Environment
Mark Madsen – September 17, 2008
www.ThirdNature.net




                     Attribution-NonCommercial-No Derivative
                     http://creativecommons.org/licenses/by-nc-nd/3.0/us/
Outline
 What’s real-time about?
 Impacts on the data warehouse architecture
        Delivering data to users
        Extracting the data
        Storing the data
        Operations
 Getting started


Third Nature, January 2008    Mark Madsen   Slide 2
Speeding Up the Data Warehouse



Why?
Faster reaction time

Reduced decision time

New process capabilities




Which Decisions Benefit?
                     Strategic                              Operational
    Decision time    Flexible, long cycle                   Constrained, short cycle
    Decision scope   Broad, organizational                  Narrow, departmental or process
    Decision model   Complex                                Simple
    Data latency     High, history is core to decisions     Low, recent data is core to decisions
    Data scope       Many sources, many types, aggregated   Few sources, structured, detailed

      Most real time needs will be driven by operational decision
      making, not strategic decisions.
Strategy, Decisions and Data Latency

Goal       Increase share of low to mid market customers

Strategy   Reduce cost of products sold  |  Improve promotional performance

Tactics    Efficient sourcing  |  Decrease Out of Stocks

           Consolidate suppliers  |  Improve delivery compliance  |  Catch out of stocks before they occur

BI Needs   Reports & spreadsheets  |  Dashboards, alerts & scorecards  |  Real time alerts & embedded analytics
What People Are Doing Today
            Monthly   Weekly   Daily   Multiple times per day   On demand
  2002      32%       34%      69%     15%                      6%
  2004      27%       29%      65%     30%                      19%
  2006      3, 24, 44, 29 (four values; column alignment lost in the chart)

                                                        Sources: TDWI, Gartner


         At the same time, data volumes are rising for most data
         warehouses at 50% to 100% per year.
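
To make the growth range concrete, here is a small compound-growth sketch. The 10 TB starting size and the three-year horizon are assumptions chosen for illustration; they do not come from the TDWI or Gartner figures.

```python
def projected_size_tb(start_tb: float, annual_growth: float, years: int) -> float:
    """Compound a starting volume by a fixed annual growth rate."""
    return start_tb * (1.0 + annual_growth) ** years


# Hypothetical 10 TB warehouse, three years out
low = projected_size_tb(10.0, 0.50, 3)   # growing 50% per year
high = projected_size_tb(10.0, 1.00, 3)  # growing 100% per year
print(f"{low:.2f} TB to {high:.2f} TB")  # 33.75 TB to 80.00 TB
```

Any real time design therefore has to hold up while the underlying volume is multiplying several times over the same planning period.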

BI Efforts Involving Real Time Data Access
                             Terms you may hear from the
                             BI market that imply real time:
                                   Operational BI
                                   Embedded analytics
                                   Decision automation
                                   Complex event processing
                                   Event-driven BI
                                   Process-driven BI

                             They are all similar in
                             requiring some level of low
                             latency data access.
Impacts on the DW Architecture
     Data Consumers:       Databases, Dashboards, OLAP, Productivity, BAM/BPM, Reporting, Analytics, Applications
     Delivery
     DW Platforms:         Warehouse Database, Mart, ODS, Content Store
                           ETL, EDR, EII
     Source Environments:  Databases, Documents, Flat Files, XML, Queues, ERP, Applications

     Adding current data to the system requires effort at all three layers.
One Architecture or Two?
   In-line with process:
      • Real time data flows separately
        from the warehouse data
      • May include a low-latency data
        store in the real time environment
      • This model may be needed for
        extremely low latency data
      • More applicable for event-driven
        BI usage
      [Diagram labels: Process, RT BI; Batch, DW, BI]

   Out of band:
      • Data to the consumer first flows
        through the DW
      • Unified architecture for both low
        and high latency data
      • More applicable for on-demand
        BI usage
      [Diagram labels: Process, DW, BI & RT BI]


User Interface: Two BI Usage Models
                                Demand driven
                                  • Users ask for current data
                                  • Most BI tools work this way
                                  • Harder to adapt these tools to
                                    event-driven models


                                Event driven
                                  • System takes action based on
                                    data, e.g. alerts, rule engines
                                  • May not have (or need) an end
                                    user interface
                                  • Need understanding of decision
                                    & action process for this model
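
The difference between the two usage models can be sketched in a few lines. This is an illustration only: the `StockEvent` record, its field names, and the four-hour threshold are hypothetical and not part of the presentation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class StockEvent:
    """Hypothetical inventory event; the field names are illustrative."""
    sku: str
    on_hand: int
    hourly_sales: float


def hours_of_cover(event: StockEvent) -> float:
    """Hours until the item runs out at the current sales rate."""
    if event.hourly_sales <= 0:
        return float("inf")
    return event.on_hand / event.hourly_sales


def event_driven(events: Iterable[StockEvent],
                 rule: Callable[[StockEvent], bool]) -> List[str]:
    """The system checks every arriving event and acts; no end user interface."""
    return [f"ALERT {e.sku}" for e in events if rule(e)]


def demand_driven(events: Iterable[StockEvent], sku: str) -> Optional[int]:
    """A user asks for the current value; nothing happens until they ask."""
    latest = None
    for e in events:
        if e.sku == sku:
            latest = e.on_hand
    return latest


events = [
    StockEvent("A-100", on_hand=40, hourly_sales=20.0),   # 2 hours of cover
    StockEvent("B-200", on_hand=500, hourly_sales=10.0),  # 50 hours of cover
]
alerts = event_driven(events, rule=lambda e: hours_of_cover(e) < 4.0)
print(alerts)                          # ['ALERT A-100']
print(demand_driven(events, "B-200"))  # 500
```

Note that the rule encodes a decision ("act when cover drops below four hours"), which is why the event-driven model needs an understanding of the decision and action process before anything is built.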
BI Tools Need New Capabilities
    Embedding BI within
    applications
        • UI embedding
        • Full embedding
    Event-based integration
    Feeding BI data to
    applications: services, not
    SQL, may be desired

    Custom UI code may be
    preferable to a BI tool

The Data Integration Layer
    • Integration is the most complex
      element of adding real time data.
    • Inline vs. out of band, demand vs.
      event-driven BI usage create
      different DI requirements.
    • You may not have exactly the
      same metrics, attributes or data
      extract logic.
    • Don’t count on replacing the ETL
      batch; more likely you are
      augmenting it.
    • You probably need to add new DI
      technologies to your portfolio.
    • Batch performance design isn’t
      like real time design.
Speeding Up Data Integration Methods


      Single batch -> Frequent batch -> Mini-batch -> Continuous load -> Streaming

      Hourly+                                                            Immediate
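
As one point on this spectrum, a mini-batch load might look like the following sketch. It uses Python's built-in `sqlite3` module as a stand-in for the warehouse; the `sales` table, the row layout, and the batch size are assumptions made for illustration.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # (sku, quantity); hypothetical record layout


def mini_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def load(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int) -> int:
    """Insert rows one mini-batch per transaction; return the number of batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sku TEXT, qty INTEGER)")
    batches = 0
    for batch in mini_batches(rows, batch_size):
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO sales VALUES (?, ?)", batch)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
source = [("A-100", 1), ("B-200", 3), ("A-100", 2), ("C-300", 5), ("B-200", 1)]
n_batches = load(conn, source, batch_size=2)
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(n_batches, total_rows)  # 3 5
```

In this sketch the batch size is the dial: a batch covering the whole extract behaves like a single batch, while a batch size of 1 approaches a continuous load, with more transactions and more overhead per row.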


The Platform Layer: Data and Database
                                   • Schemas will need changes.
                                   • You don’t need to convert the
                                     entire database to a real time
                                     schema.
                                   • One schema or two?
                                   • Event-driven BI creates
                                     different query patterns and
                                     workloads.
                                   • Configuration and tuning may
                                     be different than what you are
                                     used to with traditional BI.
                                   • Application developers want
                                     services or ORMs, not SQL.
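
One possible answer to "one schema or two?" is to keep two tables and hide the split behind a view and a service-style function. The sketch below uses `sqlite3` as a stand-in database; the table names, the view, and the `units_sold` function are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a history table, a current table, and a view over both."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales_history (sku TEXT, qty INTEGER);  -- loaded by the batch
        CREATE TABLE sales_current (sku TEXT, qty INTEGER);  -- fed continuously
        CREATE VIEW sales_all AS
            SELECT sku, qty FROM sales_history
            UNION ALL
            SELECT sku, qty FROM sales_current;
        INSERT INTO sales_history VALUES ('A-100', 120), ('B-200', 80);
        INSERT INTO sales_current VALUES ('A-100', 7);
    """)
    return conn


def units_sold(conn: sqlite3.Connection, sku: str) -> dict:
    """Service-style accessor: callers get a plain dict and never write SQL."""
    row = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM sales_all WHERE sku = ?", (sku,)
    ).fetchone()
    return {"sku": sku, "units_sold": row[0]}


conn = build_demo_db()
print(units_sold(conn, "A-100"))  # {'sku': 'A-100', 'units_sold': 127}
```

Only the small current table takes the real time load, the existing history schema stays as it is, and application developers call a function instead of composing queries.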


Different Platform Workloads
     Data Consumers:       Databases, Dashboards, OLAP, Productivity, BAM/BPM, Reporting, Analytics, Applications
     Delivery
     DW Platforms:         Warehouse Database, Mart, ODS, Content Store
                           ETL, EDR, EII
     Source Environments:  Databases, Documents, Flat Files, XML, Queues, ERP, Applications

     Three workloads: Data loading + Normal BI + Real time BI = complications
Development, Maintenance & Operations
                                    • Real time decisions on real
                                      time data mean data
                                      quality plays a larger role,
                                      and it’s harder to address.
                                    • Warehouse availability
                                      becomes much more
                                      important to the business,
                                      and it isn’t just the
                                      database – it’s everything.
                                    • Performance and meeting
                                      strict BI SLAs will rise in
                                      importance since you are
                                      now tied in to business
                                      operations.
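
A data freshness check is one simple way to watch an SLA of this kind. The sketch below is an assumption-laden illustration: the table names, the SLA values, and the `freshness_breaches` function are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def freshness_breaches(last_loaded: Dict[str, datetime],
                       sla: Dict[str, timedelta],
                       now: datetime) -> List[str]:
    """Return the tables whose newest data is older than their SLA allows."""
    return sorted(
        table for table, loaded_at in last_loaded.items()
        if now - loaded_at > sla[table]
    )


now = datetime(2008, 9, 17, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "sales_current": now - timedelta(minutes=12),  # real time feed, running late
    "sales_history": now - timedelta(hours=20),    # nightly batch, on schedule
}
sla = {
    "sales_current": timedelta(minutes=5),
    "sales_history": timedelta(hours=24),
}
print(freshness_breaches(last_loaded, sla, now))  # ['sales_current']
```

A twelve-minute delay would go unnoticed in a nightly batch warehouse; against a five-minute SLA it is an operational incident.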

A Prescription for Getting Started
    1. Start with a decision
       process
    2. Define data needs for the
       process
    3. Ensure that data is
       available at the right
       latency
    4. Determine appropriate
       data integration
       technologies.
    5. Design and initiate
       upstream work
    6. Build
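
Steps 2, 3 and 5 can be sketched as a simple gap check: list the data a decision process needs, compare the latency it requires with what the source delivers today, and flag the gaps as upstream work. The `DataNeed` record and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataNeed:
    """One data element a decision process depends on (illustrative fields)."""
    name: str
    required_latency_s: int   # what the decision process needs
    available_latency_s: int  # what the source can deliver today


def needs_upstream_work(needs: List[DataNeed]) -> List[str]:
    """Return the data needs whose current latency is too high for the decision."""
    return [n.name for n in needs if n.available_latency_s > n.required_latency_s]


# Hypothetical decision process: replenish a shelf before it runs out
needs = [
    DataNeed("on-hand inventory", required_latency_s=300, available_latency_s=86_400),
    DataNeed("supplier lead time", required_latency_s=86_400, available_latency_s=86_400),
]
print(needs_upstream_work(needs))  # ['on-hand inventory']
```

Only the flagged items need a faster integration method; the rest can stay on the existing batch.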
Thanks




CC Image Attributions
    Thanks to the people who supplied the creative commons licensed images used in this presentation:
    • Divers - http://flickr.com/photos/raveller/
    • Fast dog - http://flickr.com/photos/marinacvinhal/379111290/
    • Febo - http://flickr.com/photos/igor/419425754/
    • Subway - http://flickr.com/photos/neilsphotoalbum/504517855/
    • Cadillac ranch - http://flickr.com/photos/whatknot/179655095/




About the Presenter
                            Mark Madsen is president of Third
                            Nature, a technology research and
                            consulting firm focused on business
                            intelligence, data integration and
                            data management. Mark is an
                            award-winning author, architect and
                            CTO whose work has been featured
                            in numerous industry publications.
                            Over the past ten years Mark has
                            received awards for his work from
                            the American Productivity & Quality
                            Center, TDWI, and the Smithsonian
                            Institution. He is an international
                            speaker, a contributing editor at
                            Intelligent Enterprise, and manages
                            the open source channel at the
                            Business Intelligence Network. For
                            more information or to contact Mark,
                            visit http://ThirdNature.net.
