Accelerated Analytics for the Big Data Fabric
       Bay Area Hadoop User Group




       © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555
AGENDA



• The Big Data Fabric
• Big Data Preparation – An Everyday Challenge
• Use-Case Scenario – Call Volume Analysis
   – Solution Requirements
   – Solution Workflow
   – Phase I - Data Preparation & Visualization
   – Phase II - Pentaho MapReduce & Orchestration
• Summary




The Big Data Fabric




• Big Analytics
   – Pentaho Business Analytics: Visualization, Dashboards, Interactive Analysis, Reports
   – 3rd Party Tools: R, 3rd Party BI Tools, Applications
• Data Integration
   – Data Integration, Job Orchestration, Workflow
   – Scheduling, High Performance, Visual IDE
• Big Data Mgmt
   – Hadoop, NoSQL Databases, Analytic Databases
Preparing Big Data for Analysis is an Everyday Challenge

• Highly technical skills required
• Divide between MapReduce developers and analysts
• Beyond the reach of many organizations




Pentaho Visual MapReduce




• Accessible to any ETL developer, business analyst or data scientist
• Executes inside Hadoop as a native Java MapReduce task
Pentaho Reporting & Analytics




• Batch Reporting and Ad Hoc Query
• Data Visualization, Discovery and Analysis
• Data sources: Hadoop, NoSQL, Hybrid
Use Case Scenario – Call Volume Analysis

• A VoIP service provider has excess capacity and is
  considering expansion to consumer markets
• Business Analyst: what are the top 10 states for
  inbound calls on Fridays, Saturdays and Sundays?
• Research data available:
   – Call records – date/timestamp & destination phone #
   – NANP (North American Numbering Plan) data – area
     code by country, state & time zone




Solution Requirements

• Data Preparation
   – Access the call records in HDFS
   – Extract the destination area code for each call
   – Read the area code reference data
   – Look up country, state and time zone by area code, append to each
     record
   – Filter out records (non-U.S. calls, calls made Monday through Thursday)
   – Load to a relational database
   – Generate metadata
• Analysis
   – Explore data multi-dimensionally
   – Find the top-10 states by inbound call volume
   – Navigate via a geospatial interface
• Deployment
   – Deploy in MapReduce to handle larger data volumes
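
The data preparation steps above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not Pentaho's implementation: the NANP rows, the timestamp format and all function names are assumptions made for the example, and the load to a relational database is left out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical NANP reference rows: area code -> (country, state, time zone).
NANP = {
    "415": ("US", "CA", "Pacific"),
    "212": ("US", "NY", "Eastern"),
    "416": ("CA", "ON", "Eastern"),
}

# datetime.weekday() numbering: Friday=4, Saturday=5, Sunday=6.
KEPT_WEEKDAYS = {4, 5, 6}


def extract_area_code(phone):
    """Return the 3-digit area code of a 10-digit number, or None if malformed."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits[:3] if len(digits) == 10 else None


def prepare(call_records):
    """Extract, look up and filter call records given as (timestamp, phone) pairs."""
    for timestamp, phone in call_records:
        area_code = extract_area_code(phone)
        reference = NANP.get(area_code)
        if reference is None:
            continue  # unknown or malformed area code
        country, state, time_zone = reference
        # Assumed timestamp layout; the slides do not specify one.
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        if country != "US" or when.weekday() not in KEPT_WEEKDAYS:
            continue  # drop non-U.S. calls and Monday-Thursday calls
        yield {
            "timestamp": when,
            "area_code": area_code,
            "country": country,
            "state": state,
            "time_zone": time_zone,
        }


def top_states(rows, n=10):
    """Return the n states with the most calls as (state, count) pairs."""
    return Counter(row["state"] for row in rows).most_common(n)
```

Calling top_states(prepare(records)) on a list of (timestamp, phone) pairs gives the ranking the business analyst asks for; in the deck, the same steps are built visually as pipeline components.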

Solution Workflow


• Phase I - Business Analysts
   – Use a data extract to prepare and validate their analyses
   – Iterate over requirements with executives and stakeholders


• Phase II - MapReduce Developers/Analysts
   – Create production Pentaho MapReduce transformations
   – Manage the deployment and orchestration between the
     Hadoop cluster and the production database




Data Preparation (Phase I)




• The data pipeline implements the data preparation logic
• Each component has a “personality” – access, calculate, join, filter …
• Free-form design
    – As many or as few inputs, transformations and outputs as needed
• Schema contract exists only between connected components
• Pipelined, multi-threaded for performance
• 100% Java-based for deployment flexibility



Data Pipeline – Input from HDFS




Data Pipeline – Calculator




Data Pipeline – Stream Lookup




Data Pipeline – Row Filter




Data Pipeline – Table Output




Visualization – Multi-Dimensional UX




Visualization – Geographic




Visualization – Heatmap




Deployment to Hadoop (Phase II)


• To process a larger data set, we can deploy the data pipeline via
  MapReduce
    – Input and output streams are encoded as key-value pairs
    – Two specialized components provide an interface
    – A special job component deploys the data pipeline to the Hadoop
      cluster
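
To make the key-value encoding concrete, here is a minimal sketch in plain Python that simulates a map and a reduce phase locally. It does not use Pentaho or Hadoop APIs; the comma-separated input layout and the names map_call, reduce_counts and run_local are assumptions for illustration only.

```python
from itertools import groupby
from operator import itemgetter


def map_call(key, value):
    """Map phase: key is a record offset (unused), value is one text line.

    Assumed line layout: "timestamp,country,state".
    Emits (state, 1) for each U.S. call.
    """
    _timestamp, country, state = value.split(",")
    if country == "US":
        yield state, 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one state."""
    yield key, sum(values)


def run_local(pairs):
    """Simulate map, shuffle (sort and group by key) and reduce in memory."""
    mapped = [kv for key, value in pairs for kv in map_call(key, value)]
    mapped.sort(key=itemgetter(0))  # the shuffle groups equal keys together
    results = []
    for state, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reduce_counts(state, (count for _, count in group)))
    return results
```

The point of the sketch is the contract: both phases consume and emit key-value pairs, so the record-level logic in the middle does not need to know whether it runs on a desktop extract or on a cluster.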




Pentaho MapReduce – Inputs/Outputs



The core logic of the data pipeline is identical … only the ends change




Pentaho MapReduce – Orchestration




Instant Analytics (Roadmap)




• Choose a Big Data Source, Answer a Few Questions, Publish to Pentaho
• Report, Explore and Analyze
• Customize Model (Optional)
SUMMARY



1. The Big Data Fabric encompasses a large collection of Hadoop
   distributions, NoSQL and analytical databases
2. A component-based approach to data access and integration can:
   – Allow business analysts and data scientists to perform their own data
     preparation
   – Result in more rapid validation of business requirements & metrics
   – Be used to create data pipelines that can be deployed directly to a
     cluster, enabling analytics against much larger data sets
   – Support orchestration across environments




Summary




Thank You
Join the conversation. You can find us on:

     http://blog.pentaho.com

     @Pentaho

     Facebook.com/Pentaho

     Pentaho Business Analytics





Editor's Notes

  • #7 Leveraging PDI (Pentaho Data Integration) to incorporate Big Data into your data fabric provides immediate access to analytics. Examples:
     – Batch and ad hoc reporting directly against Big Data sources using familiar BI tools with no coding (Report Designer, Interactive Reporting)
     – An agile framework to quickly generate, house and manage data marts for interactive analysis, data discovery, etc.