Coordinating the Many
        Tools of Big Data
Strata 2013

Alan F. Gates
@alanfgates




                              Page 1
Big Data = Terabytes, Petabytes, …




Image Credit: Gizmodo
             © Hortonworks 2013
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
  2012 on calculations Twitter is doing via UDFs in Pig.
  This equation uses stochastic gradient descent to do
  machine learning with their data:



   w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)

   (ℓ is the loss function, γ(t) the learning rate at step t)




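The update above can be sketched in a few lines. This is an illustrative stand-in (plain Python, squared loss on a one-feature linear model), not Twitter's actual Pig UDF code; all names and data are invented for the example:

```python
# Illustrative stochastic gradient descent, mirroring the slide's update:
#   w(t+1) = w(t) - gamma(t) * gradient of loss(f(x; w(t)), y)

def sgd(samples, w=0.0, gamma=0.1, epochs=50):
    """samples: list of (x, y) pairs; fits y ~ w * x."""
    for t in range(epochs):
        step = gamma / (1 + t)            # decaying learning rate gamma(t)
        for x, y in samples:
            pred = w * x                  # f(x; w(t))
            grad = 2 * (pred - y) * x     # d/dw of (pred - y)^2
            w -= step * grad              # the update on the slide
    return w

# Fit y = 3x from a handful of points.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w_hat = sgd(data)
```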
And New Tools
• Apache Hadoop brings with it a large selection of tools
  and paradigms
   – Apache HBase, Apache Cassandra – Distributed, high-volume
     reads and writes of individual data records
   – Apache Hive - SQL
   – Apache Pig, Cascading – Data flow programming for ETL, data
     modeling, and exploration
   – Apache Giraph – Graph processing
   – MapReduce – Batch processing
   – Storm, S4 – Stream processing
   – Plus lots of commercial offerings




Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
  SAS).



[Diagram: separate systems, each on its own platform – Data Warehouse, Data Mart, Cube/MOLAP, OLTP, Statistical Analysis]
Cloud: Many Tools One Platform
   • Users no longer want to be concerned with what platform their data is in – just
     apply the tool to it
   • SQL no longer the only or primary data access tool

[Diagram: the same workloads – Data Warehouse, Data Mart, Cube/MOLAP, OLTP, Statistical Analysis – converging on one shared platform]
Upside - Pick the Right Tool for the Job




Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user-defined function interfaces




Downside – Wasted Developer Time
• Wastes developer time since each tool supplies redundant
  functionality

[Diagram: Pig's stack (Parser, Optimizer, Physical Planner, Executor) beside Hive's (Parser, Metadata, Optimizer, Physical Planner, Executor), with the overlapping components highlighted]
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense




Hadoop = Distributed Data Operating System

Service                      Hadoop Component
Table management             Hive
Access to metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez

(Slide color legend: exists / pieces exist in this component / new project)
HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
  Hadoop
• Presents tools with a table paradigm that abstracts away
  storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access




[Diagram: Hive and the Metastore at the center; Pig reads and writes tables through HCatLoader, MapReduce through HCatInputFormat, and external systems reach the metadata over REST via WebHCat]
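To make the REST path concrete, an external system might ask WebHCat for a database's tables with a URL like the one below. The host and user are assumptions for illustration; WebHCat's default port is 50111:

```python
from urllib.parse import urlencode

def webhcat_list_tables_url(host, db, user, port=50111):
    """Build the WebHCat (Templeton) URL that lists the tables in a database."""
    qs = urlencode({"user.name": user})
    return f"http://{host}:{port}/templeton/v1/ddl/database/{db}/table?{qs}"

# An external system would GET this URL and receive a JSON list of tables.
url = webhcat_list_tables_url("hcat.example.com", "default", "alan")
```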
Tez – Moving Beyond MapReduce
• Low level data-processing execution engine
• Use it for the base of MapReduce, Hive, Pig, Cascading
  etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
  the queue between steps in the pipeline
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• Built on YARN



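The difference between materializing output between jobs and pipelining can be sketched with generators. This is a toy analogy, not Tez code: the MR-style version writes out each stage's full result before the next stage reads it, while the pipelined version streams rows straight through:

```python
def mr_style(rows):
    # Each "job" materializes its full output (as each MapReduce job
    # writes to HDFS) before the next one starts.
    stage1 = [r * 2 for r in rows]
    stage2 = [r + 1 for r in stage1]
    return sum(stage2)

def pipelined(rows):
    # Generators stream rows through both steps with no intermediate copy,
    # roughly how Tez avoids writing intermediate output.
    stage1 = (r * 2 for r in rows)
    stage2 = (r + 1 for r in stage1)
    return sum(stage2)
```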
Pig/Hive-MR versus Pig/Hive-Tez

    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

[Diagram: on MapReduce this plan runs as three jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier (an intermediate HDFS write) between each pair; on Tez the same plan runs as a single job with no barriers]
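To make the plan concrete, here is the query's logic in plain Python over made-up rows (the tables and values are invented for illustration; the talk shows only the SQL, and this simple lookup join assumes unique keys in b and c):

```python
def run_query(a, b, c):
    """JOIN a-b on id, JOIN a-c on itemId, then per-state COUNT(*) and AVG(price)."""
    b_ids = {row["id"] for row in b}                        # join key index for b
    prices = {row["itemId"]: row["price"] for row in c}     # join key index for c
    groups = {}
    for row in a:
        if row["id"] in b_ids and row["itemId"] in prices:  # both joins match
            cnt, total = groups.get(row["state"], (0, 0.0))
            groups[row["state"]] = (cnt + 1, total + prices[row["itemId"]])
    return {state: (cnt, total / cnt) for state, (cnt, total) in groups.items()}

a = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "OR"}]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]
result = run_query(a, b, c)
```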
FastQuery: Beyond Batch with YARN
• Tez generalizes Map-Reduce: simplified execution plans process data
  more efficiently
• Always-on Tez service: low-latency processing for all Hadoop data
  processing
Knox – Single Sign On




Today’s Access Options
• Direct Access
   – Access services via REST (WebHDFS, WebHCat)
   – Needs knowledge of and access to the whole cluster
   – Security handled by each component in the cluster
   – Kerberos details exposed to users

         User --{REST}--> Hadoop Cluster

• Gateway / Portal Nodes
   – Dedicated nodes behind the firewall
   – Users SSH to a node to access Hadoop services

         User --SSH--> GW Node --> Hadoop Cluster
Knox Design Goals
• Operators can firewall cluster without end user access to
  “gateway node”
• Users see one cluster end-point that aggregates
  capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
  easier
• Enable integration with enterprise and cloud identity
  management environments




Perimeter Verification & Authentication
• Authentication: establish identity at the gateway by authenticating
  against an identity provider (KDC, AD, or LDAP)
• Verification: the cluster verifies the identity token the gateway
  forwards (SAML, propagation of identity)

[Diagram: Client --{REST}--> Knox Gateway --> Hadoop Cluster; the gateway authenticates users against the user store (KDC, AD, LDAP), then forwards requests to WebHDFS (NameNode, DataNodes) and WebHCat (JobTracker, Hive, HCatalog), where the propagated identity is verified]
Thank You





Editor's Notes

  • #3 This is how we tend to think of Big Data.
  • #6 Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
  • #7 Hive is the only SQL-based app in this pile. The other apps are still in the picture; it’s not like Hadoop is displacing everything.