Hadoop World 2011

Hadoop Stack: Then, Now and Future
In the beginning

   Hadoop was a platform for data storage and processing that is…
        Scalable
        Fault tolerant
        Open source

   CORE HADOOP COMPONENTS
        Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
        MapReduce: Distributed Computing Across Physical Servers

   Flexibility
        A single repository for storing, processing & analyzing any type of data
        Not bound by a single schema

   Scalability
        Scale-out architecture divides workloads across multiple nodes
        Flexible file system eliminates ETL bottlenecks

   Low Cost
        Can be deployed on commodity hardware
        Open source platform guards against vendor lock-in




©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
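
The MapReduce component named on the slide above can be illustrated with a minimal, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one Python process instead of across physical servers. It only shows the map, group-by-key, and reduce steps that the framework would distribute.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

In a real cluster each map call could run on the node that holds the input block, which is the point of pairing distributed storage with distributed computing.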
A good start

     Apache Hadoop, stack layers (top to bottom):
        Shell / CLI
        Data Processing | Resource Management
        File storage
        Formats | RPC | Compression




Core use cases

    • Data processing
      – Search index building
      – Click sessionization
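
As an illustration of click sessionization, here is a minimal sketch that groups click timestamps per user and starts a new session whenever the gap between consecutive clicks exceeds a threshold. The 30-minute threshold, the `(user_id, timestamp)` input shape, and the `sessionize` name are assumptions made for this example; the slides do not specify them.

```python
from collections import defaultdict

# Assumed inactivity threshold: a gap longer than this starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp_seconds) clicks into per-user sessions.

    Returns {user_id: [[ts, ts, ...], [ts, ...], ...]} with timestamps sorted.
    """
    by_user = defaultdict(list)
    for user, ts in clicks:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > gap:
                user_sessions.append([cur])    # gap too long: new session
            else:
                user_sessions[-1].append(cur)  # same session continues
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    clicks = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 100)]
    print(sessionize(clicks))
```

Grouping by user and then sorting maps naturally onto a shuffle keyed by user ID, which is one reason this workload suited batch processing.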




We were here
Core Hadoop as % of new patches:
   2006: 100%
   2007: 100%
   2008: 37%
   2009: 58%
   2010: 37%
   2011: 31%

Relevant projects:
   2006: Core Hadoop
   2007: Core Hadoop
   2008: Core Hadoop, HBase, Zookeeper, Mahout
   2009: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive
   2010: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop
   2011: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop, HCatalog, Mrunit, Bigtop, Oozie




First cut at the system



     Stack layers (top to bottom):
        Shell / CLI
        Languages | Libraries | Workflow
        Data Processing | Resource Management
        Metadata storage
        Record storage
        File storage
        Coordination
        Formats | RPC | Compression




Underlying projects & communities
     Stack layers (top to bottom):
        Shell / CLI
        Languages | Libraries | Workflow
        Data Processing | Resource Management
        Metadata storage
        Record storage
        File storage
        Coordination
        Formats | RPC | Compression

     Projects labeled on the diagram:
        Apache Hadoop
        Apache Pig, Hive, Mahout
        Apache Hive
        Apache HBase
        Apache Zookeeper
        Apache Avro

Core use cases

    • Data processing
      – Search index building
      – Click sessionization
      – Data processing pipelines
    • Analytics
      – Machine learning
      – Batch reporting
    • Live content serving (for the braver folks)

We were here
Core Hadoop as % of new patches:
   2006: 100%
   2007: 100%
   2008: 37%
   2009: 58%
   2010: 37%
   2011: 31%

Relevant projects:
   2006: Core Hadoop
   2007: Core Hadoop
   2008: Core Hadoop, HBase, Zookeeper, Mahout
   2009: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive
   2010: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop
   2011: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop, HCatalog, Mrunit, Bigtop, Oozie




Where we are today



     Stack layers (top to bottom):
        Web | Shell / CLI | Drivers
        Languages | Libraries | Workflow | Scheduling
        Data Processing | Resource Management
        Metadata storage
        Record storage
        File storage
        Coordination
        Formats | RPC | Authentication | Compression

     Integration:
        Files
        RDBMS
        Logs & events




Where we are today
     Stack layers (top to bottom):
        Web | Shell / CLI | Drivers
        Languages | Libraries | Workflow | Scheduling
        Data Processing | Resource Management
        Metadata storage
        Record storage
        File storage
        Coordination
        Formats | RPC | Authentication | Compression

     Integration:
        Files
        RDBMS
        Logs & events

     Projects labeled on the diagram:
        Hue
        Apache Pig, Hive, Mahout
        Apache Oozie
        JDBC / ODBC
        Apache Hadoop
        Apache Sqoop
        Apache Hive, HCatalog
        Apache HBase
        Apache Flume
        Apache Zookeeper
        Apache Avro

Core use cases

 • Data processing
     – Search index building
     – Click sessionization
     – Data processing pipelines
 • Analytics
     – Machine learning
     – Batch reporting
 • Real time applications
     – Content serving
     – System management
     – Real-time aggregates & counters
 • Storage
     – EDW archive



Current state
Core Hadoop as % of new patches:
   2006: 100%
   2007: 100%
   2008: 37%
   2009: 58%
   2010: 37%
   2011: 31%

Relevant projects:
   2006: Core Hadoop
   2007: Core Hadoop
   2008: Core Hadoop, HBase, Zookeeper, Mahout
   2009: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive
   2010: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop
   2011: Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop, HCatalog, Mrunit, Bigtop, Oozie




Limitations

     Redundancy - DAG, RPC, serialization, integration, etc.

     Uniformity - different components require different DBs,
     management interfaces, etc.

     Ease of use - improving but still an obstacle. E.g. non-native
     file formats require integration.

     Multi-datacenter - cross-DC replication for HBase but not HDFS.

     Interoperability - requires conversions, end-user integration.




Ongoing work


 Metadata repositories - shared schema and data types, table
 abstraction via Apache HCatalog (incubating) and Apache Hive.

 Self-describing data via Apache Avro.
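
To make "self-describing data" concrete, here is a minimal sketch in which the schema travels with the records, so a reader needs no out-of-band metadata. It uses only Python's standard library and is NOT Avro's binary container format or API. The schema is written in Avro's JSON schema style, but the `Click` record, its fields, and the `write_container` / `read_container` helpers are hypothetical names chosen for this example.

```python
import io
import json

# Hypothetical record schema, written in the style of an Avro JSON schema.
CLICK_SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}


def write_container(stream, schema, records):
    """Write a header line holding the schema, then one line per record."""
    stream.write(json.dumps({"schema": schema}) + "\n")
    names = [field["name"] for field in schema["fields"]]
    for record in records:
        # Records are stored as bare value lists; field names live in the header.
        stream.write(json.dumps([record[name] for name in names]) + "\n")


def read_container(stream):
    """Recover the schema from the header and rebuild records from it."""
    schema = json.loads(stream.readline())["schema"]
    names = [field["name"] for field in schema["fields"]]
    records = [dict(zip(names, json.loads(line))) for line in stream if line.strip()]
    return schema, records


if __name__ == "__main__":
    buf = io.StringIO()
    write_container(buf, CLICK_SCHEMA, [{"user": "u1", "ts": 42}])
    buf.seek(0)
    print(read_container(buf))
```

Because the reader rebuilds field names from the embedded header, different tools can consume the same file without agreeing on a schema in advance, which is the interoperability gap the slide points at.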




Ongoing work: Apache Bigtop

 Dedicated to Hadoop stack integration and testing.

 Integration - between projects, dependencies, hosts.

 Testing - interoperability, multi-component use cases.

 100% Apache projects, using upstream releases.
 Participants across the ecosystem - join us!
 http://incubator.apache.org/bigtop




Technical trends - software

 • Moving more forms of computation to
   Hadoop storage
 • Frameworks to make HBase more
   application and developer friendly
 • Taking advantage of pluggability to provide
   more optimized formats, schedulers,
   codecs, etc
 • More granular security models
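
The pluggability point can be sketched as a small codec registry: storage code looks a codec up by name, so a more optimized codec can be added without changing the read and write paths. This is an illustration only and not Hadoop's codec interface; the `CODECS` registry, `register_codec`, `write_block`, and `read_block` are hypothetical names, and the block layout (codec name, newline, payload) is an assumption made for the example.

```python
import bz2
import zlib

# Hypothetical registry: codec name -> (compress, decompress) callables.
CODECS = {}


def register_codec(name, compress, decompress):
    """Plug a codec in under a name; callers never reference it directly."""
    CODECS[name] = (compress, decompress)


register_codec("zlib", zlib.compress, zlib.decompress)
register_codec("bz2", bz2.compress, bz2.decompress)


def write_block(data: bytes, codec: str) -> bytes:
    """Compress data and prefix the block with the codec name."""
    compress, _ = CODECS[codec]
    return codec.encode("ascii") + b"\n" + compress(data)


def read_block(block: bytes) -> bytes:
    """Read the codec name from the block header and decompress the payload."""
    name, payload = block.split(b"\n", 1)
    _, decompress = CODECS[name.decode("ascii")]
    return decompress(payload)


if __name__ == "__main__":
    original = b"hadoop " * 100
    for name in CODECS:
        block = write_block(original, name)
        print(name, len(block), read_block(block) == original)
```

Recording the codec name in the block lets readers handle files written with any registered codec, which is what makes swapping in a new one low-risk.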


Technical trends - hardware

 • Increasingly powerful hosts
   – # cores and memory
   – Network - 10/40 GigE
   – Storage - 48/60 TB hosts. Flash.
 • Cloud - multi-tenancy and virtualization
 • Low power CPUs




Enable future use cases pt 1

 More valuable data
 • Cost = gravity. Data flows downhill to the cheapest store.
 • High-value data is not just generated but also consumed by
   the platform, i.e. more processing is done within the system
   before the data leaves it.

 Richer end user applications
 • Apps built directly on the platform (eBay’s Cassini,
   Facebook Messages, etc.)
 • Web 3.0 – data-centric apps. Apps move over common
   data sources rather than being tightly coupled to their data.



Enable future use cases pt 2

 Lower latency / higher interactivity

 • Low-latency response times for applications
 • Interactive - human-driven, correlated access, e.g. analytics
 • Low-latency query execution and in-memory datasets
 • Resource management - batch and interactive workloads



Enable future use cases pt 3

 Hadoop meets ILM

 Policy - access control, standard management interfaces, SLAs,
 MDM, etc.

 Operation - disaster recovery, archive, etc.

 Traditional features - availability, snapshots, mirroring,
 ACLs, integration via standard protocols.




Things to look forward to


     Stack layers (top to bottom):
        Web | Shell / CLI | Drivers
        Languages | Libraries | Workflow | Scheduling
        MapReduce | Stream | Graph | MPI | Other
        Resource Management
        Metadata storage
        Time Series | ORM | OLAP | OLTP
        Record storage
        File storage
        Coordination
        Formats | RPC | Authentication | Compression

     Integration:
        Files
        RDBMS
        Logs & events




Getting crowded…
     Stack layers (top to bottom):
        Web | Shell / CLI | Drivers
        Languages | Libraries | Workflow | Scheduling
        MapReduce | Stream | Graph | MPI | Other
        Resource Management
        Metadata storage
        Time Series | ORM | OLAP | OLTP
        Record storage
        File storage
        Coordination
        Formats | RPC | Authentication | Compression

     Integration:
        Files
        RDBMS
        Logs & events

     Projects labeled on the diagram:
        Hue
        Apache Pig, Hive, Mahout
        Apache S4, Storm
        X-Rime, Giraph
        Apache Oozie
        JDBC / ODBC
        Apache Hadoop
        Apache Sqoop
        Apache Hive, HCatalog
        OpenTSDB
        Apache HBase
        Apache Flume
        Apache Zookeeper
        Apache Avro
        Apache Gora
        Omid


We appreciate your time and interest in

For Additional Information:

   +1 (888) 789-1488
   sales@cloudera.com
   cloudera.com
   twitter.com/cloudera
   facebook.com/cloudera





Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Collins & Charles Zedlewski, Cloudera
