Hadoop Now, Next & Beyond
Community Driven Enterprise Apache Hadoop
Eric Baldeschwieler, “Eric14”
Hortonworks CTO
@jeric14




© Hortonworks Inc. 2013
Quick History: Hadoop at Yahoo!




Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/

Hortonworks Approach to Enterprise Hadoop

Community Driven Enterprise Apache Hadoop
 • Identify and introduce enterprise requirements into the public domain
 • Work with the community to advance and incubate open source projects
 • Apply Enterprise Rigor to provide the most stable and reliable distribution



Making Hadoop Enterprise-Ready
 • OPERATIONAL SERVICES (Manage & Operate at Scale): AMBARI, OOZIE
 • DATA SERVICES (Store, Process and Access Data): FLUME, SQOOP, HIVE, PIG, HBASE, HCATALOG
 • HADOOP CORE (Distributed Storage & Processing): MAP REDUCE, HDFS
 • PLATFORM SERVICES (Enterprise Readiness): HA, DR, Snapshots, Security, …
 • HORTONWORKS DATA PLATFORM (HDP) on: OS / VM, Cloud, Appliance




HCatalog: Table-level Abstractions

 • Consistency of data models across tools (MapReduce, Pig, HBase and Hive)
 • Accessibility: share data as tables inside and out of Hadoop




 • Raw Hadoop data: inconsistent, unknown; tool specific access
 • HCatalog: table access, consistent schema, REST API
 • Shared table and schema management opens the platform
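
The REST API bullet can be illustrated with a minimal sketch. It assumes the HCatalog REST server (WebHCat, formerly Templeton) is listening on its usual default port 50111 and uses its `ddl/database/.../table/...` resource. The host name, table name and user below are hypothetical, and the sketch only builds the URL; it does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def table_description_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one HCatalog table.

    Assumption: WebHCat is reachable over plain HTTP on `port` and
    identifies the caller through the `user.name` query parameter.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host, table and user, for illustration only.
url = table_description_url("hcat.example.com", "default", "web_logs", "analyst")
print(url)
```

Any tool that can issue an HTTP GET against such a URL would see the same table definition that Pig, Hive and MapReduce see, which is the point of the shared schema.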




Ambari: Provision > Manage > Monitor
 A framework for operating Hadoop…with APIs for integration



 [Diagram: Ambari at the center, with Provision, Manage, Monitor and Integrate around it, operating a Hadoop Cluster]




Ambari: Latest Highlights
 • Job Diagnostics
 • Cluster History
 • Instant Insight
 • Cluster Navigation
 • REST interface

 [Screenshot: Apache Ambari Dashboard]
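
As a rough illustration of the REST interface, the sketch below prepares (but does not send) a request for the list of clusters known to an Ambari server. It assumes the server exposes its API under `/api/v1` on port 8080 with HTTP basic authentication; the host name and credentials are placeholders, not values from this deck.

```python
import base64
import urllib.request


def ambari_request(base_url, path, user, password):
    """Prepare a GET request against the Ambari REST API.

    Assumption: the API root is <base_url>/api/v1 and the server
    accepts HTTP basic authentication.
    """
    url = base_url.rstrip("/") + "/api/v1/" + path.lstrip("/")
    token = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")
    ).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder host and credentials, for illustration only.
req = ambari_request("http://ambari.example.com:8080", "clusters", "admin", "admin")
print(req.full_url)
# To actually call the server: urllib.request.urlopen(req).read()
```

An external monitoring tool would poll endpoints like this one and parse the JSON response, rather than scraping the dashboard.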




See Hadoop > Learn Hadoop > Do Hadoop




 • Hands on step-by-step tutorials to learn
 • Full environment to evaluate Hadoop




Hadoop 2.0 Innovations - YARN

• Focus on scale and innovation
   – Support clusters of 10,000+ computers
   – Extensible to encourage innovation
• Next generation execution
   – Improves MapReduce performance
• Supports new frameworks beyond MapReduce
   – Low latency, Streaming, Services
   – Do more with a single Hadoop cluster

[Diagram: MapReduce, Tez, Graph Processing and Other frameworks on top of YARN (Cluster Resource Management), on top of HDFS (Redundant, Reliable Storage)]




Tez on YARN: Going Beyond Batch




 [Diagram: Tez Task]

 • Tez Optimizes Execution: New runtime engine for more efficient data processing
 • Always-On Tez Service: Low latency processing for all Hadoop data processing


Stinger Initiative
• Community initiative around Hive
• Enables Hive to support interactive workloads
• Enhances Hive’s standard SQL interface for Hadoop
• Improves existing tools & preserves investments


 Execution Engine (Tez) + Query Planner (Hive) + File Format (ORC file) = 100X




Stinger: Make Hive Fly For All BI Needs
 • BI needs: Parameterized Reports, Enterprise Reports, Dashboard / Scorecard, Visualization, Data Mining
 • More SQL + 100X Faster
 • Workloads: Interactive, Batch

Knox: Make Hadoop Security Simple

 [Diagram: Client calls {REST} APIs through the Knox Gateway, which performs Authentication & Verification against a User Store (KDC, AD, LDAP) in front of the Hadoop Cluster]
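
To make the single-endpoint idea concrete, here is a minimal sketch of how a client might address WebHDFS through the gateway instead of contacting cluster nodes directly. It assumes the URL layout documented for later Apache Knox releases (`https://<host>:8443/gateway/<topology>/webhdfs/v1/...`); the early 2013 builds may have differed, and the host and topology names are hypothetical.

```python
from urllib.parse import urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op,
                     port=8443, gateway_path="gateway"):
    """Build a WebHDFS URL that goes through a Knox gateway.

    Assumption: the gateway serves WebHDFS for a cluster under
    /<gateway_path>/<topology>/webhdfs/v1/ over HTTPS.
    """
    clean_path = hdfs_path.lstrip("/")
    base = "https://{}:{}/{}/{}/webhdfs/v1/{}".format(
        gateway_host, port, gateway_path, topology, clean_path
    )
    return base + "?" + urlencode({"op": op})


# Hypothetical gateway host and topology name, for illustration only.
listing_url = knox_webhdfs_url("knox.example.com", "sandbox", "/tmp", "LISTSTATUS")
print(listing_url)
```

The client only ever needs to know the gateway address, so operators can keep the cluster hosts behind a firewall.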




Hortonworks Data Platform 2.0 Alpha 2

Key New Features
 –    Hive performance
 –    First distribution to include Tez

From HDP 2.0 Alpha
 –    YARN
 –    Full Stack HA
 –    Snapshots
 –    Disaster Recovery
 –    Rolling Upgrades

Business Value
 –    Single Platform Multiple Use: BATCH, INTERACTIVE, ONLINE
 –    Big Data: Transactions, Interactions, Observations

Available today: http://Hortonworks.com/products

Falcon: Data Lifecycle Management
• New Apache Incubator Project

• Introduced by InMobi, Hortonworks and Yahoo!

• Data Lifecycle Management Framework for Hadoop

• Configure and Manage Workflows & Policies for:

  – Data Movement
  – Disaster Recovery
  – Data Retention



Join the Community & Get Involved!


 • INNOVATE
 • INTEGRATE
 • COMMUNICATE

 [Diagram: Open Source, Vendors, End Users]






Editor's Notes

  • #6 HCatalog – metadata shared across whole platform
    – File locations become abstract (not hard-coded)
    – Data types become shared (not redefined per tool)
    – Partitioning and HDFS-optimized
  • #8
    – Job Diagnostics: Visualize and troubleshoot Hadoop job execution and performance
    – Cluster History: View historical job execution & performance
    – Instant Insight: View health of Core Hadoop (HDFS, MapReduce) and related projects
    – Cluster Navigation: “Quick link” buttons jump into namenode web UI for a server
    – REST interface: Provides external access to Ambari for existing tools. Facilitates integration with Microsoft System Center and Teradata Viewpoint
  • #9 Hortonworks Sandbox: Hortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache Hadoop
    – What is it: Virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform. Provides demos, videos and step-by-step hands-on tutorials. Pre-built partner integrations and access to datasets
    – What it does: Dramatically accelerates the process of learning Apache Hadoop. See It: demos and videos to illustrate use cases. Learn It: multi-level step-by-step tutorials. Do It: hands-on exercises for faster skills development
    – How it helps: Accelerates and validates the use of Hadoop within your unique data architecture. Use your data to explore and investigate your use cases. ZERO to big data in 15 minutes
  • #10 Community developed frameworks
    – Machine learning / Analytics (MPI, GraphLab, Giraph, Hama, Spark, …)
    – Services inside Hadoop (memcache, HBase, Storm…)
    – Low latency computing (CEP or stream processing)
  • #11 Tez Approved as New Apache Incubator Project. Hortonworks Introduces Next-Generation Runtime for Improving Latency and Throughput of Hadoop Apps
  • #12 Buzz about low latency access in Hadoop
  • #13 Hortonworks Unveils Stinger Initiative to Make Apache Hive 100X Faster for Interactive Queries. Hortonworks is leading the effort with a group of community contributors focusing on enhancing Apache Hive, the de facto standard for SQL access to Hadoop
    – Enterprise Reports: Your cell phone bill is an example
    – Dashboard: KPI tracking
    – Parameterized Reports: What are the hot prospects in my region?
    – Visualization: Visual exploration of data
    – Data Mining: Large scale data processing and extraction usually fed to other tools
    – How? Improve Latency & Throughput: query engine improvements; new “Optimized RCFile” column store; next-gen runtime (elim’s M/R latency)
    – Extend Deep Analytical Ability: analytics functions; improved SQL coverage; continued focus on core Hive use cases
  • #14
    – Operators can firewall cluster without end user access to “gateway node”
    – Users see one cluster end-point that aggregates capabilities for data access, metadata and job control
    – Provide perimeter security to make Hadoop security setup easier
    – Enable integration with enterprise and cloud identity management environments
    – Verification: Verify identity token; SAML, propagation of identity
    – Authentication: Establish identity at Gateway to authenticate with LDAP + AD
  • #15 Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop that has been architected for broader use by more general enterprises. The main focus for this next generation has been the broader enterprise. They have very explicit requirements that are a little bit different than those of the typical web properties who first adopted Hadoop. Some of the requirements required the community to rethink the approach. Plus, our experience running Hadoop at Yahoo! provided much insight into how we could architect things to make them better. Some of the critical features are listed here. Go through them. Highlight workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…