Apache Hadoop Now Next and Beyond



With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business intelligence. Hadoop is uniquely capable of storing, aggregating, and refining multi-structured data sources into formats that fuel new business insights. Apache Hadoop is fast becoming the de facto platform for processing Big Data. Hadoop started from a relatively humble beginning as a point solution for small search systems. Its growth into an important technology for the broader enterprise community dates back to Yahoo’s 2006 decision to evolve Hadoop into a system for solving its internet-scale big data problems. Eric will discuss the current state of Hadoop and what is coming from a development standpoint as Hadoop evolves to meet more workloads.

  • HCatalog – metadata shared across whole platform
      – File locations become abstract (not hard-coded)
      – Data types become shared (not redefined per tool)
      – Partitioning and HDFS-optimized
  • Job Diagnostics: Visualize and troubleshoot Hadoop job execution and performance. Cluster History: View historical job execution & performance. Instant Insight: View health of Core Hadoop (HDFS, MapReduce) and related projects. Cluster Navigation: “Quick link” buttons jump into the namenode web UI for a server. REST interface: Provides external access to Ambari for existing tools and facilitates integration with Microsoft System Center and Teradata Viewpoint.
  • Hortonworks Sandbox: Hortonworks accelerates Hadoop skills development with an easy-to-use, flexible and extensible platform to learn, evaluate and use Apache Hadoop
      – What it is: a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform. Provides demos, videos and step-by-step hands-on tutorials, plus pre-built partner integrations and access to datasets
      – What it does: dramatically accelerates the process of learning Apache Hadoop. See It: demos and videos to illustrate use cases. Learn It: multi-level step-by-step tutorials. Do It: hands-on exercises for faster skills development
      – How it helps: accelerates and validates the use of Hadoop within your unique data architecture. Use your data to explore and investigate your use cases. Zero to big data in 15 minutes
  • Community developed frameworks
      – Machine learning / Analytics (MPI, GraphLab, Giraph, Hama, Spark, …)
      – Services inside Hadoop (memcache, HBase, Storm…)
      – Low latency computing (CEP or stream processing)
  • Tez Approved as New Apache Incubator Project. Hortonworks Introduces Next-Generation Runtime for Improving Latency and Throughput of Hadoop Apps
  • Buzz about low latency access in Hadoop
  • Hortonworks Unveils Stinger Initiative to Make Apache Hive 100X Faster for Interactive Queries. Hortonworks is leading the effort with a group of community contributors focusing on enhancing Apache Hive, the de facto standard for SQL access to Hadoop
      – Enterprise Reports: your cell phone bill is an example
      – Dashboard: KPI tracking
      – Parameterized Reports: what are the hot prospects in my region?
      – Visualization: visual exploration of data
      – Data Mining: large-scale data processing and extraction, usually fed to other tools
      – How? Improve latency & throughput: query engine improvements, new “Optimized RCFile” column store, next-gen runtime (eliminates M/R latency). Extend deep analytical ability: analytics functions, improved SQL coverage, continued focus on core Hive use cases
  • Operators can firewall the cluster without end user access to a “gateway node”. Users see one cluster end-point that aggregates capabilities for data access, metadata and job control. Provide perimeter security to make Hadoop security setup easier. Enable integration with enterprise and cloud identity management environments. Verification: verify identity token; SAML, propagation of identity. Authentication: establish identity at the Gateway to authenticate with LDAP + AD.
  • Hadoop 2.0 represents the next generation of the foundation of big data. Under development for nearly three years now, it is a more mature version of Hadoop that has been architected for broader use by the general enterprise. The main focus for this next generation has been the broader enterprise, which has very explicit requirements that are a little different from those of the typical web properties that first adopted Hadoop. Some of these requirements led the community to rethink its approach. Plus, our experience running Hadoop at Yahoo provided much insight into how we could architect things to make them better. Some of the critical features are listed here. Go through them. Highlight workloads and explain how 2.0 is engineered to meet these exacting demands. There is a graphic to help illustrate. We have moved beyond just batch…
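
The Ambari note above mentions a REST interface that gives existing tools external access to Ambari. As a rough illustration only, the sketch below builds (but does not send) a read-only request for the list of clusters. The `/api/v1/clusters` path and port 8080 are assumptions based on commonly documented Ambari defaults and should be checked against the Ambari version in use; the host name, credentials, and the helper name `build_ambari_request` are hypothetical.

```python
import base64
import urllib.request


def build_ambari_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for Ambari's cluster list.

    Assumptions: the API root is /api/v1 and the server accepts HTTP
    basic authentication. Verify both against your Ambari version.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP basic auth: base64 of "user:password"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder host and credentials; nothing is sent over the network.
    req = build_ambari_request("ambari.example.com", "admin", "admin")
    print(req.get_method(), req.full_url)
```

An external monitoring tool would pass a request like this to `urllib.request.urlopen` and parse the JSON response; that step is omitted here so the sketch runs without a live Ambari server.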

    1. Hadoop Now, Next & Beyond: Community Driven Enterprise Apache Hadoop. Eric Baldeschwieler (“Eric14”), Hortonworks CTO, @jeric14. © Hortonworks Inc. 2013
    2. Quick History: Hadoop at Yahoo! Source: http://developer.yahoo.com/blogs/ydn/posts/2013/02/hadoop-at-yahoo-more-than-ever-before/
    3. Hortonworks Approach to Enterprise Hadoop: Community Driven Enterprise Apache Hadoop
        • Identify and introduce enterprise requirements into the public domain
        • Work with the community to advance and incubate open source projects
        • Apply Enterprise Rigor to provide the most stable and reliable distribution
    4. Making Hadoop Enterprise-Ready: HORTONWORKS DATA PLATFORM (HDP)
        • OPERATIONAL SERVICES (Manage & Operate at Scale): AMBARI, OOZIE
        • DATA SERVICES (Store, Process and Access Data): FLUME, HIVE, PIG, HBASE, SQOOP, HCATALOG
        • HADOOP CORE (Distributed Storage & Processing): MAP REDUCE, HDFS
        • PLATFORM SERVICES (Enterprise Readiness): HA, DR, Snapshots, Security, …
        • Runs on: OS / VM, Cloud, Appliance
    5. HCatalog: Table-level Abstractions
        • Consistency of data models across tools (MapReduce, Pig, HBase and Hive)
        • Accessibility: share data as tables inside and out of Hadoop
        • HCatalog: shared table and schema management opens the platform
            – Without: raw Hadoop data; inconsistent, unknown; tool specific access
            – With: table access; consistent schema; REST API
    6. Ambari: Provision > Manage > Monitor. A framework for operating Hadoop, with APIs for integration. Diagram labels: Provision, Manage, Monitor, Integrate; Ambari; Hadoop Cluster
    7. Ambari: Latest Highlights (screenshot: Apache Ambari Dashboard)
        • Job Diagnostics
        • Cluster History
        • Instant Insight
        • Cluster Navigation
        • REST interface
    8. See Hadoop > Learn Hadoop > Do Hadoop
        • Hands-on, step-by-step tutorials to learn
        • Full environment to evaluate Hadoop
    9. Hadoop 2.0 Innovations - YARN
        • Focus on scale and innovation
            – Support 10,000+ computer clusters
            – Extensible to encourage innovation
        • Next generation execution
            – Improves MapReduce performance
        • Supports new frameworks beyond MapReduce
            – Low latency, Streaming, Services
            – Do more with a single Hadoop cluster
        • Diagram: MapReduce, Tez, Graph Processing and Other frameworks run on YARN (Cluster Resource Management), which runs on HDFS (Redundant, Reliable Storage)
    10. Tez on YARN: Going Beyond Batch
        • Tez Optimizes Execution: new runtime engine for more efficient data processing
        • Always-On Tez Service: low latency processing for all Hadoop data processing
        • Diagram label: Tez Task
    11. Stinger Initiative
        • Community initiative around Hive
        • Enables Hive to support interactive workloads
        • Enhances Hive’s standard SQL interface for Hadoop
        • Improves existing tools & preserves investments
        • Execution Engine (Tez) + Query Planner (Hive) + File Format (ORC file) = 100X
    12. Stinger: Make Hive Fly For All BI Needs
        • Use cases, from Interactive to Batch: Parameterized Reports, Enterprise Reports, Dashboard / Scorecard, Visualization, Data Mining
        • More SQL + 100X Faster
    13. Knox: Make Hadoop Security Simple. Diagram: a Client calls the Knox Gateway over {REST}; the gateway fronts the Hadoop Cluster and performs Authentication & Verification against a User Store (KDC, AD, LDAP)
    14. Hortonworks Data Platform 2.0 Alpha 2. Available today: http://Hortonworks.com/products
        • Key New Features
            – Hive performance
            – First distribution to include Tez
        • From HDP 2.0 Alpha
            – Yarn
            – Full Stack HA
            – Snapshots
            – Disaster Recovery
            – Rolling Upgrades
        • Business Value: Single Platform, Multiple Use (BATCH, INTERACTIVE, ONLINE) over Big Data Transactions, Interactions, Observations
    15. Falcon: Data Lifecycle Management
        • New Apache Incubator Project
        • Introduced by InMobi, Hortonworks and Yahoo!
        • Data Lifecycle Management Framework for Hadoop
        • Configure and Manage Workflows & Policies for:
            – Data Movement
            – Disaster Recovery
            – Data Retention
    16. Join the Community & Get Involved!
        • INNOVATE
        • INTEGRATE
        • COMMUNICATE
        • Diagram labels: Open Source, Vendors, End Users
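
The Knox slide describes clients reaching the cluster through a single REST end-point at the gateway. A minimal sketch of what that looks like from the client side, assuming the URL layout `https://<gateway-host>:<port>/<gateway-path>/<topology>/webhdfs/v1/<path>` used in later Apache Knox documentation; the deck itself does not state this layout, and the default port 8443, the gateway path `gateway`, the topology name, and the helper name `knox_webhdfs_url` are all assumptions made for illustration.

```python
from urllib.parse import quote, urlencode


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op,
                     port=8443, gateway_path="gateway"):
    """Compose a WebHDFS URL that goes through a Knox-style gateway.

    The client only ever sees the gateway address; the NameNode and
    DataNode hosts stay hidden behind the firewall.
    """
    # Percent-encode the HDFS path but keep "/" separators intact.
    encoded_path = quote(hdfs_path.lstrip("/"))
    query = urlencode({"op": op})
    return (f"https://{gateway_host}:{port}/{gateway_path}/{topology}"
            f"/webhdfs/v1/{encoded_path}?{query}")


if __name__ == "__main__":
    # Placeholder host and topology name.
    print(knox_webhdfs_url("knox.example.com", "default", "/tmp/data", "LISTSTATUS"))
```

Because every service is addressed through the same host and port, an operator can firewall the cluster nodes and expose only the gateway, which is the point the slide and its note make.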