Hackathon Bonn

Published in: Technology
  • Hello. Today I’m going to talk to you about Hortonworks (HW) and how we deliver enterprise-ready Hadoop to enable your modern data architecture.
  • Founded just 2.5 years ago by members of the original Hadoop team at Yahoo.

    Hortonworks emerged as the leader in open source Hadoop.

    We are committed to ensuring Hadoop is an enterprise-viable data platform, ready for your modern data architecture.

    Our team is probably the largest assembled team of Hadoop experts and active leaders in the community.

    We not only make sure Hadoop meets all your enterprise requirements, like
    operations, reliability & security.

    It also needs to be
    packaged & tested, and we do this.

    It has to work with what you have.


    Make Hadoop an enterprise data platform. Make the market function.
    Innovate core platform, data, & operational services
    Integrate deeply with enterprise ecosystem
    Provide world-class enterprise support

    Drive 100% open source software development and releases through the core Apache projects
    Address enterprise needs in community projects
    Establish Apache foundation projects as “the standard”
    Promote open community vs. vendor control / lock-in

    Enable the Hadoop market to function
    Make it easy for enterprises to deploy at scale
    Be the best at enabling deep ecosystem integration
    Create a pull market with key strategic partners

  • Tez Approved as New Apache Incubator Project: Hortonworks Introduces Next-Generation Runtime for Improving Latency and Throughput of Hadoop Apps
  • Hackathon Bonn

    1. Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by delivering Enterprise Apache Hadoop. YARN, Tez, Stinger. June 2014
    2. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
       Our Commitment:
       • Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
       • Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
       • Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
       Headquarters: Palo Alto, CA
       Employees: 300+ and growing
       Trusted Partners
    3. Driving Our Innovation Through Apache
       Hortonworks mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies
       Total Net Lines Contributed to Apache Hadoop (chart values as extracted): 147,933 lines, 614,041 lines, End Users, 449,768 lines
       Total Number of Committers to Apache Hadoop (63 total; chart values as extracted): Yahoo: 10, Cloudera: 7, IBM: 3, 10 Others, 21, Facebook: 5, LinkedIn: 3

       Apache Project   Committers   PMC Members
       Hadoop           21           13
       Tez              10           4
       Hive             11           3
       HBase            8            3
       Pig              6            5
       Sqoop            1            0
       Ambari           20           12
       Knox             6            2
       Falcon           2            2
       Oozie            2            2
       ZooKeeper        2            1
       Flume            1            0
       Accumulo         2            2
       Storm            1            0
       Drill            1            0
       TOTAL            95           48
    4. Broad Ecosystem Integration
       Diagram layers: APPLICATIONS, DATA SYSTEM, SOURCES
       Diagram labels: RDBMS, EDW, MPP, HANA, BusinessObjects BI, OPERATIONAL TOOLS, DEV & DATA TOOLS, INFRASTRUCTURE
       Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
    5. Relying on Hortonworks… (UDA Diagram)
       Teradata Portfolio for Hadoop
       • Seamless data access between Teradata and Hadoop (SQL-H)
       • Simple management & monitoring with Viewpoint integration
       • Flexible deployment options
       HDInsight & HDP for Windows
       • Only Hadoop Distribution for Windows Azure & Windows Server
       • Native integration with SQL Server, Excel, and System Center
       • Extends Hadoop to .NET community
       Complete Portfolio for Hadoop Appliances
       Instant Access + Infinite Scale
       • SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
       • Enables analytics apps (BOBJ) to interact with Hadoop
    6. HDP 2.1: Enterprise Hadoop Platform
       Hortonworks Data Platform (HDP)
       • The ONLY 100% open source and most current platform
       • Integrates full range of enterprise-ready services
       • Certified and tested at scale
       • Engineered for deep ecosystem interoperability
       Diagram: HORTONWORKS DATA PLATFORM (HDP), deployable on OS/VM, Cloud, Appliance
       • CORE SERVICES: HDFS, YARN, MapReduce, Tez
       • DATA SERVICES: Hive & HCatalog, Pig, HBase; Load & Extract: Sqoop, Flume, NFS, WebHDFS
       • OPERATIONAL SERVICES: Oozie, Ambari, Falcon*, Knox*
       • Function labels: Schedule, Storage, Resource Management, Process, Data Movement, Cluster Mgmnt, Dataset Mgmnt, Data Access
       • Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
    7. Our Vision: Hadoop as Next-Gen Platform
       HADOOP 1.0: Single Use System (Batch Apps)
       • HDFS (redundant, reliable storage)
       • MapReduce (cluster resource management & data processing)
       HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
       • HDFS2 (redundant, highly-available & reliable storage)
       • YARN (cluster resource management)
       • MapReduce (data processing), Others
    8. The 1st Generation of Hadoop: Batch
       HADOOP 1.0: Built for Web-Scale Batch Apps
       • All other usage patterns must leverage that same infrastructure
       • Forces the creation of silos for managing mixed workloads
       Diagram: separate single-app stacks on HDFS (Single App BATCH, Single App INTERACTIVE, Single App ONLINE)
    9. Hadoop MapReduce Classic
       • JobTracker
         – Manages cluster resources and job scheduling
       • TaskTracker
         – Per-node agent
         – Manages tasks
    10. YARN: Taking Hadoop Beyond Batch
        Applications Run Natively in Hadoop
        • BATCH (MapReduce)
        • INTERACTIVE (Tez)
        • STREAMING (Storm, S4,…)
        • GRAPH (Giraph)
        • IN-MEMORY (Spark)
        • HPC MPI (OpenMPI)
        • ONLINE (HBase)
        • OTHER (Search) (Weave…)
        YARN (Cluster Resource Management)
        HDFS2 (Redundant, Reliable Storage)
        Store ALL DATA in one place… Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
    11. 5 Key Benefits of YARN
        1. Scale
        2. New Programming Models & Services
        3. Improved cluster utilization
        4. Agility
        5. Beyond Java
    12. Concepts
        • Application
          – Application is a temporal job or a service submitted to YARN
          – Examples
            – MapReduce job (job)
            – HBase cluster (service)
        • Container
          – Basic unit of allocation
          – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
            – container_0 = 2GB, 1 CPU
            – container_1 = 1GB, 6 CPU
          – Replaces the fixed map/reduce slots
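
The container concept on slide 12 can be sketched in a few lines of Python. This is a toy model, not YARN's API: the `Resource` and `Node` classes, the node size (4 GB, 8 cores) and the third request are assumptions made for illustration. Only memory and CPU are modelled, and the first two requests mirror `container_0` and `container_1` from the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """A resource vector; only memory and CPU are modelled in this sketch."""
    memory_mb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if every dimension fits.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


@dataclass
class Node:
    """Hypothetical worker node that tracks its unallocated capacity."""
    name: str
    free: Resource
    containers: list = field(default_factory=list)

    def allocate(self, container_id: str, request: Resource) -> bool:
        """Grant the container if it fits into what is still free."""
        if not request.fits_in(self.free):
            return False
        self.free = self.free.minus(request)
        self.containers.append((container_id, request))
        return True


# Assumed node size: 4 GB of memory and 8 cores.
node = Node("node-1", Resource(memory_mb=4096, vcores=8))
ok0 = node.allocate("container_0", Resource(memory_mb=2048, vcores=1))
ok1 = node.allocate("container_1", Resource(memory_mb=1024, vcores=6))
# A third 2 GB request is refused because only 1 GB is left.
ok2 = node.allocate("container_2", Resource(memory_mb=2048, vcores=1))
print(ok0, ok1, ok2, node.free)
```

Unlike fixed map/reduce slots, the two granted containers have different shapes, and the node hands out whatever combination still fits.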
    13. Design Centre
        • Split up the two major functions of JobTracker
          – Cluster resource management
          – Application life-cycle management
        • MapReduce becomes user-land library
    14. YARN Applications
        • Data processing applications and services
          – Online Serving – HOYA (HBase on YARN)
          – Real-time event processing – Storm, S4, other commercial platforms
          – Interactive SQL – Tez (Generalization of MR)
          – Machine Learning – MPI (OpenMPI, MPICH2)
          – In-Memory: Spark
          – Graph processing: Giraph
          – Enabled by allowing the use of paradigm-specific application master
        Run all on the same Hadoop cluster!
    15. YARN as OS for Data Lake (© Hortonworks Inc. 2012)
        Diagram: a ResourceManager (Scheduler) over a grid of NodeManagers running containers for three workloads
        • Batch: map1.1, map1.2, reduce1.1
        • Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
        • Real-Time: nimbus0, nimbus1, nimbus2
    16. Multi-Tenant YARN
        Diagram: ResourceManager Scheduler with a hierarchical queue tree under root
        • Adhoc 10%, DW 60%, Mrkting 30%
        • Dev 10%, Reserved 20%, Prod 70%
        • Prod 80%, Dev 20%
        • P0 70%, P1 30%
    17. Multi-Tenancy with CapacityScheduler
        • Queues
        • Economics as queue-capacity
          – Hierarchical Queues
        • SLAs
          – Preemption
        • Resource Isolation
          – Linux: cgroups
          – MS Windows: Job Control
          – Roadmap: Virtualization (Xen, KVM)
        • Administration
          – Queue ACLs
          – Run-time re-configuration for queues
          – Charge-back
        Diagram (Capacity Scheduler, Hierarchical Queues): ResourceManager Scheduler with a queue tree under root
        • Adhoc 10%, DW 70%, Mrkting 20%
        • Dev 10%, Reserved 20%, Prod 70%
        • Prod 80%, Dev 20%
        • P0 70%, P1 30%
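
With hierarchical queues, each percentage is a share of the parent queue, so a queue's share of the whole cluster is the product of the percentages along its path from root. The sketch below works that out for the numbers on slide 17. The flattened diagram does not show which parent each row hangs under, so the tree used here (Dev/Reserved/Prod under DW, Prod/Dev under Mrkting, P0/P1 under DW's Prod) is an assumption, and `absolute_capacities` is a hypothetical helper, not part of the CapacityScheduler.

```python
def absolute_capacities(tree, parent_share=1.0, prefix="root"):
    """Return {queue_path: fraction_of_cluster} for every queue in the tree.

    `tree` maps a queue name to (percent_of_parent, children_dict).
    """
    result = {}
    for name, (percent, children) in tree.items():
        path = f"{prefix}.{name}"
        share = parent_share * percent / 100.0
        result[path] = share
        result.update(absolute_capacities(children, share, path))
    return result


# Assumed parent/child layout for the percentages shown on the slide.
queues = {
    "Adhoc": (10, {}),
    "DW": (70, {
        "Dev": (10, {}),
        "Reserved": (20, {}),
        "Prod": (70, {"P0": (70, {}), "P1": (30, {})}),
    }),
    "Mrkting": (20, {
        "Prod": (80, {}),
        "Dev": (20, {}),
    }),
}

caps = absolute_capacities(queues)

# Leaf queues are the ones no other queue path extends.
leaves = [p for p in caps if not any(q.startswith(p + ".") for q in caps)]
leaf_total = sum(caps[p] for p in leaves)

for path in sorted(caps):
    print(f"{path}: {caps[path]:.1%} of the cluster")
```

Under this assumed layout, the leaf queues together account for the whole cluster, which is the property a charge-back report would rely on.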
    18. Tez (“Speed”)
        • What is it?
          – A data processing framework as an alternative to MapReduce
          – A new incubation project in the ASF
        • Who else is involved?
          – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
        • Why does it matter?
          – Widens the platform for Hadoop use cases
          – Crucial to improving the performance of low-latency applications
          – Core to the Stinger initiative
          – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
    19. Moving Hadoop Beyond MapReduce
        • Low level data-processing execution engine
        • Built on YARN
        • Enables pipelining of jobs
        • Removes task and job launch times
        • Does not write intermediate output to HDFS
          – Much lighter disk and network usage
        • New base of MapReduce, Hive, Pig, Cascading etc.
        • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
    20. Tez - Core Idea
        • Task with pluggable Input, Processor & Output
          – Tez Task - <Input, Processor, Output>
        • YARN ApplicationMaster to run DAG of Tez Tasks
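
The idea of a task as an <Input, Processor, Output> triple, with a coordinator running a DAG of such tasks, can be sketched as follows. This is a single-process illustration and does not use the Tez API: `Task`, `run_dag` and the word-count example are hypothetical names, and the loop stands in for what an ApplicationMaster would schedule across containers. Results are handed from task to task in memory, echoing the point that intermediate output is not written to HDFS.

```python
from collections import Counter
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Iterable


@dataclass
class Task:
    """A task as an <Input, Processor, Output> triple of plain callables."""
    read: Callable[[dict], Iterable]         # Input: pulls records, e.g. from upstream results
    process: Callable[[Iterable], Iterable]  # Processor: transforms the records
    write: Callable[[Iterable], object]      # Output: materialises the result


def run_dag(tasks: dict, upstream: dict) -> dict:
    """Run every task after its upstream tasks, passing results in memory."""
    results = {}
    for name in TopologicalSorter(upstream).static_order():
        task = tasks[name]
        results[name] = task.write(task.process(task.read(results)))
    return results


# Hypothetical two-vertex DAG: tokenize -> count.
tasks = {
    "tokenize": Task(
        read=lambda results: ["a b", "b c b"],
        process=lambda lines: (word for line in lines for word in line.split()),
        write=list,
    ),
    "count": Task(
        read=lambda results: results["tokenize"],
        process=lambda words: Counter(words).items(),
        write=dict,
    ),
}
upstream = {"tokenize": set(), "count": {"tokenize"}}

results = run_dag(tasks, upstream)
print(results["count"])
```

Swapping a task's `read` or `write` callable changes where data comes from or goes to without touching the processor, which is the pluggability the slide describes.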
    21. Building Blocks for Tasks
        • MapReduce ‘Map’ Task: HDFS Input -> Map Processor -> Sorted Output
        • MapReduce ‘Reduce’ Task: Shuffle Input -> Reduce Processor -> HDFS Output
        • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
        • Special Pig/Hive ‘Map’ (Tez Task): HDFS Input -> Map Processor -> Pipeline Sorter Output
        • Special Pig/Hive ‘Reduce’ (Tez Task): Shuffle Skip-merge Input -> Reduce Processor -> Sorted Output
        • In-memory Map (Tez Task): HDFS Input -> Map Processor -> In-memory Sorted Output
    22. Pig/Hive-MR versus Pig/Hive-Tez
        SELECT a.state, COUNT(*), AVG(c.price)
        FROM a
        JOIN b ON (a.id = b.id)
        JOIN c ON (a.itemId = c.itemId)
        GROUP BY a.state
        • Pig/Hive - MR: Job 1, Job 2, Job 3, separated by I/O synchronization barriers
        • Pig/Hive - Tez: a single job
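
To make the query on slide 22 concrete, the sketch below runs it on a few made-up rows using SQLite from the Python standard library. This only illustrates what the two joins and the aggregation compute; it says nothing about how Hive plans the query on MapReduce or Tez. The table contents are invented, the average is written as `AVG`, and an `ORDER BY` is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, itemId INTEGER, state TEXT);
CREATE TABLE b (id INTEGER);
CREATE TABLE c (itemId INTEGER, price REAL);

-- Invented sample data.
INSERT INTO a VALUES (1, 10, 'CA'), (2, 11, 'CA'), (3, 10, 'NY'), (4, 12, 'NY');
INSERT INTO b VALUES (1), (2), (3);
INSERT INTO c VALUES (10, 4.0), (11, 6.0), (12, 9.0);
""")

rows = conn.execute("""
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
ORDER BY a.state
""").fetchall()

for state, n, avg_price in rows:
    print(state, n, avg_price)

conn.close()
```

The query has three logical steps (join, join, group-by), which is where the three MapReduce jobs on the slide come from.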
    23. Tez on YARN: Going Beyond Batch
        • Tez Optimizes Execution: New runtime engine for more efficient data processing
        • Always-On Tez Service: Low latency processing for all Hadoop data processing
    24. SQL-in-Hadoop with Apache Hive
        • Apache Hive is the standard for SQL interaction with Hadoop
          – Enterprise makes final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
          – Most applications claim Hive compatibility TODAY*
        • Stinger Initiative: Simple Focus
          – Performance
          – SQL-Compatibility
          – Scalability
        * Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
        Diagram labels: Business Analytics, Custom Apps, SQL, Hive, Tez, MapReduce, YARN, HDFS, Hadoop
        Improves existing tools & preserves investments
    25. Stinger Project (announced February 2013): Batch AND Interactive SQL-IN-Hadoop
        Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE
        Goals:
        • Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
        • Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
        • SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
        …all IN Hadoop
        Hive 0.11, May 2013:
        • Base Optimizations
        • SQL Analytic Functions
        • ORCFile, Modern File Format
        Hive 0.12, October 2013:
        • VARCHAR, DATE Types
        • ORCFile predicate pushdown
        • Advanced Optimizations
        • Performance Boosts via YARN
        Hive 0.13, April 2014:
        • Hive on Apache Tez
        • Query Service
        • Buffer Cache
        • Cost Based Optimizer (Optiq)
        • Vectorized Processing
    26. Hortonworks: The Value of “Open” for You
        Validate & Try
        1. Download the Hortonworks Sandbox
        2. Learn Hadoop using the technical tutorials
        3. Investigate a business case using the step-by-step business case scenarios
        4. Validate YOUR business case using your data in the sandbox
        Engage
        1. Execute a Business Case Discovery Workshop with our architects
        2. Build a business case for Hadoop today
        Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community
        Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in
        The Partners you Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments
        Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use
        Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience
