Get Started Building YARN Applications

Hortonworks Get Started Building YARN Applications Dec. 2013. We cover YARN basics, benefits, getting started and roadmap. Actian shares their experience and recommendations on building their real-world YARN application.

  • The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop's cluster resource management in a way where MapReduce is now just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for "Yet Another Resource Negotiator". [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years, and this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that has recently been contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  • As Arun mentioned, there are fewer JVMs to spin up per job (1 instead of 3), and RM and NM provisioning is faster. YARN was originally conceived and architected by the team at Yahoo!: Arun Murthy created the original JIRA in 2008 and led the PMC. The team at Hortonworks has been working on YARN for 4 years, with 90% of the code coming from Hortonworks and Yahoo!. The YARN-based architecture runs at scale at Yahoo!, deployed on 35,000 nodes for 6+ months with a multitude of YARN applications. One great public example of YARN in production is at Yahoo!, which outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases: stream processing, iterative processing, and shared storage. With Storm on YARN they stream data into a cluster and execute 5-second analytics windows; this cluster is only 320 nodes, but it is processing 133,000 events per second and executing 12,000 threads. Their shared data cluster uses 1,900 nodes to store 2 PB of data. In all, Yahoo has over 30,000 nodes running YARN across over 365 PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time, and they estimate a 60% – 150% improvement in node usage per day. At this point, over 50,000 Hadoop nodes at Yahoo have been upgraded from Hadoop 1.0 to Hadoop 2, yielding a 50% improvement in cluster utilization and efficiency. This should be a big deal in terms of potential ROI.
  • HA and work preserving restart are being actively worked on by the community (YARN-128 and YARN-149). Scheduler: there have been requests for gang scheduling and meeting SLAs; also TBD is support for scheduling additional resource types, specifically disk and network. Rolling upgrades: upgrading a cluster typically involves downtime, and the NodeManager forgets containers across restarts; the big piece here, which ties in with work preserving restart, is that restarting a NodeManager should not cause processes started by the previous NM to be killed. Long running services: enhancements to log handling, security (specifically token expiry), multiple tasks per container, and container resizing. Finally, additional utility libraries to help app writers, primarily geared towards checkpointing in the AM and application history handling.

Get Started Building YARN Applications: Presentation Transcript

  • Getting Started Writing YARN Applications © Hortonworks Inc. 2013 Page 1
  • Agenda • Overview and Benefits • YARN Basics • Guest Speaker: Actian – Developing a Real World YARN Application • Getting Started • Roadmap © Hortonworks Inc. 2013 - Confidential Page 2
  • Apache Hadoop Release Info • October 15: Apache Hadoop 2.2.0 GA • October 23: Hortonworks Data Platform 2.0, based on Apache Hadoop 2.2.0. "Foundation of next-generation Open Source Big Data Cloud computing platform runs multiple applications simultaneously to enable users to quickly and efficiently leverage data in multiple ways at supercomputing speed" (Apache Software Foundation Blog). "Hadoop 2.0 Makes Big Data Even More Accessible" (ReadWrite.com). "Apache Software Foundation announces general availability of watershed Big Data release" (ZDNet). YARN wins Best Paper Award at SoCC 2013. © Hortonworks Inc. 2013 - Confidential Page 3
  • 1st Generation Hadoop: Batch Focus. HADOOP 1.0 – Built for Web-Scale Batch Apps. (Diagram: single-app MapReduce/HDFS silos for batch, interactive, and online workloads; all other usage patterns MUST leverage the same infrastructure, forcing the creation of silos to manage mixed workloads.) © Hortonworks Inc. 2013 - Confidential Page 4
  • Hadoop 1 Limitations • Lacks Support for Alternate Paradigms and Services – Forces everything to look like MapReduce – Iterative applications in MapReduce are 10x slower • Scalability – Max cluster size ~5,000 nodes – Max concurrent tasks ~40,000 • Availability – Failure kills queued & running jobs • Hard partition of resources into map and reduce slots – Non-optimal resource utilization © Hortonworks Inc. 2013 - Confidential Page 5
  • Our Vision: Hadoop as Multi-Workload Platform – from Single Use System to Multi Purpose Platform. HADOOP 1.0 (Batch Apps): MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage). HADOOP 2.0 (Batch, Interactive, Online, Streaming, …): MapReduce and others (data processing) on YARN (cluster resource management) on HDFS2 (redundant, highly-available & reliable storage). © Hortonworks Inc. 2013 - Confidential Page 6
  • Apache YARN Benefits The Data Operating System for Hadoop 2.0 Flexible Efficient Shared Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming Increase processing IN Hadoop on the same hardware while providing predictable performance & quality of service Provides a stable, reliable, secure foundation and shared operational services across multiple workloads Data Processing Engines Run Natively IN Hadoop BATCH MapReduce INTERACTIVE Tez ONLINE HBase STREAMING Storm, S4, … GRAPH Giraph MICROSOFT REEF SAS LASR, HPA OTHERS YARN: Cluster Resource Management HDFS2: Redundant, Reliable Storage © Hortonworks Inc. 2013 - Confidential Page 7
  • YARN: Efficiency with Shared Services Yahoo! leverages YARN 40,000+ nodes running YARN across over 365PB of data ~400,000 jobs per day for about 10 million hours of compute time Estimated a 60% – 150% improvement on node usage per day using YARN Eliminated Colo (~10K nodes) due to increased utilization For more details check out the YARN SOCC 2013 paper © Hortonworks Inc. 2013 - Confidential Page 8
  • YARN Basics © Hortonworks Inc. 2013 Page 9
  • Hadoop 2 - YARN Architecture  ResourceManager (RM): central agent – manages and allocates cluster resources  NodeManager (NM): per-node agent – manages and enforces node resource allocations  ApplicationMaster (AM): manages application lifecycle and task scheduling  Container: executes application logic  Client: submits the applications. (Diagram: Client → Resource Manager → Node Managers hosting the App Master and Containers, with job submission, node status, and resource request flows.) © Hortonworks Inc. 2013 - Confidential Page 10
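The submission flow on this slide can be sketched as a toy simulation in plain Java. This is not the real org.apache.hadoop.yarn API; all class and method names here are illustrative stand-ins to show who talks to whom:

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the YARN flow: the client submits to the
// ResourceManager, which launches the ApplicationMaster on a NodeManager;
// the AM then asks the RM for containers and launches them via NMs.
public class YarnFlowSketch {
    static class NodeManager {
        final List<String> running = new ArrayList<>();
        void launch(String container) { running.add(container); }
    }

    static class ResourceManager {
        final List<NodeManager> nodes;
        ResourceManager(List<NodeManager> nodes) { this.nodes = nodes; }
        // Accept an application: pick a node and launch the AM container there.
        ApplicationMaster submit(String appName) {
            nodes.get(0).launch(appName + "-AM");
            return new ApplicationMaster(this, appName);
        }
        // Allocation: hand back a node that can host a worker container.
        NodeManager allocate() { return nodes.get(nodes.size() - 1); }
    }

    static class ApplicationMaster {
        final ResourceManager rm;
        final String appName;
        int workers = 0;
        ApplicationMaster(ResourceManager rm, String appName) {
            this.rm = rm; this.appName = appName;
        }
        // Request a container from the RM, then launch it via the chosen NM.
        void runWorker() {
            rm.allocate().launch(appName + "-worker-" + (++workers));
        }
    }

    public static void main(String[] args) {
        List<NodeManager> nodes = List.of(new NodeManager(), new NodeManager());
        ResourceManager rm = new ResourceManager(nodes);
        ApplicationMaster am = rm.submit("demo");
        am.runWorker();
        am.runWorker();
        System.out.println(nodes.get(0).running); // [demo-AM]
        System.out.println(nodes.get(1).running); // [demo-worker-1, demo-worker-2]
    }
}
```

The point of the sketch is the division of labor: only the RM decides placement, only NMs actually start processes, and the AM owns per-application scheduling.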
  • Containers • Capability – Memory, CPU • Container Request – Capability, Host, Rack, Priority, relaxLocality • Container Launch Context – LocalResources – Resources needed to execute container application – Environment variables – Example: classpath – Command to execute • Launch the container – Client requests Resource Manager to launch Application Master Container – Application Master requests Node Manager to launch Application Containers © Hortonworks Inc. 2013 - Confidential Page 11
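The request fields the slide lists (capability, host, rack, priority, relaxLocality) can be modeled in a few lines of plain Java. The real types live in org.apache.hadoop.yarn.api.records and org.apache.hadoop.yarn.client.api; the records and the simplified matching rule below are illustrative only:

```java
import java.util.List;

// Toy model of a YARN container request, mirroring the fields on the slide.
public class ContainerRequestSketch {
    // Capability: how much memory (MB) and how many virtual cores to ask for.
    record Capability(int memoryMb, int vcores) {}

    // A request names a capability, preferred hosts and racks, a priority,
    // and whether the scheduler may relax locality to rack or any node.
    record Request(Capability capability, List<String> hosts,
                   List<String> racks, int priority, boolean relaxLocality) {}

    // Simplified matching: a node satisfies the request if it is a
    // preferred host, or if locality may be relaxed.
    static boolean satisfiedBy(Request r, String node) {
        return r.hosts().contains(node) || r.relaxLocality();
    }

    public static void main(String[] args) {
        Request req = new Request(new Capability(1024, 1),
                List.of("node1"), List.of("/rack1"), 0, true);
        System.out.println(satisfiedBy(req, "node1")); // true: preferred host
        System.out.println(satisfiedBy(req, "node9")); // true: relaxLocality
    }
}
```

With relaxLocality set to false, the same request would only ever be satisfied on its preferred hosts, which is how the real scheduler lets applications trade locality for wait time.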
  • APIs • What APIs do I need to use? – Only three protocols: Application Client Protocol (Client to ResourceManager – application submission), Application Master Protocol (ApplicationMaster to ResourceManager – container allocation), Container Management Protocol (ApplicationMaster to NodeManager – container launch) • Use client libraries for all 3 actions – YarnClient, AMRMClient, NMClient in package org.apache.hadoop.yarn.client.api – Provides both synchronous and asynchronous libraries • Or use 3rd party libraries like Twill, REEF, Spring © Hortonworks Inc. 2013 - Confidential 12
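The slide's protocol-to-library mapping can be tabulated in code for quick reference. The enum below is just a summary of the slide, not part of any Hadoop API:

```java
// Summary of the three YARN protocols and the client library that wraps
// each one (per the slide; the libraries live in
// org.apache.hadoop.yarn.client.api and have *Async variants).
public class YarnProtocols {
    enum Protocol {
        APPLICATION_CLIENT("client -> ResourceManager (application submission)", "YarnClient"),
        APPLICATION_MASTER("ApplicationMaster -> ResourceManager (container allocation)", "AMRMClient"),
        CONTAINER_MANAGEMENT("ApplicationMaster -> NodeManager (container launch)", "NMClient");

        final String direction;
        final String clientLibrary;
        Protocol(String direction, String clientLibrary) {
            this.direction = direction;
            this.clientLibrary = clientLibrary;
        }
    }

    public static void main(String[] args) {
        for (Protocol p : Protocol.values()) {
            System.out.println(p + ": " + p.direction + " via " + p.clientLibrary);
        }
    }
}
```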
  • Developing a Real World YARN Application © Hortonworks Inc. 2013 Page 13
  • Jeff Gullick – Principal Solutions Engineer Shane Pratt - Sr. Director, Hadoop and Analytics COE Jim Falgout – Chief Technologist Actian and YARN 12/18/13
  • Actian “Dataflow” Technology …a series of analytic, ETL, and data quality applications based on parallel dataflow technology that eliminate performance bottlenecks in data-intensive operations. Actian “Dataflow” Applications: • Native Hadoop Execution: Alternative execution engine to MapReduce that runs local to the Hadoop cluster • High Throughput: Pipeline parallelism executes up to 500% faster than MapReduce; parallel readers and writers • Auto-Scaling: Performance dynamically scales with increased core counts and increased Hadoop nodes • Cost Efficient: Designed for maximum performance from commodity multicore servers and Hadoop clusters • Fully Integrated: A single platform and user experience for ETL, data quality, and data science • Easy to Implement: GUI and API-level interfaces; eliminates the need to understand MapReduce or complex parallel processing. (Diagram: Dataflow apps scale up on multicore servers and out across Hadoop clusters.) Confidential © 2013 Actian Corporation 15
  • Why Actian Needs YARN….  Potential resource competition concerns between MapReduce applications and Dataflow on the Hadoop cluster were preventing market uptake of the technology Confidential © 2012 Actian Corporation 16
  • Hortonworks & Actian Analytics and DataPrep for Hadoop Reference Architecture. (Diagram: SOURCE DATA – Databases / Marts, Warehouses, Enterprise Applications, Cloud / SaaS Applications, Structured & Unstructured Data → MASSIVELY PARALLEL EXTRACT/LOAD → DATA REFINEMENT: DISCOVER, TRANSFORM, STANDARDIZE, MATCH-MERGE on the Dataflow Engine over YARN, with AMBARI, HDFS API, HBASE API, and HCATALOG; DEVELOPMENT METHODS: VISUAL UI OR NATIVE HADOOP PARALLEL EXECUTION, JAVA, JAVASCRIPT, OPEN API/SDK → ANALYTIC DATASTORES, MDM, EDW.) © Hortonworks Inc. 2013 - Confidential
  • Developing with YARN  Getting started • Investigation  Installed HDP 2.0 on development cluster  Read Hortonworks blogs on YARN (very informative!) http://hortonworks.com/blog/introducing-apache-hadoop-yarn/  Looked at sample YARN application code  Browsed MapReduce source code • Prototyping  Started with getting an Application Master spawned  Relatively easy way to get started with the YARN API’s  Also helped to learn about containers and shared resources • Project implemented by two senior developers Page
  • Developing with YARN  Design • Using AMRMAsnycClient  Handles communication with resource manager  Provides callbacks for asynchronous container events (allocations, completions, …) • Using NMClientAsync  Handles communications with multiple node managers  Callbacks for asynchronous container events • Configuration  Reusing existing Actian web application for configuration • Application Specific History Service  Reusing existing Actian web application for job monitoring Page
  • Developing with YARN  Design • Application Master  Started per Actian Dataflow job (batch mode)  Determines resources needed; acquires from ResourceManager  Elastically allocates resources according to job needs  Launches worker containers via NodeManager(s)  Monitors progress and cleans up as job completes • Application Containers  Execute distributed Dataflow graphs within launched container(s)  Provide runtime status and statistics to history server  Statistics include items like: records processed, I/O stats, … Page
  • Developing with YARN (architecture diagram): the client launches the Application Master through the Resource Manager; the Application Master allocates resources and launches worker containers via the Node Managers; a Config/History Server web app links to the Application Master and collects stats from the running Application Containers. Page
  • Developing with YARN  Phases of Development • Job launching  Integrated Actian Dataflow client with YARN to launch application master  Built application master: allocate resources; launch workers  Built worker containers  Result: able to launch Dataflow jobs via YARN  1 senior developer; approximately 5 weeks (including investigation) • Configuration and Monitoring  Modified existing web application to handle Dataflow configuration items specific to YARN  Collect and display runtime stats from executing jobs  Provide history service  Log viewing  1 senior developer; approximately 3 weeks Page
  • Developing with YARN  Lessons Learned • Distributed cache allows frictionless install of Actian software on cluster worker nodes • The sample YARN application is too simple • (Hortonworks now has a MemcacheD on YARN sample app) • MapReduce code provides better coverage but is complex • An application history server is required  We hoped to not have install/run any Actian servers on cluster  A JIRA issue exists to provide a history service as part of YARN • Configuration can be supplied via Hadoop config files  This is messy (how to keep coherent across the cluster …)  Applications should integrate with Hadoop management layers (i.e. Ambari) Page
  • Developing with YARN  Next Steps • Integrate with Hadoop management & configuration capabilities • Utilize YARN History Service when it is available • More complex resource allocation schemes Confidential © 2012 Actian Corporation 24
  • Thank You www.actian.com facebook.com/actiancorp @actiancorp CTA: For more information on Hadoop solutions from Actian, please visit: www.actian.com/hadoop Questions on Dataflow? Email: Shane.Pratt@actian.com Confidential © 2012 Actian Corporation 25
  • YARN – Getting Started © Hortonworks Inc. 2013 Page 27
  • Hortonworks.com/get-started/YARN Step 1 Step 2 Step 3 • Understand the motivations and YARN architecture • Explore example applications on YARN • Examine real world applications on YARN  Setup HDP 2.0 environment  Leverage Sandbox  Review Sample Code & Execute Simple YARN Application  https://github.com/hortonworks/simple-yarn-app BUILD FLEXIBLE, SCALABLE, RESILIENT & POWERFUL APPLICATIONS TO RUN IN HADOOP © Hortonworks Inc. 2013 - Confidential Page 28
  • YARN – Road Ahead © Hortonworks Inc. 2013 Page 29
  • YARN – Roadmap • ResourceManager High Availability – Automatic failover – Work preserving failover • Scheduler Enhancements – SLA Driven Scheduling, Low latency allocations – Multiple resource types – disk/network/GPUs/affinity • Rolling upgrades • Generic History Service • Long running services – Better support to running services like HBase – Service Discovery • More utilities/libraries for Application Developers – Failover/Checkpointing © Hortonworks Inc. 2013 - Confidential Page 30
  • 1-2-3 Getting Started with YARN http://hortonworks.com/get-started/YARN Get started with Hortonworks Sandbox http://hortonworks.com/sandbox/ Code walk through – Jan. 22nd 2014 at 9am PT Register at Hortonworks.com/webinars/yarn-code Get involved! YARN is part of a community driven open source project and you can help accelerate the innovation! Follow Us: @hortonworks @actiancorp