YARN - Next Generation Compute Platform for Hadoop
Latest information on Apache Hadoop YARN from Big Data Camp LA by Bikas Saha


Published in: Technology


Transcript

  • 1. Apache Hadoop Next Generation Compute Platform - YARN. Bikas Saha (@bikassaha). © Hortonworks Inc. 2013
  • 2. 1st Generation Hadoop: Batch Focus
      - HADOOP 1.0: Built for Web-Scale Batch Apps
      - All other usage patterns MUST leverage same infrastructure
      - Forces Creation of Silos to Manage Mixed Workloads
      - [Diagram: "Single App" boxes labeled BATCH, INTERACTIVE, and ONLINE, each on top of HDFS]
  • 3. Hadoop 1 Architecture
      - JobTracker: Manage Cluster Resources & Job Scheduling
      - TaskTracker: Per-node agent; Manage Tasks
  • 4. Hadoop 1 Limitations
      - Scalability: Max Cluster size ~5,000 nodes; Max concurrent tasks ~40,000; Coarse Synchronization in JobTracker
      - Availability: Failure Kills Queued & Running Jobs
      - Hard partition of resources into map and reduce slots: Non-optimal Resource Utilization
      - Lacks Support for Alternate Paradigms and Services: Iterative applications in MapReduce are 10x slower
  • 5. Hadoop 2 - YARN Architecture
      - ResourceManager (RM): Central agent; Manages and allocates cluster resources
      - NodeManager (NM): Per-Node Agent; Manage Tasks, Enforce Allocations
      - [Diagram: Client, Resource Manager, Node Managers, App Mstr, Container; arrows labeled MapReduce Status, Job Submission, Node Status, Resource Request]
  • 6. Apache YARN: The Data Operating System for Hadoop 2.0
      - Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
      - Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
      - Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
      - Data Processing Engines Run Natively IN Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), MICROSOFT (REEF), SAS (LASR, HPA), OTHERS
      - YARN: Cluster Resource Management
      - HDFS2: Redundant, Reliable Storage
  • 7. 5 Key Benefits of YARN: 1. New Applications & Services 2. Improved cluster utilization 3. Scale 4. Experimental Agility 5. Shared Services
  • 8. YARN: Efficiency with Shared Services. Yahoo! leverages YARN:
      - 40,000+ nodes running YARN across over 365PB of data
      - ~400,000 jobs per day for about 10 million hours of compute time
      - Estimated a 60% to 150% improvement on node usage per day using YARN
      - Eliminated Colo (~10K nodes) due to increased utilization
  • 9. Key Improvements in YARN
      - Framework supporting multiple applications
        - Separate generic resource brokering from application logic
        - Define protocols/libraries and provide a framework for custom application development
        - Share same Hadoop Cluster across applications
      - Application Agility and Innovation
        - Using Protocol Buffers for RPC gives wire compatibility
        - MapReduce becomes an application in user space, unlocking safe innovation
        - Multiple versions of an app can co-exist, leading to experimentation
        - Easier upgrade of framework and applications
  • 10. Key Improvements in YARN
      - Scalability
        - Removed complex app logic from RM, scale further
        - State machine, message passing based loosely coupled design
      - Cluster Utilization
        - Generic resource container model replaces fixed Map/Reduce slots. Container allocations based on locality, memory (CPU coming soon)
        - Sharing cluster among multiple applications
      - Reliability and Availability
        - Simpler RM state makes it easier to save and restart (work in progress)
        - Application checkpoint can allow an app to be restarted. MapReduce application master saves state in HDFS.
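The utilization point on slides 4 and 10 (fixed map/reduce slots versus generic containers) can be illustrated with a little arithmetic. The sketch below is a toy model, not YARN code: the function names, the 8 GB node, and the 1 GB task size are all assumptions chosen for illustration.

```python
def slot_model_running(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Tasks that can run at once when capacity is split into typed slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def container_model_running(task_memory_mb, node_memory_mb):
    """Tasks that fit when any task may use any free memory (first-fit)."""
    free = node_memory_mb
    running = 0
    for need in task_memory_mb:
        if need <= free:
            free -= need
            running += 1
    return running


# Hypothetical node: 8 GB, configured as 4 map + 4 reduce slots of 1 GB each.
# Hypothetical workload: 8 pending map tasks of 1 GB each, no reduce tasks yet.
print(slot_model_running(8, 0, map_slots=4, reduce_slots=4))      # 4
print(container_model_running([1024] * 8, node_memory_mb=8192))   # 8
```

In this made-up workload the slot model leaves half the node idle because the reduce slots cannot be used by map tasks, while the container model fills the node.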
  • 11. YARN as Cluster Operating System
      - [Diagram: ResourceManager with Scheduler above a grid of NodeManagers; containers from several applications share the nodes]
      - Batch: map1.1, map1.2, reduce1.1
      - Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
      - Real-Time: nimbus0, nimbus1, nimbus2
  • 12. Multi-Tenancy with Capacity Scheduler
      - Queues
      - Economics as queue-capacity
        - Hierarchical Queues
      - SLAs
        - Preemption
      - Resource Isolation
        - Linux: cgroups
        - MS Windows: Job Control
        - Roadmap: Virtualization (Xen, KVM)
      - Administration
        - Queue ACLs
        - Run-time re-configuration for queues
        - Charge-backs
      - [Diagram: ResourceManager, Scheduler, Capacity Scheduler; Hierarchical Queues under root. Labels as extracted: Mrkting 20%, Dev 20%, Adhoc 10%, Prod 80%, DW 70%, Dev, Reserved, Prod, 10%, 20%, 70%, P0 70%, P1 30%]
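With hierarchical queues, a queue's share of the whole cluster is the product of the percentages along its path from the root. The sketch below works that out for one plausible reading of the slide 12 diagram. The tree layout is an assumption (the flattened transcript does not say which percentage belongs to which parent), and `absolute_capacities` is a hypothetical helper, not part of any scheduler API.

```python
from fractions import Fraction

# Assumed reading of the diagram: name -> (percent of parent, children).
TREE = {
    "Adhoc": (10, {}),
    "DW": (70, {
        "Dev": (10, {}),
        "Reserved": (20, {}),
        "Prod": (70, {"P0": (70, {}), "P1": (30, {})}),
    }),
    "Mrkting": (20, {"Dev": (20, {}), "Prod": (80, {})}),
}


def absolute_capacities(children, parent_share=Fraction(1), prefix="root"):
    """Return {queue_path: share of the whole cluster} for every leaf queue."""
    total = sum(pct for pct, _ in children.values())
    if total != 100:
        # This sketch insists that sibling shares add up to 100%.
        raise ValueError(f"children of {prefix} sum to {total}%, expected 100%")
    result = {}
    for name, (pct, grandchildren) in children.items():
        path = f"{prefix}.{name}"
        share = parent_share * Fraction(pct, 100)
        if grandchildren:
            result.update(absolute_capacities(grandchildren, share, path))
        else:
            result[path] = share
    return result


caps = absolute_capacities(TREE)
for path, share in sorted(caps.items()):
    print(f"{path}: {float(share) * 100:.1f}% of the cluster")
```

Under this reading, `root.DW.Prod.P0` ends up with 70% of 70% of 70% of the cluster, and the leaf shares add up to the whole cluster.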
  • 13. YARN Eco-system. There's an app for that... YARN App Marketplace!
      - Applications Powered by YARN
        - Apache Giraph: Graph Processing
        - Apache Hama: BSP
        - Apache Hadoop MapReduce: Batch
        - Apache Tez: Batch/Interactive
        - Apache S4: Stream Processing
        - Apache Samza: Stream Processing
        - Apache Storm: Stream Processing
        - Apache Spark: Iterative/Interactive applications
        - Elastic Search: Scalable Search
        - Cloudera Llama: Impala on YARN
        - DataTorrent: Data Analysis
        - HOYA: HBase on YARN
        - RedPoint: Data Management
      - Frameworks Powered By YARN
        - Weave by Continuity
        - REEF by Microsoft
        - Spring support for Hadoop 2
  • 14. YARN APIs & Client Libraries
      - Application Client Protocol: Client to RM interaction
        - Library: YarnClient
        - Application Lifecycle control
        - Access Cluster Information
      - Application Master Protocol: AM to RM interaction
        - Library: AMRMClient / AMRMClientAsync
        - Resource negotiation
        - Heartbeat to the RM
      - Container Management Protocol: AM to NM interaction
        - Library: NMClient / NMClientAsync
        - Launching allocated containers
        - Stop Running containers
      - Use external frameworks like Weave/REEF/Spring
  • 15. YARN Application Flow
      - [Diagram: Application Client, Resource Manager, NodeManager, Application Master, App Container]
      - Application Client to Resource Manager: Application Client Protocol (YarnClient)
      - Application Master to Resource Manager: Application Master Protocol (AMRMClient)
      - Application Master to NodeManager: Container Management Protocol (NMClient)
      - Application Client to Application Master: App Specific API
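The three interactions on slides 14 and 15 (client to RM, AM to RM, AM to NM) can be walked through with a small in-memory model. This is only a sketch of the flow: `ToyResourceManager`, `ToyNodeManager`, and `run_application_master` are hypothetical names, they do not correspond to YarnClient, AMRMClient, or NMClient calls, and the memory-only first-fit allocation is a simplification.

```python
class ToyNodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.running = []

    def launch(self, container_id, command):
        # Stands in for the container management step (AM to NM).
        self.running.append((container_id, command))


class ToyResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, memory_mb):
        """Return (container_id, node), or None if nothing fits right now."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                self.next_id += 1
                return f"container_{self.next_id}", node
        return None

    def submit_application(self, am_memory_mb, am_command):
        # Client to RM: the first container granted hosts the application master.
        grant = self.allocate(am_memory_mb)
        if grant is None:
            raise RuntimeError("no capacity for the application master")
        container_id, node = grant
        node.launch(container_id, am_command)
        return container_id


def run_application_master(rm, task_commands, task_memory_mb):
    """AM to RM to negotiate containers, then AM to NM to launch each task."""
    launched, pending = [], []
    for command in task_commands:
        grant = rm.allocate(task_memory_mb)
        if grant is None:
            pending.append(command)  # asked, but not granted yet
            continue
        container_id, node = grant
        node.launch(container_id, command)
        launched.append((container_id, node.name))
    return launched, pending


nodes = [ToyNodeManager("nm1", 2048), ToyNodeManager("nm2", 2048)]
rm = ToyResourceManager(nodes)
rm.submit_application(1024, "start-app-master")
launched, pending = run_application_master(
    rm, ["task-a", "task-b", "task-c", "task-d"], 1024
)
print(launched)  # three containers across nm1 and nm2
print(pending)   # ['task-d']: the toy cluster is full
```

The application master itself occupies a container, which is why only three of the four 1 GB tasks fit on the 4 GB toy cluster.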
  • 16. YARN Best Practices
      - Use provided Client libraries
      - Resource Negotiation
        - You may ask but you may not get what you want immediately.
        - Locality requests may not always be met.
        - Resources like memory/CPU are guaranteed.
      - Failure handling
        - Remember, anything can fail (or YARN can pre-empt your containers)
        - AM failures handled by YARN but container failures handled by the application.
      - Checkpointing
        - Check-point AM state for AM recovery.
        - If tasks are long running, check-point task state.
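Two of the practices on slide 16, handling container failures in the application and checkpointing AM state, can be sketched together. The code below is a minimal illustration under stated assumptions: state is checkpointed to a local JSON file (the slides mention HDFS for the MapReduce AM), `run_in_container` is a hypothetical callback that returns True on success, and the retry limit of 3 is arbitrary.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"done": [], "attempts": {}}
    with open(path) as handle:
        return json.load(handle)


def run_tasks(tasks, run_in_container, checkpoint_path, max_attempts=3):
    """Run each task, retrying failed containers and checkpointing progress."""
    state = load_checkpoint(checkpoint_path)
    for task in tasks:
        if task in state["done"]:
            continue  # finished before a restart; do not redo it
        while state["attempts"].get(task, 0) < max_attempts:
            state["attempts"][task] = state["attempts"].get(task, 0) + 1
            if run_in_container(task):
                state["done"].append(task)
                break
        save_checkpoint(checkpoint_path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "am-state.json")
    failed_once = set()

    def flaky_container(task):
        # Hypothetical container: task "b" fails on its first attempt only.
        if task == "b" and task not in failed_once:
            failed_once.add(task)
            return False
        return True

    print(run_tasks(["a", "b", "c"], flaky_container, checkpoint))
    # A restarted AM reloads the checkpoint and only runs the new task "d".
    print(run_tasks(["a", "b", "c", "d"], flaky_container, checkpoint))
```

Writing to a temporary file and renaming it means a crash during the write leaves the previous checkpoint intact rather than a half-written one.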
  • 17. YARN Best Practices
      - Cluster Dependencies
        - Try to make zero assumptions on the cluster.
        - Your application bundle should deploy everything required using YARN's local resources.
      - Client-only installs if possible
        - Simplifies cluster deployment and multi-version support
      - Securing your Application
        - YARN does not secure communications between the AM and its containers.
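Since slide 17 says YARN does not secure communications between the AM and its containers, the application has to do so itself. One possible approach, which is an assumption here and not something the slides prescribe, is to authenticate each message with an HMAC over a shared secret. The helper names are hypothetical, and how the secret reaches the containers is left to the application.

```python
import hashlib
import hmac
import json
import secrets


def sign_message(secret, payload):
    """Serialize payload and attach an HMAC-SHA256 tag computed with secret."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    tag = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return {"body": body.decode("utf-8"), "tag": tag}


def verify_message(secret, message):
    """Return the payload if the tag matches, otherwise raise ValueError."""
    body = message["body"].encode("utf-8")
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, message["tag"]):
        raise ValueError("message failed authentication")
    return json.loads(body)


# Hypothetical per-application secret generated by the AM.
secret = secrets.token_bytes(32)
message = sign_message(secret, {"task": "t1", "status": "done"})
print(verify_message(secret, message))

tampered = dict(message, body=message["body"].replace("done", "failed"))
try:
    verify_message(secret, tampered)
except ValueError as error:
    print("rejected:", error)
```

This authenticates messages but does not encrypt them; confidentiality would need a separate mechanism such as TLS.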
  • 18. Testing/Debugging your Application
      - MiniYARNCluster: Regression tests
      - Unmanaged AM: Support to run the AM outside of a YARN cluster for manual testing
      - Logs: Log aggregation support to push all logs into HDFS; Accessible via CLI, UI
  • 19. YARN Future Work
      - ResourceManager High Availability and Work-preserving restart
        - Work-in-Progress
      - Scheduler Enhancements
        - SLA Driven Scheduling, Low latency allocations
        - Multiple resource types: disk/network/GPUs/affinity
      - Rolling upgrades
      - Long running services
        - Better support for running services like HBase
        - Discovery of services, upgrades without downtime
      - More utilities/libraries for Application Developers
        - Failover/Checkpointing
  • 20. Key Take-Aways
      - YARN: Distributed Application Framework to build/run Multiple Applications (original being MapReduce)
      - YARN is completely Backwards Compatible for existing MapReduce apps
      - YARN Allows Different Applications to Share the Same Cluster
      - YARN Enables Fine Grained Resource Management via Generic Resource Containers
      - YARN Provides Better Control over Application Upgrades via Wire Compatibility
  • 21. Apache YARN: The Data Operating System for Hadoop 2.0 (same content as slide 6)
  • 22. Thank you! Download Sandbox: Experience Apache Hadoop. Both 2.0 and 1.x Versions Available! http://hortonworks.com/products/hortonworks-sandbox/ Additional Questions?