Hadoop YARN overview

  1. Hadoop YARN Arnon Rotem-Gal-Oz Director of technology research, Amdocs
  2. let’s start by reviewing Map/Reduce
  3. ok, ok - K-Means
  4. K=3
  5. def perform_kmeans():
         isStillMoving = 1
         initialize_centroids()
         while isStillMoving:
             recalculate_centroids()
             isStillMoving = update_clusters()
         return
  6. Put differently:
     In the map phase
     • Read the cluster centers into memory from a file/HBase.
     • For each input key/value pair, iterate over the cluster centers.
     • Measure the distances and keep the center nearest to the vector.
     • Write the cluster center with its vector to the context.
     In the reduce phase
     • Iterate over the value vectors and calculate the average vector (sum the vectors and divide each component by the number of vectors received).
     • This is the new center; save it to a file/HBase.
     • Check whether the difference between runs is < epsilon.
     Run this whole damn thing again and again until diff < epsilon.
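The map and reduce phases from slides 5-6 can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names, the in-memory point list, and the `epsilon` and `max_rounds` defaults are all assumptions made for the example, and the centers live in a variable rather than in a file or HBase.

```python
import math
from collections import defaultdict


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def map_phase(points, centers):
    """Map: emit one (center index, vector) pair per input vector."""
    for point in points:
        yield nearest_center(point, centers), point


def reduce_phase(pairs, centers):
    """Reduce: average the vectors assigned to each center."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centers = list(centers)  # a center with no vectors keeps its position
    for index, vectors in groups.items():
        new_centers[index] = tuple(sum(dim) / len(vectors) for dim in zip(*vectors))
    return new_centers


def kmeans(points, centers, epsilon=1e-6, max_rounds=100):
    """Rerun map + reduce until no center moves by epsilon or more."""
    for _ in range(max_rounds):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:
            break
    return centers


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
final_centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Each pass through the loop corresponds to one complete Map/Reduce job, which is the cost the deck complains about: an iterative algorithm pays the full job start-up and I/O overhead on every round.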
  7. So you want to implement it differently, with something that doesn’t rely on Map/Reduce
  8. Resource management is tied to Map/Reduce
  9. Yet Another Resource Negotiator
  10. YARN takes Hadoop beyond Map/Reduce
  11. What is an OS?
  12. “…the function of the operating system is to present the user with the equivalent of an extended machine or virtual machine that is easier to program than the underlying hardware”
  13. “…The Operating System as a Resource Manager… the job of the operating system is to provide for an orderly and controlled allocation of the processors, memories, and/or devices among the various programs competing for them”
  14. With YARN Hadoop becomes a distributed OS
  15. The Resource Manager is essentially a scheduler
  16. Containers are allocations of physical resources
  17. Each app instance spawns an Application Master (container 0) to negotiate resources and monitor app progress (tasks)
  18. Node Managers monitor nodes and manage container lifecycles
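The division of labor on slides 15-18 can be illustrated with a toy model: a scheduler that carves containers (memory plus cores) out of the free capacity its nodes report. Everything here is hypothetical, including the class names, the first-fit policy, and the resource figures; the real Resource Manager uses pluggable schedulers with queues and policies that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free capacity a Node Manager would report for its machine."""
    name: str
    free_memory_mb: int
    free_vcores: int


@dataclass
class Container:
    """An allocation of physical resources on one node."""
    container_id: int
    node: str
    memory_mb: int
    vcores: int


class ToyScheduler:
    """Stand-in for the Resource Manager: it only tracks and grants capacity."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.next_id = 0

    def allocate(self, memory_mb, vcores):
        """First-fit: grant a container on the first node with enough room."""
        for node in self.nodes.values():
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                container = Container(self.next_id, node.name, memory_mb, vcores)
                self.next_id += 1
                return container
        return None  # nothing fits right now; the caller has to ask again later

    def release(self, container):
        """Return a container's resources to its node."""
        node = self.nodes[container.node]
        node.free_memory_mb += container.memory_mb
        node.free_vcores += container.vcores


scheduler = ToyScheduler([Node("node-a", 4096, 4), Node("node-b", 2048, 2)])
app_master = scheduler.allocate(1024, 1)   # container 0 hosts the Application Master
worker_1 = scheduler.allocate(3072, 2)     # still fits on node-a
worker_2 = scheduler.allocate(2048, 2)     # node-a is full, so node-b
rejected = scheduler.allocate(1024, 1)     # no memory left anywhere
scheduler.release(worker_1)
retry = scheduler.allocate(1024, 1)        # fits again after the release
```

Note that the scheduler never runs anything itself. It only hands out allocations; starting processes inside them is the Node Managers' job.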
  19. Application Initiation (or “how to get an App running in 11 easy steps”)
  20. 1. Client submits a job/app
  21. 2. Resource Manager (RM) provides Application Id
  22. 3. Client provides context (queue, resource requirements, files, security tokens etc.)
  23. 4. RM asks Node Manager to launch Application Master
  24. 5. Node Manager launches Application Master
  25. 6. Application Master registers with RM
  26. 7. RM shares resource capabilities with Application Master
  27. 8. Application Master requests containers
  28. 9. RM assigns containers based on policies and available resources
  29. 10. Application Master contacts assigned Node Managers to instantiate containers (passing container contexts)
  30. 11. Node Managers initiate container(s)
  31. Congratulations, your Application is now running
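The eleven steps on slides 20-30 can be traced end to end with a small simulation. This is a sketch under loud assumptions: the classes, method names, round-robin placement, and capability numbers are invented for illustration and are not the YARN client or Application Master APIs. Comments map each call to the step numbers used on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    name: str
    running: list = field(default_factory=list)

    def launch(self, container_id, command):
        """Steps 5 and 11: start a process inside a container on this node."""
        self.running.append((container_id, command))


@dataclass
class ToyResourceManager:
    node_managers: list
    apps: dict = field(default_factory=dict)
    next_app_id: int = 1
    next_container_id: int = 0

    def new_application_id(self):
        """Step 2: hand the client an application id."""
        app_id = f"app_{self.next_app_id:04d}"
        self.next_app_id += 1
        return app_id

    def _assign(self, count):
        """Place containers round-robin (a stand-in for real policies)."""
        assigned = []
        for _ in range(count):
            index = self.next_container_id % len(self.node_managers)
            assigned.append((self.next_container_id, self.node_managers[index]))
            self.next_container_id += 1
        return assigned

    def submit(self, app_id, context):
        """Steps 3-5: accept the context and have a node launch the master."""
        self.apps[app_id] = {"context": context, "state": "ACCEPTED"}
        container_id, node = self._assign(1)[0]
        node.launch(container_id, context["am_command"])

    def register_master(self, app_id):
        """Steps 6-7: the master registers and learns the cluster's limits."""
        self.apps[app_id]["state"] = "RUNNING"
        return {"max_memory_mb": 8192, "max_vcores": 4}  # made-up numbers

    def request_containers(self, app_id, count):
        """Steps 8-9: the master asks, the Resource Manager assigns."""
        return self._assign(count)


def run_application(rm, context, task_command, task_count):
    app_id = rm.new_application_id()                      # steps 1-2
    rm.submit(app_id, context)                            # steps 3-5
    capabilities = rm.register_master(app_id)             # steps 6-7
    grants = rm.request_containers(app_id, task_count)    # steps 8-9
    for container_id, node in grants:                     # steps 10-11
        node.launch(container_id, task_command)
    return app_id, capabilities


nm1, nm2 = ToyNodeManager("nm1"), ToyNodeManager("nm2")
rm = ToyResourceManager([nm1, nm2])
context = {"queue": "default", "am_command": "app-master"}
app_id, capabilities = run_application(rm, context, "worker", task_count=3)
```

The point to notice is in the last loop: the Resource Manager only grants containers, while the Application Master talks to the Node Managers directly to get them started.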
  32. Application Progress (or “it doesn’t end there”)
  33. Monitoring: 1. Continuous heartbeat & progress report 2. Request container status 3. Status response
  34. Lifecycle: 1. Heartbeat also carries requests for new container allocations / container releases 2. Application Master connects to Node Manager to activate allocated containers 3. Container releases go through the Resource Manager
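Slides 33-34 describe a single heartbeat message doing triple duty: progress report, request for more containers, and container release. A toy version of that exchange is sketched below. The `Heartbeat` fields, the slot-based capacity model, and the releases-before-grants ordering are assumptions of this example, not a description of the actual YARN protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    """What the Application Master sends on every beat (hypothetical shape)."""
    app_id: str
    progress: float
    ask: int = 0                                   # new containers wanted
    release: list = field(default_factory=list)    # container ids to give back


@dataclass
class ToyResourceManager:
    free_slots: int
    next_container_id: int = 1
    held: dict = field(default_factory=dict)       # app id -> set of container ids
    progress: dict = field(default_factory=dict)   # app id -> last reported progress

    def heartbeat(self, beat):
        """Record progress, process releases, then grant what capacity allows."""
        containers = self.held.setdefault(beat.app_id, set())
        self.progress[beat.app_id] = beat.progress
        for container_id in beat.release:
            if container_id in containers:         # ignore ids the app does not hold
                containers.remove(container_id)
                self.free_slots += 1
        granted = []
        for _ in range(min(beat.ask, self.free_slots)):
            granted.append(self.next_container_id)
            self.next_container_id += 1
        containers.update(granted)
        self.free_slots -= len(granted)
        return granted


rm = ToyResourceManager(free_slots=2)
first = rm.heartbeat(Heartbeat("app_0001", progress=0.0, ask=3))
second = rm.heartbeat(Heartbeat("app_0001", progress=0.5, ask=1, release=[1]))
```

In this sketch the first beat asks for three containers but only two slots exist, so the master gets a partial grant and has to keep asking on later beats; the second beat frees one container and immediately receives a new one.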
  35. YARN HA (or “and you thought we were done”)
  36. Still a work in progress • Resource Manager - YARN-149 (patch available) • Application Master - YARN-1489 • Node Manager - YARN-1336
  37. 1. Active/Standby
  38. YARN Limitation: Manages memory & CPU but not Disk IO or Network
  39. YARN Limitation: Batch focus
  40. YARN Limitation: Not daemon friendly
  41. YARN Limitation: Relatively complex to develop for
  42. Apache Slider is an effort to mitigate YARN gaps
  43. Further Reading
      • YARN source code: common/tree/trunk/hadoop-yarn-project/hadoop-yarn
      • Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Hadoop 2 by Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec & Jeff Markham
      • Apache YARN site: yarn/hadoop-yarn-site/YARN.html
      • Apache Slider
  44. Image attributions: slide 1 - Hadoop logo - slide 3 - adapted from Pulp Fiction slide 4-5 slide 10 - Elizabeth Moreau slide 11 slide 14 slide 32 slide 39 slide 40 - slide 41 Slide 42 By Christina Quinn Slide 43 By Paul L Dineen

Editor's Notes

  1. Also, the partitioning into map and reduce is fixed and rigid (i.e. inefficient). There are other limitations too, like a max cluster size of ~5000 nodes and up to 4000 processes (not a problem for most of us, but still :) )
  2. Can’t effectively balance stream processing; doesn’t handle point queries well (Impala, HBase)
  3. Doesn’t handle long-running processes well (related to “batch focus”). Doesn’t work well with other daemons (see Cloudera Llama)
  4. Use frameworks: Spark, Pig etc. Or, when you want to develop an app: Apache Twill, Spring Hadoop etc.
  5. Still incubating (as of June 2014). 1. Simplify taking existing code onto YARN 2. Full capabilities of a YARN application - start/stop/multi-version, monitor, expand, security etc. (integrated with Ambari -> nice move by Hortonworks to strengthen Ambari). Slider allows long-running applications, real-time services and online applications to easily integrate into a Hadoop environment.