Apache Hadoop
YARN - Under the Hood


      Sharad Agarwal
    sharad@apache.org
About me
• Apache Foundation
  – Hadoop Committer and PMC member
  – Hadoop MR contributor for over 4 years
  – Author of Hadoop Nextgen/YARN core and MR
    Application Master
  – Organizer of Hadoop Bangalore Meetups

• Head of Technology Platforms @InMobi
  – Formerly Architect @Yahoo!
• Component
              Architecture

YARN CORE   • Concurrency

            • State
              Management
Concurrent
 Systems
Concurrent System ?
// Loop through all expired items in the queue
//
// Need to lock the JobTracker here since we are
// manipulating it's data-structures via
// ExpireTrackers.run -> JobTracker.lostTaskTracker ->
// JobInProgress.failedTask -> JobTracker.markCompleteTaskAttempt
// Also need to lock JobTracker before locking 'taskTracker' &
// 'trackerExpiryQueue' to prevent deadlock




// This method is synchronized to make sure that the locking order
// "taskTrackers lock followed by faultyTrackers.potentiallyFaultyTrackers
// lock" is under JobTracker lock to avoid deadlocks.
JobTracker         JIP                TIP   Scheduler

 Heartbeat
 Request




Heartbeat
Response




                         JobTracker Global Lock
Highly Concurrent Systems
• scales much better (if done right)
• makes effective use of multi-core hardwares
• even more important in master slave
  architectures
• overall job latencies of the systems can come
  down drastically
• managing eventual consistency of states hard
• need for a systemic framework to manage this
Event Queue                   Event
                                                        Dispatcher




        Component          Component        Component
           A                  B                N

•   Mutations only via events
•   Components only expose Read APIs
•   Use Re-entrant locks
•   Components follow clear lifecycle


                              Event Model
Heartbeat
              Listener       Event Q            NM Info

 Heartbeat
 Request




                                       Get commands




Heartbeat
Response




                 Asynchronous
               Heartbeat Handling
RM                            NM

                 Heartbeats


                     MR Scheduler                 Container Launcher




                                    Event Queue




Client Handler    Job                Task         Attempt          Task Listener


                                                                          Heartbeats


Job Client                         MR AM                               Task
State
Management
Looks Familiar ?
public synchronized void updateTaskStatus(TaskInProgress tip,
                       TaskStatus status) {

  double oldProgress = tip.getProgress(); // save old progress
  boolean wasRunning = tip.isRunning();
                                                                              Very Hard to Maintain
  boolean wasComplete = tip.isComplete();
  boolean wasPending = tip.isOnlyCommitPending();                             Debugging even harder
  TaskAttemptID taskid = status.getTaskID();
  boolean wasAttemptRunning = tip.isAttemptRunning(taskid);

  // If the TIP is already completed and the task reports as SUCCEEDED then
  // mark the task as KILLED.
  // In case of task with no promotion the task tracker will mark the task
  // as SUCCEEDED.
  // User has requested to kill the task, but TT reported SUCCEEDED,
  // mark the task KILLED.
  if ((wasComplete || tip.wasKilled(taskid)) &&
     (status.getRunState() == TaskStatus.State.SUCCEEDED)) {
    status.setRunState(TaskStatus.State.KILLED);
  }

  // If the job is complete and a task has just reported its
  // state as FAILED_UNCLEAN/KILLED_UNCLEAN,
  // make the task's state FAILED/KILLED without launching cleanup attempt.
  // Note that if task is already a cleanup attempt,
  // we don't change the state to make sure the task gets a killTaskAction
  if ((this.isComplete() || jobFailed || jobKilled) &&
      !tip.isCleanupAttempt(taskid)) {
    if (status.getRunState() == TaskStatus.State.FAILED_UNCLEAN) {
      status.setRunState(TaskStatus.State.FAILED);
    } else if (status.getRunState() == TaskStatus.State.KILLED_UNCLEAN) {
      status.setRunState(TaskStatus.State.KILLED);
    }
  }
Complex State Management
• Light weight State Machines Library
  – Declarative way of specifying the state Transitions
  – Invalid transitions are handled automatically
  – Fits nicely with the event model
  – Debuggability is drastically improved. Lineage of
    object states can easily be determined
  – Handy while recovering the state
Declarative State Machine
StateMachineFactory<JobImpl, JobState, JobEventType, JobEvent> stateMachineFactory
    = new StateMachineFactory<JobImpl, JobState, JobEventType, JobEvent>(JobState.NEW)

     // Transitions from NEW state
     .addTransition(JobState.NEW, JobState.NEW,
        JobEventType.JOB_DIAGNOSTIC_UPDATE,
        DIAGNOSTIC_UPDATE_TRANSITION)
     .addTransition(JobState.NEW, JobState.NEW,
        JobEventType.JOB_COUNTER_UPDATE, COUNTER_UPDATE_TRANSITION)
     .addTransition
        (JobState.NEW,
        EnumSet.of(JobState.INITED, JobState.FAILED),
        JobEventType.JOB_INIT,
        new InitTransition())
     .addTransition(JobState.NEW, JobState.KILLED,
        JobEventType.JOB_KILL,
        new KillNewJobTransition())
YARN: New Possibilities
•   Open MPI - MR-2911
•   Master-Worker – MR-3315
•   Distributed Shell
•   Graph processing – Giraph-13
•   BSP – HAMA-431
•   CEP – Storm
     • https://github.com/nathanmarz/storm/issues/74
• Iterative processing - Spark
     • https://github.com/mesos/spark-yarn/
Thank You!
@twitter: sharad_ag

YARN Under The Hood

  • 1.
    Apache Hadoop YARN -Under the Hood Sharad Agarwal sharad@apache.org
  • 2.
    About me • ApacheFoundation – Hadoop Committer and PMC member – Hadoop MR contributor for over 4 years – Author of Hadoop Nextgen/YARN core and MR Application Master – Organizer of Hadoop Bangalore Meetups • Head of Technology Platforms @InMobi – Formerly Architect @Yahoo!
  • 3.
    • Component Architecture YARN CORE • Concurrency • State Management
  • 4.
  • 5.
    Concurrent System ? //Loop through all expired items in the queue // // Need to lock the JobTracker here since we are // manipulating it's data-structures via // ExpireTrackers.run -> JobTracker.lostTaskTracker -> // JobInProgress.failedTask -> JobTracker.markCompleteTaskAttempt // Also need to lock JobTracker before locking 'taskTracker' & // 'trackerExpiryQueue' to prevent deadlock // This method is synchronized to make sure that the locking order // "taskTrackers lock followed by faultyTrackers.potentiallyFaultyTrackers // lock" is under JobTracker lock to avoid deadlocks.
  • 6.
    JobTracker JIP TIP Scheduler Heartbeat Request Heartbeat Response JobTracker Global Lock
  • 7.
    Highly Concurrent Systems •scales much better (if done right) • makes effective use of multi-core hardwares • even more important in master slave architectures • overall job latencies of the systems can come down drastically • managing eventual consistency of states hard • need for a systemic framework to manage this
  • 8.
    Event Queue Event Dispatcher Component Component Component A B N • Mutations only via events • Components only expose Read APIs • Use Re-entrant locks • Components follow clear lifecycle Event Model
  • 9.
    Heartbeat Listener Event Q NM Info Heartbeat Request Get commands Heartbeat Response Asynchronous Heartbeat Handling
  • 10.
    RM NM Heartbeats MR Scheduler Container Launcher Event Queue Client Handler Job Task Attempt Task Listener Heartbeats Job Client MR AM Task
  • 11.
  • 12.
    Looks Familiar ? publicsynchronized void updateTaskStatus(TaskInProgress tip, TaskStatus status) { double oldProgress = tip.getProgress(); // save old progress boolean wasRunning = tip.isRunning(); Very Hard to Maintain boolean wasComplete = tip.isComplete(); boolean wasPending = tip.isOnlyCommitPending(); Debugging even harder TaskAttemptID taskid = status.getTaskID(); boolean wasAttemptRunning = tip.isAttemptRunning(taskid); // If the TIP is already completed and the task reports as SUCCEEDED then // mark the task as KILLED. // In case of task with no promotion the task tracker will mark the task // as SUCCEEDED. // User has requested to kill the task, but TT reported SUCCEEDED, // mark the task KILLED. if ((wasComplete || tip.wasKilled(taskid)) && (status.getRunState() == TaskStatus.State.SUCCEEDED)) { status.setRunState(TaskStatus.State.KILLED); } // If the job is complete and a task has just reported its // state as FAILED_UNCLEAN/KILLED_UNCLEAN, // make the task's state FAILED/KILLED without launching cleanup attempt. // Note that if task is already a cleanup attempt, // we don't change the state to make sure the task gets a killTaskAction if ((this.isComplete() || jobFailed || jobKilled) && !tip.isCleanupAttempt(taskid)) { if (status.getRunState() == TaskStatus.State.FAILED_UNCLEAN) { status.setRunState(TaskStatus.State.FAILED); } else if (status.getRunState() == TaskStatus.State.KILLED_UNCLEAN) { status.setRunState(TaskStatus.State.KILLED); } }
  • 14.
    Complex State Management •Light weight State Machines Library – Declarative way of specifying the state Transitions – Invalid transitions are handled automatically – Fits nicely with the event model – Debuggability is drastically improved. Lineage of object states can easily be determined – Handy while recovering the state
  • 15.
    Declarative State Machine StateMachineFactory<JobImpl,JobState, JobEventType, JobEvent> stateMachineFactory = new StateMachineFactory<JobImpl, JobState, JobEventType, JobEvent>(JobState.NEW) // Transitions from NEW state .addTransition(JobState.NEW, JobState.NEW, JobEventType.JOB_DIAGNOSTIC_UPDATE, DIAGNOSTIC_UPDATE_TRANSITION) .addTransition(JobState.NEW, JobState.NEW, JobEventType.JOB_COUNTER_UPDATE, COUNTER_UPDATE_TRANSITION) .addTransition (JobState.NEW, EnumSet.of(JobState.INITED, JobState.FAILED), JobEventType.JOB_INIT, new InitTransition()) .addTransition(JobState.NEW, JobState.KILLED, JobEventType.JOB_KILL, new KillNewJobTransition())
  • 16.
    YARN: New Possibilities • Open MPI - MR-2911 • Master-Worker – MR-3315 • Distributed Shell • Graph processing – Giraph-13 • BSP – HAMA-431 • CEP – Storm • https://github.com/nathanmarz/storm/issues/74 • Iterative processing - Spark • https://github.com/mesos/spark-yarn/
  • 17.

Editor's Notes

  • #6 Does this look fragile. No one dares to touch this piece
  • #13 Does this look fragile. No one dares to touch this piece
  • #15 New Features and their impact in State Transitions can be articulated, discussed and designed