Mumak: Using Simulation for Large-scale Distributed System Verification and Debugging
Hong Tang
2009.10 - Hadoop User Group
Outline
- Motivations
- Overview and Status
- Architecture
- Demo
- Lessons and Experiences
- Conclusions and Future Work
Motivations
- Large-scale distributed systems are hard to verify and debug:
  - Cannot afford a 2000-node cluster for every developer, feature enhancement, and bug fix
  - Running benchmarks is time consuming
  - Hard to reproduce production workloads
  - Hard to reproduce corner-case conditions
Motivations (cont.)
- The JobTracker is a fertile area for experimentation:
  - Scheduling policies (we have four schedulers already)
  - Synergy with HDFS block placement policies
  - Speculative execution policies
  - We want more people to help us innovate!
- But the JobTracker is too complex to modify correctly:
  - Many factors to consider: fairness, capacity/SLA guarantees, data locality, load balance, failure handling and recovery, etc.
  - Many control knobs in the current implementation, with subtle interactions
Mumak
- Discrete-event simulation (a minimal sketch follows this list):
  - Can simulate a cluster with thousands of nodes in one process
  - Does not perform actual I/O or computation
  - The virtual clock "spins" faster than the wall clock
  - Can reproduce behavior/performance with a degree of confidence
- Plugs in the real JobTracker and Scheduler:
  - No need to reimplement the scheduling policies
  - Inherits both the features and the bugs of the JobTracker and Scheduler
- Simulates all conditions of a production cluster:
  - Workload and cluster configuration generated by Rumen
  - Job submission, inter-arrival times, dependencies, high-RAM jobs, task execution
  - All kinds of failures and the failure recovery logic
  - Resource contention
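The core mechanism is easy to picture. Below is a minimal, hypothetical sketch of a discrete-event loop with a virtual clock in Java; the names (SimpleSimulationEngine, SimEvent) are illustrative, not Mumak's actual classes. Because event handlers do no real I/O or computation, the clock jumps straight from one event timestamp to the next, which is why virtual time runs far ahead of wall-clock time.

    import java.util.PriorityQueue;

    // Minimal sketch of a virtual-clock event loop (illustrative names only).
    public class SimpleSimulationEngine {
        // An event knows the virtual time at which it fires.
        interface SimEvent {
            long timestamp();                       // virtual time in ms
            void process(SimpleSimulationEngine e); // may enqueue future events
        }

        private final PriorityQueue<SimEvent> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.timestamp(), b.timestamp()));
        private long virtualClock = 0;

        public void submit(SimEvent e) { queue.add(e); }
        public long now() { return virtualClock; }

        // Pop the earliest event, jump the clock to its timestamp, run it.
        // No real work happens, so thousands of simulated nodes are cheap.
        public void run() {
            while (!queue.isEmpty()) {
                SimEvent e = queue.poll();
                virtualClock = e.timestamp();
                e.process(this);
            }
        }
    }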
Project Status
- Work in progress:
  - First-cut version committed to Hadoop 0.21
  - Basic framework
  - Simplified task execution
  - No modeling of resource utilization or contention
  - Only individual task failures; no node failures or failure correlations
  - No job dependencies or speculative execution
- The team:
  - Core devs: Arun Murthy, Anirban Dasgupta, Tamas Sarlos, Guanying Wang, Hong Tang
  - Collaborators: Dick King, Chris Douglas, Owen O'Malley
Architecture
[Architecture diagram] The Simulation Engine drives the system through an Event Queue carrying JobSubmissionEvent, HeartBeatEvent, TaskAttemptCompletionEvent, JobCompletionEvent, and Job Finalization events. A Simulated Job Tracker wraps the real JobTracker and Scheduler ("Sched") behind the Client Protocol and InterTracker Protocol; Simulated Job Clients and Simulated Task Trackers stand in for the real Job Clients and Task Trackers. Rumen supplies the Cluster Story and the Job Story Trace, served through a Job Story Cache. A sketch of the heartbeat-driven event flow follows.
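To connect the diagram's pieces, here is a hypothetical fragment building on the SimpleSimulationEngine sketch above: a simulated TaskTracker emits periodic HeartBeatEvents in virtual time. In Mumak, processing such an event is where the InterTracker protocol call into the unmodified JobTracker would happen; all names below are illustrative.

    // Illustrative only: a simulated TaskTracker heartbeating at a fixed
    // virtual interval, reusing the SimpleSimulationEngine sketch above.
    public class SimulatedTaskTrackerSketch {
        private final SimpleSimulationEngine engine;
        private final long heartbeatIntervalMs;

        public SimulatedTaskTrackerSketch(SimpleSimulationEngine engine,
                                          long heartbeatIntervalMs) {
            this.engine = engine;
            this.heartbeatIntervalMs = heartbeatIntervalMs;
        }

        public void start(long firstHeartbeatTime) {
            engine.submit(heartbeatAt(firstHeartbeatTime));
        }

        private SimpleSimulationEngine.SimEvent heartbeatAt(final long when) {
            return new SimpleSimulationEngine.SimEvent() {
                public long timestamp() { return when; }
                public void process(SimpleSimulationEngine e) {
                    // Here the heartbeat into the (unmodified) JobTracker
                    // would run; launched tasks would enqueue future
                    // TaskAttemptCompletionEvents. Then schedule the next beat.
                    e.submit(heartbeatAt(when + heartbeatIntervalMs));
                }
            };
        }
    }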
DEMO
- Build hadoop-mapreduce:
    % ant package
- Run with the checked-in traces:
    % cd build/hadoop-0.22.0-dev
    % contrib/mumak/bin/mumak.sh \
        src/contrib/mumak/src/test/data/19-jobs.trace.json.gz \
        src/contrib/mumak/src/test/data/19-jobs.topology.json.gz
Implementation Experience
- The JobTracker is reasonably modular and amenable to a simulated environment:
  - RPC, Clock, and DNS-to-switch mapping are all interfaces
  - No sleep() in the main JobTracker code
  - Usage of threads is localized and easy to factor out
  - Asynchronous job initialization is made synchronous via AspectJ (see the sketch after this list)
- Inheritance is necessary to extend or alter behavior:
  - Subclass JobTracker, JobInProgress, LaunchTaskAction, TaskTrackerStatus
  - Subclasses convey extra information: virtual time, task execution time, etc.
  - Keeping up with changes to the base classes can be hard (example: a new variable added to JobTracker)
  - Make the dependency between map and reduce tasks explicit
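A hedged sketch of the AspectJ trick mentioned above, in annotation style (plain Java syntax). The pointcut is hypothetical; Mumak's real aspects target the actual Hadoop methods. The idea is to intercept the hand-off of job initialization to a background thread and run it inline, keeping the single simulation thread deterministic.

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    // Hypothetical aspect: turn an asynchronous init hand-off into a
    // synchronous call (the real pointcut would name Hadoop's classes).
    @Aspect
    public class SynchronousJobInitSketch {
        @Around("call(* java.util.concurrent.ExecutorService.submit(java.lang.Runnable))"
                + " && args(initTask)")
        public Object runInline(ProceedingJoinPoint jp, Runnable initTask) {
            initTask.run();  // initialize on the caller's thread instead of queueing
            return null;     // this sketch ignores the Future the caller expected
        }
    }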
Mumak as a System Behavior Verifier
Mumak as a JobTracker Debugger
- MAPREDUCE-995: "JobHistory should handle cases where task completion events are generated after job completion event"
  - Discovered when testing the Mumak patch for the 0.21 submission
  - Introduced by MAPREDUCE-157, committed one day earlier
  - Manifested as a JobTracker crash due to an IOException
- Root cause analysis:
  - The developer made a wrong assumption about the timing of events: that once a job is marked as finished, no more heartbeat events related to that job would follow
  - This led to a Closeable object being used after it was closed (see the sketch after this list)
- To reproduce through benchmarking, one would need to inject a failed job and hit the "good" timing where an outstanding task completes after the job is marked as failed
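The bug pattern is easy to reconstruct in miniature. The sketch below is a simplified, hypothetical stand-in for the JobHistory code, not the actual Hadoop source: the history writer is closed when the job finishes, and a late task-completion event then writes to the closed stream.

    import java.io.IOException;
    import java.io.Writer;

    // Simplified stand-in for the MAPREDUCE-995 pattern (not the real code).
    class JobHistorySketch {
        private final Writer out;
        private boolean closed = false;

        JobHistorySketch(Writer out) { this.out = out; }

        void logJobFinished() throws IOException {
            out.close();        // job marked finished: history file closed
            closed = true;
        }

        void logTaskCompletion(String taskId) throws IOException {
            // Original assumption: this can never run after logJobFinished().
            // A late heartbeat violates that and writes to a closed stream,
            // throwing IOException. A guard like the one below drops the
            // late event instead of crashing:
            if (closed) return;
            out.write("task " + taskId + " completed\n");
        }
    }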
Mumak as a JobTracker Profiling Benchmark
- Memory allocation pattern is similar to the real JobTracker's, but at a much faster rate
- Mumak's own overhead is less than 20-30%
- Limitation: cannot detect synchronization hotspots or sub-optimal I/O or network operations
- Findings through YourKit profiling:
  - Wasteful String concatenations in Log.debug() statements in mapred.ResourceEstimator.getEstimatedTotalMapOutputSize (see the sketch after this list)
  - Repetitive parsing of TaskTracker names to extract hostnames
  - Unnecessary exceptions from counter localization due to a removed properties file (regression introduced by H-5717)
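The first finding is the classic logging pitfall. The sketch below is a simplified illustration, not the actual ResourceEstimator code: the debug message is concatenated on every call even when debug logging is off, so guarding with isDebugEnabled() removes the allocations.

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    // Simplified illustration of the profiling finding (not the real code).
    class ResourceEstimatorSketch {
        private static final Log LOG =
            LogFactory.getLog(ResourceEstimatorSketch.class);

        long getEstimatedTotalMapOutputSize(long inputSize, double ratio) {
            long estimate = (long) (inputSize * ratio);
            // Wasteful: the String is built even when debug is disabled.
            //   LOG.debug("estimated map output = " + estimate);
            // Cheap: concatenate only when the message will actually be logged.
            if (LOG.isDebugEnabled()) {
                LOG.debug("estimated map output = " + estimate);
            }
            return estimate;
        }
    }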
Conclusions
- Mumak: a lightweight, versatile tool for MapReduce verification and debugging:
  - Verifies overall system behavior
  - A debugger for the JobTracker / scheduler
  - A micro-benchmark to stress CPU and memory allocation
- Strengths:
  - Easy to set up and run
  - Faster than running a real benchmark: about 1 minute to simulate roughly 2 hours on a 2000-node cluster
  - Realistically reproduces conditions and exercises the actual code
  - Can easily generate variants of the ordering of distributed events
- Limitations:
  - No simulation of system services or threads
  - Cannot debug synchronization problems among threads
  - Cannot reproduce OS-induced failures
What Next?
- Simulate more conditions:
  - Speculative execution
  - Resource contention
  - Node failures
  - Job dependencies
- Debug issues that do not result in hard-stop failures:
  - Fairness violations, starvation, utilization problems
- Patch validation with before/after comparison:
  - Make sure a patch does what it is supposed to do and does not introduce negative side effects
- Use Mumak to stage unit tests:
  - Construct test cases by building synthetic job stories
QUESTIONS?