DIOS - compilers



My DIOS presentation for compilers. This is meant more for a compiler-oriented audience.

Published in: Technology, Education
  • Our project is about how to schedule jobs among a group of machines. Our implementation runs at user level, but the same idea could be applied in the kernel of a distributed operating system. Long-running, short-running, memory-intensive, CPU-bound… we don’t know what kinds of jobs to expect. So how can the scheduler put them where they belong if it doesn’t know these things? Transition: Wouldn’t it be nice if the scheduler could just “handle it”, without the user having to specify the characteristics of their jobs in advance?
  • Our approach to this problem is DIOS, an adaptive distributed scheduler. Describe diagram: local schedulers (Hares) run on each machine, each with a queue of jobs. The global scheduler (Rhino) receives events from the Hares and sends down actions, such as migrate or pause. Transition: So you must be thinking… wait, how are you going to just “gather application-specific info”?
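The Hare/Rhino event-and-action loop described in this note can be sketched as follows. All class and method names here are invented for illustration (they are not the actual DIOS code); the 10% memory threshold and the youngest-job policy come from the evaluation slide later in the deck.

```python
# Illustrative sketch of the Rhino/Hare protocol: Hares report events,
# Rhino answers with actions such as migrate. Names are hypothetical.

class Hare:
    """Local scheduler: runs on one machine, holds a queue of jobs."""
    def __init__(self, host):
        self.host = host
        self.queue = []                  # jobs queued on this machine

    def detect_event(self, free_memory_fraction):
        # Report memory pressure to the global scheduler (Rhino).
        if free_memory_fraction < 0.10:
            return ("memory_pressure", self.host)
        return None

class Rhino:
    """Global scheduler: receives events from Hares, sends back actions."""
    def __init__(self, hares):
        self.hares = {h.host: h for h in hares}

    def handle(self, event):
        kind, host = event
        if kind == "memory_pressure":
            # Simple policy from the talk: move the youngest job
            # (most recently queued) off the overloaded machine.
            src = self.hares[host]
            job = src.queue.pop()
            dst = min(self.hares.values(), key=lambda h: len(h.queue))
            dst.queue.append(job)
            return ("migrate", job, host, dst.host)
        return None

hares = [Hare("node1"), Hare("node2")]
hares[0].queue = ["job-a", "job-b", "job-c"]
rhino = Rhino(hares)
event = hares[0].detect_event(free_memory_fraction=0.05)
action = rhino.handle(event)
print(action)  # ('migrate', 'job-c', 'node1', 'node2')
```

The real system would of course run Hares and Rhino on separate machines with real messaging; this only shows the shape of the protocol.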
  • The answer is: we’ll write a tool with Pin, a dynamic instrumentation framework. Describe diagram: as you can see from the diagram, and from this command up here, Pin is something like a miniature virtual machine. It takes in a pintool and the program binary and runs the program in the context of Pin, inserting new code into the application as it runs, using the tool as the instructions for what code to execute and where to insert it. For example, a pintool that counts the number of instructions executed in a program could insert code to increment a variable before every instruction. There are several points at which instrumentation can be introduced; our pintool uses routine-level and instruction-level instrumentation.
  • So we’ve established that Pin is a tool for what we want to do: dynamically instrument applications. But what code do we want to insert? What are we looking to get from our pintool? Since we are trying to detect and avoid memory contention between processes, it makes sense to study the memory behavior of the applications. To this end, we chose three things (describe them). The figure to the side shows how the pintool fits into our overall plan: it collects information for each application and reports the results to Hare, the local scheduler. Then Hare, which also monitors the memory subsystem of the local machine, reports to Rhino, and Rhino decides what to do.
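The three metrics named on the “What it measures” slide (malloc/free ratio, wall-clock time per 10,000,000 instructions, memory ops in the last 2,000,000 instructions) could be maintained along these lines. The callback structure is invented for illustration; the real pintool drives this from Pin’s routine- and instruction-level instrumentation points.

```python
import time
from collections import deque

# Sketch of the three memory metrics from the slide. A real pintool
# would call on_routine / on_instruction from inserted analysis code
# and call report() every REPORT_EVERY instructions.

class MemoryMetrics:
    REPORT_EVERY = 10_000_000   # insns per wall-clock timing sample
    WINDOW = 2_000_000          # insns in the memory-op window

    def __init__(self):
        self.mallocs = 0
        self.frees = 0
        self.last_report = time.perf_counter()
        # Per-instruction flags: 1 = memory op, 0 = other. The bounded
        # deque keeps only the most recent WINDOW instructions.
        self.window = deque(maxlen=self.WINDOW)

    def on_routine(self, name):
        # Routine-level instrumentation: count malloc/free calls.
        if name == "malloc":
            self.mallocs += 1
        elif name == "free":
            self.frees += 1

    def on_instruction(self, is_memory_op):
        # Instruction-level instrumentation.
        self.window.append(1 if is_memory_op else 0)

    def report(self):
        now = time.perf_counter()
        elapsed, self.last_report = now - self.last_report, now
        ratio = self.mallocs / self.frees if self.frees else float("inf")
        return {
            "malloc_free_ratio": ratio,
            "seconds_since_last_report": elapsed,
            "mem_ops_in_window": sum(self.window),
        }

m = MemoryMetrics()
for name in ("malloc", "malloc", "free"):
    m.on_routine(name)
for i in range(10):
    m.on_instruction(is_memory_op=(i % 2 == 0))
print(m.report()["malloc_free_ratio"])  # 2.0
```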
  • Considering our motivation, it was important to evaluate DIOS on a somewhat realistic workload. Since most long-running jobs on clusters appear to be scientific applications, we wanted to use real scientific benchmarks. Describe benchmarks. To evaluate the scheduler, we measured the total runtime of groups of 100 jobs. We varied the parameters to the heatedplate program (dataset size and number of iterations) in order to vary the length of the jobs, producing a set of jobs on a curve: a great many short-running jobs with a few long-running jobs. Past work indicates that this is a common job submission pattern in batch systems. Then, to evaluate our pintool, we measured the overhead of running each application under it, and also tracked the collected information over time to see if we could correlate it with interesting behavior or differences between programs.
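The heavy-tailed job mix described here (many short jobs, a few long ones) could be generated along the following lines. The heatedplate parameter names and ranges below are placeholders, not the values used in the actual evaluation.

```python
import random

# Sketch of the evaluation workload: 100 heatedplate jobs whose
# parameters follow a heavy-tailed curve (many short, few long).
# Parameter names and scaling are invented for illustration.

def make_jobs(n=100, seed=42):
    rng = random.Random(seed)
    jobs = []
    for _ in range(n):
        # An exponential draw gives the "many short, few long" shape
        # that past work reports for batch-system submissions.
        scale = rng.expovariate(1.0)             # usually < 1, rarely large
        jobs.append({
            "grid_size": 100 + int(200 * scale),      # placeholder dataset size
            "iterations": 1000 + int(5000 * scale),   # placeholder iteration count
        })
    return jobs

jobs = make_jobs()
short = sum(1 for j in jobs if j["iterations"] < 6000)
print(len(jobs), short)  # the short-running jobs dominate the batch
```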
  • So here are our results from evaluating the distributed scheduler by itself. The good news is we saw potential for improvement: just from using a simple policy that reacts to the presence of memory contention, the total runtime goes down. We might be able to get even better results on long-running jobs with better information on the running processes (like we could get from dynamic instrumentation!). So if you’re wondering why we’re showing you results for our scheduler with this simple policy, but not for our whole system including application-specific information… well, that brings me to The Bad.
  • Although our scheduler works perfectly well with the pintool, we discovered that the overhead introduced by Pin is simply too high. Some of our overhead results are below: we show the time to run each application natively, under Pin with no pintool, with a tool that only counts instructions, and with each of our three metrics. Our original plan for the overhead problem was to instrument only when we needed to, such as when the scheduler decided the machine was performing badly; then the relatively high cost of the analysis would have little impact overall. However, we were unable to get the performance gains we hoped for: Pin doesn’t offer the ability to completely attach to and detach from a running program, only to attach, and when we tried to add and remove instrumentation dynamically we lost the gains from code caching. So while this idea could work with another system or with a future version of Pin, we couldn’t bring the overhead down.
  • But on the bright side, we were able to collect some interesting information. This figure shows the variation over time of our memory instruction measurements: the change in the number of memory instructions executed per window, hence the negative numbers. Note how similar the patterns of LU and heatedplate are; talk about how that is probably because they are tightly looped and very repetitive, whereas Ocean is clearly performing a more irregular and complex analysis with some distinct phases. Mention the possibility of using the variation in a metric like this to “predict the predictability”: to separate applications that are better left alone from those that are more likely to be safely handled by common heuristics.
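The “predict the predictability” idea can be made concrete: difference the per-window memory-op counts (which is why the figure shows negative values), then use the variance of those differences as a regularity score. The sample series below are invented for illustration, not measured data from the talk.

```python
from statistics import pvariance

# Sketch: change in per-window memory-op counts, plus a simple
# "regularity" score from the variance of those changes.

def window_deltas(counts):
    # Difference of consecutive window counts; a drop in memory
    # activity between windows produces a negative value.
    return [b - a for a, b in zip(counts, counts[1:])]

def regularity_score(counts):
    # Low variance of the deltas suggests steady, loop-like behavior
    # (LU, heatedplate); high variance suggests distinct phases (Ocean).
    return pvariance(window_deltas(counts))

tight_loop = [100, 102, 99, 101, 100, 102]   # repetitive kernel (made up)
phased     = [100, 180, 40, 160, 30, 170]    # irregular phases (made up)

print(regularity_score(tight_loop) < regularity_score(phased))  # True
```

A scheduler could then leave high-variance applications alone and apply common heuristics only to the low-variance ones.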
  • So – the future of DIOS.
  • Questions?
  • Kind of...but no comprehensive solution.

    1. DIOS: Dynamic Instrumentation for (not so) Outstanding Scheduling (Blake Sutton & Chris Sosa)
    2. Motivation
    3. Approach: Adaptive Distributed Scheduler
       • Centralized global scheduler and distributed local services
       • Hares monitor machines for “undesirable” events
       • Hares also gather application-specific info with Pin
       • Rhino schedules jobs and responds to events from Hares
         ◦ Migrate
         ◦ Pause / Resume
         ◦ Kill / Restart
    4. “Pinvolvement”: What it is
       • Insert new code into apps on the fly
         ◦ No recompile
         ◦ Operates on a copy
         ◦ Code caching
       • Our Pintool
         ◦ Routine-level
         ◦ Instruction-level
       pin -t mytool -- ./myprogram
       (Diagram borrowed from Luk et al. 2005.)
    5. “Pinvolvement”: What it measures
       • No reliance on hardware-specific performance counters
       • Want to capture memory behavior over time
       • Gathered:
         ◦ Ratio of malloc to free calls
         ◦ Wall-clock time to execute 10,000,000 insns
         ◦ Number of memory ops in last 2,000,000 insns
    6. Evaluation
       • Distributed scheduler
         ◦ Rhino on realitytv13, Hares on realitytv13-16
         ◦ heatedplate with modified parameters
         ◦ Hares detect when less than 10% of memory is available and inform Rhino to take action
         ◦ Rhino reschedules the youngest job at the Hare’s site
         ◦ Baseline: Smallest Queues
       • Pintool
         ◦ 2 applications from SPLASH-2
         ◦ heatedplate
    7. Results: The Good
       • Scheduler shows potential for improvement
       • Lower total runtime with simple policy
    8. Results: The Bad
       • Overhead from Pintool is too high to realize gains
         ◦ Pin isn’t designed for on-the-fly analysis
         ◦ Could not reattach
         ◦ Code caching isn’t enough

       Slowdown relative to native execution (native = 1.00):

       application   native   pin    count only   malloc/free   # mems   latency
       heatedplate   1.00     1.88   2.65         5.43          7.45     7.26
       ocean         1.00     1.48   2.87         7.84          6.04     5.81
       lu            1.00     1.25   6.27         14.51         7.90     7.64
    9. Results: The “Interesting”
       • Pintool does capture intriguing info…
    10. Other Issues
       • Condor
         ◦ Process migration requires re-linking
         ◦ Doesn’t support multithreaded applications
         ◦ Other “user-level” process migration mechanisms have similar requirements
       • Pin
         ◦ Unable to intersperse low- and high-overhead periods with a Pintool
         ◦ Even the smallest overhead was not negligible
         ◦ Almost 2x slowdown just running heatedplate under Pin with no extra instrumentation
       • Scheduling decisions have a bigger impact for long-running jobs
    11. Conclusion: The Future of DIOS
       • Overhead is prohibitive (for now)
         ◦ Pin needs to support reattach
         ◦ A lighter instrumentation framework could help
       • However, instrumentation can capture aspects of application-specific behavior
       • Future Work
         ◦ Pin as a process migration mechanism
    12. Questions?
    13. Wait… hasn’t this been solved?
       • Condor
         ◦ Popular user-space distributed scheduler
         ◦ Process migration
         ◦ Tries to keep queues balanced
           ▪ but jobs differ in behavior, both over time and from each other
       • LSF (Load Sharing Facility)
         ◦ Monitors the system and moves processes around based on what they need
         ◦ Requires static job information as input (profiling etc. beforehand)
           ▪ What if something about your job isn’t captured by your input?
           ▪ What if the margins you give it are too large? Too small?
           ▪ Unnecessary inefficiencies; it’s not exactly hassle-free
       • Hardware feedback
         ◦ PAPI
         ◦ Still not very portable (invasive kernel patch required for install)
       • Wouldn’t it be nice if the scheduler could just… “do the right thing”?