HARE 2010 Review
Presentation given at USENIX FastOS workshop reviewing HARE work done in 2010.



  1. Holistic Aggregate Resource Environment: Execution Model
     FastOS USENIX 2010 Workshop
     Eric Van Hensbergen (bergevan@us.ibm.com)
     http://hare.fastos2.org
  2. Research Objectives
     • Look at ways of scaling general-purpose operating systems and runtimes to leadership-class supercomputers (thousands to millions of cores)
     • Alternative approaches to systems software support, runtime, and communications subsystems
     • Exploration built on top of the Plan 9 distributed operating system due to its portability, built-in facilities for distributed systems, and flexible communication model
     • Plan 9 support for BG/P and the HARE runtime are open-sourced and available via http://wiki.bg.anl-external.org
     • Public profile available on the ANL Surveyor BG/P machine; should be usable by anyone
  3. Roadmap (Years 0-3: Hardware Support; Systems Infrastructure; Evaluation, Scaling, & Tuning)
     Year 2 Accomplishments:
     • Improved tracing infrastructure
     • Currying framework
     • Scaling infrastructure to 1000 nodes
     • Execution model
     • Plan 9 for Blue Gene/P open sourced
     • Kittyhawk open sourced
     • Kittyhawk and Plan 9 installed at ANL on Surveyor
     • Default profiles
  4. New Publications (since Supercomputing 2009)
     • Using Currying and process-private system calls to break the one-microsecond system call barrier. Ronald G. Minnich, John Floren, Jim McKie; 2009 International Workshop on Plan 9.
     • Measuring kernel throughput on Blue Gene/P with the Plan 9 research operating system. Ronald G. Minnich, John Floren, Aki Nyrhinen; 2009 International Workshop on Plan 9.
     • XCPU3. Pravin Shinde, Eric Van Hensbergen; EuroSys 2010.
     • PUSH, a Dataflow Shell. N. Evans, E. Van Hensbergen; EuroSys 2010.
  5. Ongoing Work
     • File system and cache studies
       • simple cachefs deployable on I/O nodes and compute nodes
       • experiments with direct-attached storage using CORAID
     • MPI support (ROMPI)
     • Enhanced allocator
       • lower-overhead allocator
       • working towards an easier approach to multiple page sizes
       • working towards schemes capable of supporting hybrid communication models
     • Scaling beyond 1000 nodes (runs on Intrepid at ANL)
     • Application and runtime integration
  6. Execution Model
  7. Core Concept: BRASIL (Basic Resource Aggregate System Inferno Layer)
     • Stripped-down Inferno: no GUI or anything we can live without; minimal footprint
     • Runs as a daemon (no console); all interaction via 9P mounts of its namespace
     • Different modes:
       • default (exports /srv/brasil or on tcp!!5670)
       • gateway (exports over standard I/O; to be used by ssh initialization)
       • terminal (initiates an ssh connection and starts a gateway)
     • Runs EVERYWHERE:
       • user's workstation
       • Surveyor login nodes
       • I/O nodes
       • compute nodes
  8. nompirun: legacy-friendly job launch
     • User initiates execution from the login node using the nompirun script (e.g. nompirun -n 64 ronsapp)
     • Setup/boot/exec:
       • script submits the job using Cobalt
       • when the I/O node boots, it connects to the user's brasild via 9P over Ethernet
       • when CPU nodes boot, they connect to the I/O node via 9P over the collective network
     • After boilerplate initialization, $HOME/lib/profile is run on every node for additional setup, namespace initialization, and environment setup
     • The user-specified application runs with the specified arguments on all compute nodes; the application (and supporting data and configuration) can come from the user's home directory on the login nodes or from any available file server in the namespace
     • Standard-I/O output from all compute nodes is aggregated at the I/O nodes and sent over the miniciod channel (thanks to some sample code from the ZeptoOS team) to the service nodes for standard reporting
     • Nodes boot and application execution begins in under 2 minutes
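The aggregation step described above (per-node stdout merged at the I/O node into one report stream) can be sketched roughly as follows. This is an illustrative Python mock, not the miniciod wire format; the function name and node-tag layout are assumptions.

```python
def aggregate_stdio(node_outputs):
    """Merge per-node standard output into one tagged report stream.

    node_outputs maps a node name to the raw stdout text it produced;
    the result is one stream with each line prefixed by its origin node.
    """
    lines = []
    for node in sorted(node_outputs):            # deterministic node ordering
        for line in node_outputs[node].splitlines():
            lines.append("%s: %s" % (node, line))
    return "\n".join(lines)
```

A call like `aggregate_stdio({"cpu-0": "hello", "cpu-1": "world"})` yields a single stream the service node can report as ordinary job output.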
  9. Our Approach: Workload Optimized Distribution
     [Figure: usage models built from local, proxy, and aggregate services: Desktop Extension (remote services); PUSH Pipeline Model; Aggregation via Dynamic Namespace; Scaling and Reliability: Distributed Service Model]
  10. Core Component: Multipipes & Filters
      UNIX model:  a | b | c
      PUSH model:  a |< b >| c
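The |< and >| operators above fan a record stream out to parallel copies of a filter and merge the results back. A toy Python sketch of that dataflow, assuming round-robin distribution (the real PUSH shell distributes records over multipipes; the function names here are illustrative):

```python
def fanout(records, n):
    """|< : deal records round-robin to n parallel instances of a filter."""
    buckets = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        buckets[i % n].append(rec)
    return buckets

def fanin(buckets):
    """>| : merge the per-instance output streams back into one stream."""
    merged = []
    for bucket in buckets:
        merged.extend(bucket)
    return merged

def push_pipeline(records, filt, n):
    """a |< filt >| c, with n copies of filt running over the multipipe."""
    return fanin([[filt(r) for r in bucket] for bucket in fanout(records, n)])
```

For example, `push_pipeline(["a", "b", "c"], str.upper, 2)` runs two copies of the filter over the split stream and merges their outputs.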
  11. Preferred Embodiment: BRASIL Desktop Extension Model (workstation, login node, I/O and CPU nodes linked by an ssh duct)
      • Setup:
        • user starts brasild on the workstation
        • brasild ssh's to the login node and starts another brasil, hooking the two together with 27b-6 and mounting resources in /csrv
        • user mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p, or ACME-sac)
      • Boot:
        • user runs the anl/run script on the workstation
        • script interacts with taskfs on the login node to start a Cobalt qsub
        • when an I/O node boots, it connects its csrv to the login node's csrv
        • when CPU nodes boot, they connect to the csrv on their I/O node
      • Task execution:
        • user runs the anl/exec script on the workstation to run the app
        • script reserves x nodes for the app using taskfs
        • taskfs on the workstation aggregates execution by using the taskfs instances running on the I/O nodes
  12. Core Concept: Central Services
      • Establish hierarchical namespace of cluster services in /csrv
      • Automount remote servers based on reference (e.g. cd criswell)
      [Figure: example /csrv hierarchy linking a terminal (t), login nodes (l1, l2), I/O nodes (I1, I2), and compute nodes (c1-c4), each exporting /local]
  13. Core Concept: Taskfs
      • Provide an xcpu2-like interface for starting tasks on a node
      • Hybrid model for multitask (aggregate ctl & I/O as well as granular)
      Exported by each csrv node:
        /local
          /fs     - local (host) file system
          /net    - local network interfaces
          /brasil - local (brasil) namespace
          /arch   - architecture and platform
          /status - status (load/jobs/etc.)
          /env    - default environment for host
          /ns     - default namespace for host
          /clone  - establish a new task
          /#      - task sessions
      Per task session:
        /0
          /ctl /status /args /env /stdin /stdout /stderr /stdio /ns /wait
          /#  - component session(s)
            /ctl ...
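The clone/ctl idiom above (read /clone to allocate a session, then drive it through its per-session files) can be mimicked with a small in-memory mock. This Python class is purely illustrative: the real taskfs is a 9P synthetic file system, and the exact ctl commands and status strings here are assumptions.

```python
class Taskfs:
    """Toy in-memory stand-in for the taskfs synthetic file tree."""

    def __init__(self):
        self.sessions = {}

    def read(self, path):
        if path == "/clone":                     # reading clone allocates a session
            sid = str(len(self.sessions))
            self.sessions[sid] = {"status": "created", "args": ""}
            return sid
        sid, name = path.strip("/").split("/", 1)
        return self.sessions[sid][name]

    def write(self, path, data):
        sid, name = path.strip("/").split("/", 1)
        session = self.sessions[sid]
        if name == "args":                       # stash the command arguments
            session["args"] = data
        elif name == "ctl" and data == "exec":   # ctl writes drive the task
            session["status"] = "running"
```

A client would then do: `sid = fs.read("/clone")`, write the arguments to `/{sid}/args`, write `exec` to `/{sid}/ctl`, and poll `/{sid}/status`.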
  14. Evaluations: Deployment and Aggregation Time
      [Figure: XCPU3 deployment and aggregation time measurements; chart labels not recoverable]
  15. What's Still Missing from the Execution Model?
      • File system back-mounts are still being developed
        • can work around this by mounting the login node or the user's workstation to a known place no matter where you are in the system
        • once we have back-mounts, we'll need a way to reach the user's desired file system no matter where in the csrv topology we are ($MNTTERM)
      • Taskfs scheduling model is still top-down; needs to be able to propagate back up to allow efficient scheduling from leaf nodes
      • Performance:
        • reworking workload distribution to go bottom-up to improve scalability and lower per-task overhead
        • Plan 9 native version of the task model to improve performance
  16. New Model Breaks Up the Implementation
      • mpipefs provides base I/O and control aggregation
      • execfs provides a layer on top of the system procfs for additional application control and for initiating remote execution; uses mpipefs for its interface to standard I/O
      • gangfs provides group process operations and aggregation as well as the core distributed scheduling interfaces; builds upon execfs and uses mpipes for ctl aggregation
      • statusfs will provide bottom-up aggregation of system status through the csrv hierarchy and feed metrics to the gangfs scheduler using mpipes
      • the csrv component provides membership management and hierarchical links between nodes, along with failure detection, avoidance, and recovery
  17. Future Work: Generalized Multipipe System Component
      • Challenges:
        • record separation for large data sets
        • determinism for hash distributions
        • support for multiple models
      • Our approach:
        • single synthetic file per multipipe; configuration specified during pipe creation and initial write
        • readers and writers tracked and isolated
        • "multipipe" mode uses headers for data flowing over pipes
          • provides record separation via size prefix
          • can be used by filters to specify a deterministic destination, or to allow for type-specific destinations
          • can also carry control messages in header blocks to control splicing
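The size-prefix framing above can be sketched as follows. The header layout (4-byte big-endian length plus a 2-byte destination field) is an assumption chosen for illustration, not the actual multipipe wire format.

```python
import struct

def pack_record(payload, dest=0):
    """Frame one record: 4-byte length, 2-byte destination, then payload."""
    return struct.pack(">IH", len(payload), dest) + payload

def unpack_records(stream):
    """Split a framed byte stream back into (dest, payload) records."""
    records, off = [], 0
    header = struct.calcsize(">IH")              # 6 bytes per header
    while off < len(stream):
        size, dest = struct.unpack_from(">IH", stream, off)
        off += header
        records.append((dest, stream[off:off + size]))
        off += size
    return records
```

Because every record carries its own length, readers can recover record boundaries from a concatenated stream, and the destination field is what lets a filter steer a record deterministically to one downstream consumer.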
  18. Future Work on the Execution Model
      • Caches will be necessary for the desktop extension model to perform well
      • Linux target support (using private namespaces and back-mounts within taskfs execs)
      • Attribute-based file system queries/views and operations
        • probably best implemented as a secondary file system layer on top of central services
      • Language bindings for taskfs interactions (C, C++, Python, etc.)
      • Plug-in scheduling policies
      • Failure and reliability model
  19. Questions?
      • This work has been supported in part by the Department of Energy Office of Science Operating and Runtime Systems for Extreme Scale Scientific Computation project under contract #DE-FG02-
      • More info & publications: http://hare.fastos2.org
  20. nompirun(8)
      NAME
           nompirun - wrapper script for running Plan 9 on BG/P
      SYNOPSIS
           nompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ]
                    [ -e key=value ]... [ -n num_nodes ] [ -t time ]
                    [ -k kernel_profile ] [ -r root_path ] cmd args...
           wrapnompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ]
                    [ -e key=value ]... [ -n num_cpu_nodes ] [ -t time ]
                    [ -k kernel_profile ] [ -r root_path ] cmd args...
  21. Core Concept: Ducts
      • Ducts are bidirectional 9P connections
      • They can be instantiated over any pipe (e.g. a TCP/IP connection)
      [Diagram: export and mount on each side of an ssh or tcp/ip transport]
  22. Core Concept: 27b-6 Ducts
      • Just like ducts
      • Before export/mount, each side writes its size-prefixed canonical name
      [Diagram: export and mount on each side of an ssh or tcp/ip transport]
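The 27b-6 handshake above (each side announces its size-prefixed canonical name before the 9P export/mount begins) can be sketched like this. The one-byte length prefix is an assumption for illustration; the in-memory buffer merely stands in for one direction of the duct.

```python
import io

def send_name(w, name):
    """Write this side's canonical name, length-prefixed, down the duct."""
    data = name.encode()
    w.write(bytes([len(data)]) + data)

def recv_name(r):
    """Read the peer's length-prefixed canonical name from the duct."""
    n = r.read(1)[0]
    return r.read(n).decode()

# Simulate one direction of the duct with an in-memory pipe.
pipe = io.BytesIO()
send_name(pipe, "criswell")     # e.g. a compute node announcing itself
pipe.seek(0)
peer = recv_name(pipe)
```

Exchanging names in both directions before export/mount is what lets each end place its peer correctly in the /csrv hierarchy without further negotiation.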