Holistic Aggregate Resource Environment
Execution Model

FastOS USENIX 2010 Workshop
Eric Van Hensbergen (bergevan@us.ibm.com)
http://hare.fastos2.org
Research Objectives

• Look at ways of scaling general purpose operating systems and runtimes to leadership-class supercomputers (thousands to millions of cores)
• Alternative approaches to systems software support, runtime and communications subsystems
• Exploration built on top of the Plan 9 distributed operating system due to its portability, built-in facilities for distributed systems and flexible communication model
• Plan 9 support for BG/P and the HARE runtime open-sourced and available via: http://wiki.bg.anl-external.org
• Public profile available on the ANL Surveyor BG/P machine; should be usable by anyone
Roadmap

[Roadmap chart: Hardware Support, Systems Infrastructure, and Evaluation, Scaling, & Tuning tracks plotted across project years 0-3]

Year 2 Accomplishments
• Improved tracing infrastructure
• Currying Framework
• Scaling infrastructure to 1000 nodes
• Execution model
• Plan 9 for Blue Gene/P open sourced
• Kittyhawk open sourced
• Default profiles for Kittyhawk and Plan 9 installed at ANL on Surveyor
New Publications (since Supercomputing 2009)

• Using Currying and process-private system calls to break the one-microsecond system call barrier. Ronald G. Minnich, John Floren, Jim McKie; 2009 International Workshop on Plan 9.
• Measuring kernel throughput on Blue Gene/P with the Plan 9 research operating system. Ronald G. Minnich, John Floren, Aki Nyrhinen; 2009 International Workshop on Plan 9.
• XCPU3. Pravin Shinde, Eric Van Hensbergen; Eurosys 2010.
• PUSH, a Dataflow Shell. N. Evans, E. Van Hensbergen; Eurosys 2010.
Ongoing Work

• File system and Cache Studies
  • simple cachefs deployable on I/O nodes and compute nodes
  • experiments with direct-attached storage using CORAID
• MPI Support (ROMPI)
• Enhanced Allocator
  • lower-overhead allocator
  • working towards an easier approach to multiple page sizes
  • working towards schemes capable of supporting hybrid communication models
• Scaling beyond 1000 nodes (runs on Intrepid at ANL)
• Application and Runtime Integration
Execution Model




Core Concept: BRASIL
Basic Resource Aggregate System Inferno Layer

• Stripped-down Inferno - no GUI or anything we can live without; minimal footprint
• Runs as a daemon (no console); all interaction via 9P mounts of its namespace
• Different modes (a usage sketch follows this slide)
  • default (exports /srv/brasil or on tcp!127.0.0.1!5670)
  • gateway (exports over standard I/O - to be used by ssh initialization)
  • terminal (initiates an ssh connection and starts a gateway)
• Runs EVERYWHERE
  • User's workstation
  • Surveyor login nodes
  • I/O nodes
  • Compute nodes
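A minimal usage sketch for the default mode, assuming a Linux workstation. The brasild daemon and the export points /srv/brasil and tcp!127.0.0.1!5670 come from the slides; the mount point and the exact 9pfuse/v9fs invocations are illustrative assumptions, not from the deck.

    # start the daemon on the workstation (default mode, no console)
    brasild &

    # attach its namespace with 9pfuse (plan9port) ...
    9pfuse 'tcp!127.0.0.1!5670' $HOME/brasil

    # ... or with the Linux kernel v9fs client
    mount -t 9p 127.0.0.1 $HOME/brasil -o trans=tcp,port=5670

    ls $HOME/brasil    # browse the exported namespace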
nompirun: legacy friendly job launch

• user initiates execution from the login node using the nompirun script
  • e.g. nompirun -n 64 ronsapp
• setup/boot/exec
  • script submits the job using cobalt
  • when the I/O node boots it connects to the user's brasild via 9P over Ethernet
  • when CPU nodes boot they connect to the I/O node via 9P over the Collective network
  • after boilerplate initialization, $HOME/lib/profile is run on every node for additional setup, namespace initialization, and environment setup
  • the user-specified application runs with the specified arguments on all compute nodes; the application (and supporting data and configuration) can come from the user's home directory on the login nodes or any available file server in the namespace
  • standard I/O output from all compute nodes is aggregated at the I/O nodes and sent over the miniciod channel (thanks to some sample code from the ZeptoOS team) to the service nodes for standard reporting
• nodes boot and application execution begins in under 2 minutes
Our Approach: Workload Optimized Distribution

[Figure: two deployment diagrams built from local, proxy, aggregate, and remote services - a Desktop Extension on the left and a PUSH Pipeline Model on the right - achieving Scaling and Reliability through Aggregation via Dynamic Namespace and a Distributed Service Model]
Core Component: Multipipes & Filters

UNIX Model
    a | b | c

PUSH Model
    a |< b >| c
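A hedged illustration of how the PUSH operators read in practice; the |< (fan-out) and >| (fan-in) operators come from the slide and the PUSH paper cited earlier, while the specific commands and any implied degree of parallelism are illustrative assumptions.

    # ordinary UNIX pipeline: a single linear stream
    find . -type f | cksum | sort

    # PUSH multipipe: |< deals records out to parallel instances of the
    # filter stage, >| merges their outputs back into one stream
    find . -type f |< cksum >| sort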
Preferred Embodiment: BRASIL Desktop Extension Model

[Figure: workstation connected to the login node by an ssh-duct, the login node to an I/O node, and the I/O node to the CPU nodes]

• Setup
  • User starts brasild on the workstation
    • brasild ssh's to the login node, starts another brasil, hooks the two together with a 27b-6 duct, and mounts resources in /csrv
  • User mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p or acme-sac)
• Boot
  • User runs the anl/run script on the workstation
    • script interacts with taskfs on the login node to start cobalt qsub
    • when an I/O node boots it connects its csrv to the login node's csrv
    • when CPU nodes boot they connect to the csrv on their I/O node
• Task Execution (see the sketch after this list)
  • User runs the anl/exec script on the workstation to run the app
    • script reserves x nodes for the app using taskfs
    • taskfs on the workstation aggregates execution by using taskfs running on the I/O nodes
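A hedged end-to-end sketch of the flow above as seen from the workstation's shell. The brasild daemon and the anl/run and anl/exec script names come from the slide; the flags, node count, and application name are assumptions.

    # setup: start brasild locally and mount it (see the earlier BRASIL sketch)
    brasild &
    9pfuse 'tcp!127.0.0.1!5670' $HOME/brasil

    # boot: submit the partition via cobalt; I/O and CPU nodes join /csrv as they come up
    anl/run -n 64        # assumed arguments

    # task execution: reserve nodes through taskfs and launch the application
    anl/exec -n 64 ./myapp    # assumed arguments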
Core Concept: Central Services

• Establish a hierarchical namespace of cluster services under /csrv/
• Automount remote servers based on reference (i.e. cd criswell); a navigation sketch follows the figure

[Figure: cluster topology with terminal t, login node L, I/O nodes l1 and l2, and compute nodes c1-c4 (c1 and c2 under l1; c3 and c4 under l2)]

The /csrv hierarchy as seen from the terminal t:

    /csrv
        /local
        /L
            /local
            /l1
                /local
                /c1
                    /local
                /c2
                    /local
            /l2
                /local
                /c3
                    /local
                /c4
                    /local

The same hierarchy as seen from compute node c3:

    /csrv
        /local
        /l2
            /local
            /c4
                /local
            /L
                /local
                /t
                    /local
                /l1
                    /local
                    /c1
                        /local
                    /c2
                        /local
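A hedged sketch of what automount-by-reference means in practice; criswell is the example hostname from the slide, and the other paths are assumptions derived from the hierarchy shown above.

    cd /csrv/criswell                   # naming a node automounts its csrv here
    ls /csrv/local                      # this node's own exported services
    cat /csrv/L/l2/c3/local/status      # reach compute node c3 through the hierarchy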
Core Concept: Taskfs

• Provide an xcpu2-like interface for starting tasks on a node
• Hybrid model for multitask (aggregate ctl & I/O as well as granular); a usage sketch follows this slide

Namespace exported by each csrv node:

    /local      - exported by each csrv node
        /fs         - local (host) file system
        /net        - local network interfaces
        /brasil     - local (brasil) namespace
        /arch       - architecture and platform
        /status     - status (load/jobs/etc.)
        /env        - default environment for host
        /ns         - default namespace for host
        /clone      - establish a new task
        /#          - task sessions

Layout of each task session:

    /0
        /ctl
        /status
        /args
        /env
        /stdin
        /stdout
        /stderr
        /stdio
        /ns
        /wait
        /#          - component session(s)
            /ctl
            ...
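A hedged sketch of driving the taskfs interface from a shell. The file names come from the layout above; the session id and the command written to ctl are assumptions (the slides do not spell out the ctl syntax).

    cd /csrv/local
    s=$(cat clone)                        # establishes a new task session, e.g. 0
    echo 'exec ./myapp arg1' > $s/ctl     # assumed ctl command syntax
    cat $s/stdout                         # standard output of the task
    cat $s/wait                           # blocks until the task completes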
Evaluations: Deployment and aggregation time

[Figure: XCPU3 deployment and aggregation time measurements]
What's Still Missing from the Execution Model?

• File system back-mounts still being developed
  • Can work around this by mounting the login node or the user's workstation at a known place no matter where you are in the system
  • When we get file system back-mounts, we'll need a way to get to the user's desired file system no matter where in the csrv topology we are ($MNTTERM)
• Taskfs scheduling model is still top-down; it needs to be able to propagate back up to allow efficient scheduling from leaf nodes
• Performance
  • Reworking workload distribution to go bottom-up to improve scalability and lower per-task overhead
  • Plan 9 native version of the task model to improve performance
New Model Breaks Up Implementation

• mpipefs provides base I/O and control aggregation
• execfs provides a layer on top of the system procfs for additional application control and for initiating remote execution, and uses mpipefs as its interface to standard I/O
• gangfs provides group process operations and aggregation as well as the core distributed scheduling interfaces; it builds upon execfs and uses mpipes for ctl aggregation
• statusfs will provide bottom-up aggregation of system status through the csrv hierarchy and feed metrics to the gangfs scheduler using mpipes
• the csrv component provides membership management and hierarchical links between nodes, and provides failure detection, avoidance and recovery
Future Work: Generalized Multipipe System Component

       • Challenges
         • Record separation for large data sets
         • Determinism for HASH distributions
         • Support for multiple models
       • Our Approach
         • Single synthetic file per multipipe, configurations specified
           during pipe creation and initial write
         • Readers and Writers tracked and isolated
         • “Multipipe” mode uses headers for data flowing over pipes
           • Provides record separation via size-prefix
           • Can be used by filters to specify deterministic destination or can be used to
             allow for type-specific destinations
           • Can also send control messages in header blocks to control splicing

Future Work on Execution Model

• Caches will be necessary for the desktop extension model to perform well
• Linux target support (using private namespaces and back-mounts within taskfs execs)
• Attribute-based file system queries/views and operations
  • Probably best implemented as a secondary file system layer on top of central services
• Language bindings for taskfs interactions (C, C++, Python, etc.)
• Plug-in scheduling policies
• Failure and Reliability Model
Questions?

       • This work has been supported in part by the Department of
         Energy Office of Science Operating and Runtime Systems for
         Extreme Scale Scientific Computation project under contract
         #DE-FG02-
       • More Info & Publications: http://hare.fastos2.org




nompirun(8)

NAME
    nompirun - wrapper script for running Plan 9 on BG/P

SYNOPSIS
    nompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ] [
    -e key=value ]... [ -n num_nodes ] [ -t time ] [ -k kernel_profile ] [
    -r root_path ] cmd args...

    wrapnompirun [ -A cobalt_account ] [ -h brasil_host ] [ -h brasil_port ]
    [ -e key=value ]... [ -n num_cpu_nodes ] [ -t time ] [ -k kernel_profile ]
    [ -r root_path ] cmd args...
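A hedged example invocation assembled from the synopsis above; the account name, node count, time, environment setting, and application are placeholders.

    nompirun -A myproject -n 64 -t 30 -e DEBUG=1 ronsapp arg1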
Core Concept: Ducts

• Ducts are bi-directional 9P connections
• They can be instantiated over any pipe
  • TCP/IP connection

[Figure: two endpoints joined by an ssh or tcp/ip pipe, each end both exporting its own namespace and mounting the peer's export]
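For orientation, a hedged contrast with stock Plan 9 commands (this is not how brasil implements ducts): exportfs and srv give a one-directional export and mount, whereas a duct multiplexes an export and a mount in both directions over a single pipe such as an ssh connection.

    # serving side: export the local namespace on a TCP port
    aux/listen1 -t 'tcp!*!5640' /bin/exportfs -r /
    # mounting side: dial it and mount the connection at /n/remote
    srv 'tcp!server!5640' remote /n/remote
    # a duct instead runs an export and a mount on each end of one pipe
    # (e.g. an ssh connection), so both sides see the other's namespace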
Core Concept: 27b-6 Ducts

• Just like Ducts
• Before export/mount, each side writes a size-prefixed canonical name

[Figure: the same mount/export pairing over an ssh or tcp/ip pipe as on the previous slide, preceded by the name exchange]
