This document presents the Holistic Aggregate Resource Environment project, a collaboration among IBM Research, Sandia National Labs, Bell Labs, and CMU. The goals of the project include leveraging aggregation as a first-class systems construct, distributing system services throughout supercomputers, and exploring native interconnect utilization. Research topics discussed include offload/acceleration models, right-weight kernels, and topologies.
1. Holistic Aggregate Resource Environment
Eric Van Hensbergen (IBM Research)
Ron Minnich (Sandia National Labs)
Jim McKie (Bell Labs)
Charles Forsyth (Vita Nuova)
David Eckhardt (CMU)
3. Research Topics
• Prerequisite: reliability and application-driven design are pervasive in all explored areas
• Offload/Acceleration Deployment Model
• The supercomputer needs to become an extension of the scientist's desktop rather than a batch-driven, non-standard run-time environment.
• Leverage aggregation as a first-class systems construct to help manage
complexity and provide a foundation for scalability, reliability, and
efficiency.
• Distribute system services throughout the machine (not just on the I/O node)
• Interconnect Abstractions & Utilization
• Leverage HPC interconnects in system services (file system, etc.)
• Sockets & TCP/IP don't map well to HPC interconnects (torus and collective) and are inefficient when the hardware provides reliability
4. Right Weight Kernel
• General purpose multi-thread, multi-user environment
• Pleasantly Portable
• Relatively Lightweight (relative to Linux)
• Core Principles
• All resources are synthetic file hierarchies
• Local & remote resources accessed via simple API
• Each thread can dynamically organize local and remote resources via a private namespace
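To make the private-namespace idea concrete, here is a minimal sketch in Plan 9 C using the standard namespace calls (rfork, dial, mount, bind); the server address and paths are hypothetical and are not taken from the project's code.

#include <u.h>
#include <libc.h>

void
main(void)
{
    int fd;

    /* give this process a private copy of the namespace */
    if(rfork(RFNAMEG) < 0)
        sysfatal("rfork: %r");

    /* attach a remote 9P file server (hypothetical address) at /n/remote */
    fd = dial("tcp!fileserver!564", nil, nil, nil);
    if(fd < 0)
        sysfatal("dial: %r");
    if(mount(fd, -1, "/n/remote", MREPL, "") < 0)
        sysfatal("mount: %r");

    /* overlay the remote data directory on a local path */
    if(bind("/n/remote/data", "/data", MREPL) < 0)
        sysfatal("bind: %r");

    /* ordinary file I/O now reaches the remote resource */
    fd = open("/data/input", OREAD);
    if(fd < 0)
        sysfatal("open: %r");
    exits(nil);
}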
5. Aggregation
• Extend the BG/P aggregation model beyond the I/O and CPU node barrier
• Allow grouping of nodes into collaborating aggregates with distributed system services and dedicated service nodes
• Allow specialized kernels for file service, monitoring, checkpointing, and network routing
• Parameterized redundancy, reliability, and scaling
• Allow dynamic (re-)organization of the programming model to match the (changing) workload
[Diagram: local service, proxy service, and aggregate service configurations backed by remote services]
7. Desktop Extension
• Users want supercomputers to be an extension of their desktop
• Current parallel model is traditional batch model
• Workloads must use specialized compilers and be scheduled from a special front-end node; results are collected into a separate file system
• Monitoring and job control are through a web interface or the MMCS command line
• The very difficult development environment and lack of interactivity limit the productivity of the execution environment
• Proposed Research
• Leverage library-OS commercial scale-out work to allow tighter coupling between the desktop environment and supercomputer resources
• Construct a runtime environment that includes a reasonable subset of support for typical Linux run-time requirements (glibc, Python, etc.)
8. Extension Example
[Diagram: a desktop application and Brasil on a Mac (OS X) reach Brasil on a Linux pSeries front end via ssh over the Internet, then Plan 9 I/O nodes over 10 Gb Ethernet, and finally applications on Plan 9 CPU nodes over the collective and torus networks]
9. Native Interconnects
• Blue Gene's specialized networks are used primarily by the user-space run-time
• The hardware is accessed directly by the user-space run-time environment and is not shared, leading to poor utilization
• Exclusive use of tree network for I/O limits bandwidth and
reliability
• Proposed Solution
• Lightweight system-software interfaces to the interconnects so that they can be leveraged for system management, monitoring, and resource sharing as well as by user applications (a sketch follows below)
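One plausible shape for such a lightweight interface, following the file-system convention used throughout this work, is to expose the torus as a small file tree that any service can open; the device path, ctl command, and coordinate addressing below are illustrative assumptions, not the project's actual driver.

#include <u.h>
#include <libc.h>

void
main(void)
{
    int cfd, dfd;

    /* hypothetical torus device exposed as /net/torus */
    cfd = open("/net/torus/ctl", OWRITE);
    if(cfd < 0)
        sysfatal("open ctl: %r");

    /* address a peer node by its (x,y,z) torus coordinates */
    if(fprint(cfd, "connect 3 1 7") < 0)
        sysfatal("connect: %r");

    dfd = open("/net/torus/data", ORDWR);
    if(dfd < 0)
        sysfatal("open data: %r");

    /* one write becomes one or more torus packets to that node */
    fprint(dfd, "node status request from monitoring service");
    exits(nil);
}

Because the interface is just files, the same device can be reached through a mounted namespace from another node, which is what lets system services share the interconnect with applications.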
10. Protocol Exploration
• The Blue Gene networks are unusual (e.g., a 3D torus carrying 240-byte payloads)
• IP works, but isn't well matched to the underlying capabilities
• We want an efficient transport protocol to carry 9P messages & other data
streams
• Related Work: IBM’s ‘one-sided’ messaging operations [Blocksome et al]
• It supports both MPI and non-MPI applications such as Global Arrays
• Inspired by the IBM messaging protocol, we think we might do better than just IP
• Years ago there was much work on lightweight protocols for high-speed
networks
• We are using ideas from that earlier research to implement an efficient protocol
to carry 9P conversations
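For reference, every 9P2000 message begins with a small self-describing header (size[4] type[1] tag[2], little-endian), so a transport over the torus mostly has to reassemble 240-byte payload chunks into whole messages before delivering them. The sketch below shows that framing arithmetic; it is illustrative and not the project's protocol code.

#include <stdint.h>

enum { TORUS_PAYLOAD = 240 };   /* per-packet payload on the BG/P torus */

/* fixed prefix of every 9P2000 message: size[4] type[1] tag[2] */
typedef struct Ninehdr Ninehdr;
struct Ninehdr {
    uint32_t size;   /* total message length, including this header */
    uint8_t  type;   /* Tversion, Twalk, Rread, ... */
    uint16_t tag;    /* matches a response to its request */
};

/* decode the header from the first packet of a message (9P is little-endian) */
static Ninehdr
ninehdr(const uint8_t *p)
{
    Ninehdr h;

    h.size = p[0] | p[1]<<8 | p[2]<<16 | (uint32_t)p[3]<<24;
    h.type = p[4];
    h.tag  = (uint16_t)(p[5] | p[6]<<8);
    return h;
}

/* number of torus packets needed to carry one complete 9P message */
static unsigned
npackets(uint32_t msize)
{
    return (msize + TORUS_PAYLOAD - 1) / TORUS_PAYLOAD;
}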
11. Project Roadmap
[Gantt chart: Hardware Support, Systems Infrastructure, and Evaluation, Scaling, & Tuning tracks plotted across project years 0 through 3]
15. PUSH
!"#$$%
,-.# ,-.#
&'(()*+
,-.# !"#$$% ,-.#
&'(()*+
,-.# !"#$$% ,-.#
&'(()*+
!"#$$% !"#$$%
,-.# /0$1-.$#2'3 4#(0$1-.$#2'3 ,-.#
&'(()*+ &'(()*+
,-.# !"#$$% ,-.#
&'(()*+
,-.# !"#$$% ,-.#
&'(()*+
!"#$$%
,-.# &'(()*+ ,-.#
push -c ’{ Figure 1: The structure of the PUSH shell
ORS=./blm.dis
du -an files |< xargs os chasen | awk ’{print $1}’ | sort | uniq -c >| sort -rn
}’
We have added two additional pipeline operators, a multiplexing fan-out (|<[n]) and a coalescing
fan-in (>|). This combination allows PUSH to distribute I/O to and from multiple simultaneous
threads of control. The fan-out argument n specifies the desired degree of parallel threading. If no
argument is specified, the default is to spawn a new thread per record (up to the limit of available
cores). This can also be overridden by command-line options or environment variables. The
pipeline operators provide implicit grouping semantics, allowing natural nesting and composability.
While their complementary nature usually leads to symmetric mappings (where the number of
fan-outs equals the number of fan-ins), nothing in our implementation enforces this.
17. Strid3
Y = AX + Y
[Plot: time in seconds for 1024 iterations versus “stride”, i.e. the distance between scalars]
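Assuming this is the usual AXPY kernel, the loop being timed looks like the plain-C sketch below (illustrative only, not the benchmark's source); the plot varies the stride while timing 1024 iterations.

/* strided AXPY: y[i*stride] += a * x[i*stride]
 * the benchmark times 1024 iterations of this loop while varying the
 * stride (distance between scalars), exposing cache and memory behaviour */
void
axpy_strided(double a, const double *x, double *y, int n, int stride)
{
    int i;

    for(i = 0; i < n; i++)
        y[i*stride] += a * x[i*stride];
}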
18. Application Support
• Native
• Inferno Virtual Machine
• CNK Binary Support
• ELF converter
• Extended proc interface to mark processes as “cnk procs”
• Transition once the process execs, and not before
• Shim in syscall trap code to adapt argument-passing conventions (a sketch follows after this list)
• Linux Binary Support
• Basic Linux binary support
• Functional enough to run basic programs (Python, etc.)
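As a rough illustration of the trap-time shim mentioned above, the sketch below maps a foreign call number and register convention onto a native kernel entry point; the call number, register layout, and function names are assumptions made for illustration and do not come from the project's source.

/* hypothetical shim called from the system call trap path when the
 * current process has been marked as a "cnk proc" */

enum { CNK_WRITE = 4 };          /* illustrative foreign syscall number */

typedef struct Ureg Ureg;        /* saved user registers at trap time */
struct Ureg {
    unsigned long r0;            /* foreign call number (illustrative) */
    unsigned long r3, r4, r5;    /* argument registers (illustrative) */
};

long syswrite(int fd, void *buf, long n);   /* native kernel write path */

long
cnksyscall(Ureg *u)
{
    switch(u->r0){
    case CNK_WRITE:
        /* forward fd/buf/count, re-marshalled into the native convention */
        return syswrite((int)u->r3, (void*)u->r4, (long)u->r5);
    default:
        return -1;               /* unsupported foreign call */
    }
}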
19. Publications
• Unified Execution Model for Cloud Computing; Eric Van Hensbergen, Noah Evans, Phillip Stanley-
Marbell. Submitted to LADIS 2009; October 2009.
• PUSH, a DISC Shell; Eric Van Hensbergen, Noah Evans. To Appear in the Proceedings of the Principles of
Distributed Computing Conference; August 2009.
• Measuring Kernel Throughput on BG/P with the Plan 9 Research Operating System; Ron Minnich, John
Floren, Aki Nyrhinen. Submitted to SC 09; November 2009.
• XCPU2: Distributed Seamless Desktop Extension; Eric Van Hensbergen, Latchesar Ionkov. Submitted to
IEEE Clusters 2009; October 2009.
• Service Oriented File Systems; Eric Van Hensbergen, Noah Evans, Phillip Stanley-Marbell. IBM Research
Report (RC24788); June 2009.
• Experiences Porting the Plan 9 Research Operating System to the IBM Blue Gene Supercomputers; Ron
Minnich, Jim McKie. To appear in the Proceedings of the International Conference on Supercomputing
(ISC); June 2009.
• System Support for Many Task Computing; Eric Van Hensbergen and Ron Minnich. In the Proceedings of
the Workshop on Many Task Computing on Grids and Supercomputers; November 2008.
• Holistic Aggregate Resource Environment; Charles Forsyth, Jim McKie, Ron Minnich and Eric Van
Hensbergen. In the ACM Operating Systems Review; January 2008.
• Night of the Lepus: A Plan 9 Perspective on Blue Gene's Interconnects; Charles Forsyth, Jim McKie, Ron
Minnich and Eric Van Hensbergen. In the proceedings of the second annual international workshop on
Plan 9; December 2007.
• Petascale Plan 9; USENIX 2007.
20. Next Steps
• Infrastructure Scale Out
• File Services
• Command Execution
• Alternate Internode Communication Models
• Fail-in-place software RAS models
• Applications (Linux binaries and native support)
• Large Scale LINPACK Run
• Explore Mantevo Application Suite
• (http://software.sandia.gov/mantevo)
• CMU Working on Native Quake port
21. Acknowledgments
• Computational resources provided by the DOE INCITE Program. Thanks to the patient folks at ANL who have supported us in bringing up Plan 9 on their development BG/P
• Thanks to IBM Research Blue Gene team
and the Kittyhawk Team for guidance and
support.
25. Plan 9 Characteristics
Kernel Breakdown: Lines of Code
• Architecture-specific code (BG/P): ~14,000 lines
• Portable code (port): ~25,000 lines
• TCP/IP stack: ~14,000 lines
Binary Sizes
• 415k text + 140k data + 107k BSS
26. Why not Linux?
• Not a distributed system
• Core systems inflexible
  • VM based on the x86 MMU
  • Networking tightly tied to sockets & TCP/IP with long call paths
• Typical installations extremely overweight and noisy
• Benefits of modularity and open source are overcome by complexity, dependencies, and a rapid rate of change
• Community has become conservative
  • Support for alternative interfaces is waning
  • Support for large systems that hurts small systems is not acceptable
• Ultimately a customer constraint
  • FastOS was developed to prevent an OS monoculture in HPC
  • Few Linux projects were even invited to submit final proposals
27. Everything Represented as File Systems
[Diagram: hardware devices, system services, and application services all exposed as file hierarchies]
• Hardware devices: disk (/dev/hda1, /dev/hda2), network (/dev/eth0), console, audio, etc.
• System services: TCP/IP stack under /net (/net/arp, /net/udp, /net/tcp with clone and stats files plus per-connection directories /0, /1, /2 containing ctl, data, listen, local, remote, and status); process control, debug, etc.
• Application services: DNS (/net/cs, /net/dns); GUI under /win (clone plus per-window directories with ctl, data, and refresh files); wiki, authentication, and service control
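As a concrete example of this convention, a TCP connection in Plan 9 is made entirely with ordinary file operations on /net/tcp: clone a connection directory, write a connect request to its ctl file, then read and write the data file. The address in this sketch is a documentation placeholder.

#include <u.h>
#include <libc.h>

void
main(void)
{
    char dir[40], path[64];
    int cfd, dfd, n;

    /* allocate a connection directory; reading clone returns its number */
    cfd = open("/net/tcp/clone", ORDWR);
    if(cfd < 0)
        sysfatal("clone: %r");
    n = read(cfd, dir, sizeof dir - 1);
    if(n <= 0)
        sysfatal("read clone: %r");
    dir[n] = '\0';

    /* placeholder address: write the connect request to the ctl file */
    if(fprint(cfd, "connect 192.0.2.1!80") < 0)
        sysfatal("connect: %r");

    /* the data file now carries the byte stream */
    snprint(path, sizeof path, "/net/tcp/%s/data", dir);
    dfd = open(path, ORDWR);
    if(dfd < 0)
        sysfatal("open data: %r");
    fprint(dfd, "GET / HTTP/1.0\r\n\r\n");
    /* ... read the reply from dfd ... */
    exits(nil);
}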
28. Plan 9 Networks
[Diagram: a typical Plan 9 network. Terminals (screens, phones, PDAs, smartphones, set-top boxes) connect over WiFi/Edge, cable/DSL, and the Internet; a LAN (1 GB/s) joins terminals to file servers and CPU servers, which share content-addressable storage over a high-bandwidth (10 GB/s) network]
29. Aggregation as a First Class Concept
[Diagram: three configurations, a local service, a proxy service, and an aggregate service, each backed by remote services]
30. Issues of Topology
31. File Cache Example
• Proxy service monitors access to the remote file server & local resources
  • Local cache mode
  • Collaborative cache mode
  • Designated cache server(s)
• Integrate replication and redundancy
• Explore write coherence via “territories” à la Envoy
• Based on experiences with the Xget deployment model
• Leverage the natural topology of the machine where possible
32. Monitoring Example
• Distribute monitoring throughout the system
  • Use for system health monitoring and load balancing
  • Allow for application-specific monitoring agents
• Distribute filtering & control agents at key points in the topology
• Allow for localized monitoring and control as well as high-level global reporting and control
• Explore both push and pull monitoring models
• Based on experiences with the supermon system
33. Workload Management Example
• Provide a file system interface to job execution and scheduling
• Allow scheduling of new work from within the cluster, using localized as well as global scheduling controls
• Can allow for more organic growth of workloads as well as top-down and bottom-up models
• Can be extended to allow direct access from end-user workstations
• Based on experiences with the Xcpu mechanism (a sketch follows below)
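To make this concrete, the sketch below drives a hypothetical Xcpu-style session entirely through file operations; the mount point, session layout, and ctl command are illustrative assumptions rather than the project's final interface.

#include <u.h>
#include <libc.h>

void
main(void)
{
    int fd;

    /* hypothetical layout: each remote node exports sessions under /mnt/xcpu */
    fd = open("/mnt/xcpu/node17/session0/exec", OWRITE);
    if(fd < 0)
        sysfatal("open exec: %r");
    /* ... copy the program binary into fd ... */
    close(fd);

    /* start the job by writing to the session's ctl file */
    fd = open("/mnt/xcpu/node17/session0/ctl", OWRITE);
    if(fd < 0)
        sysfatal("open ctl: %r");
    fprint(fd, "exec");
    close(fd);

    /* results come back by reading the session's stdout file */
    fd = open("/mnt/xcpu/node17/session0/stdout", OREAD);
    if(fd < 0)
        sysfatal("open stdout: %r");
    /* ... read output ... */
    exits(nil);
}

Because the scheduling interface is just a file tree, the same operations work whether the request originates inside the cluster or from an end-user workstation that has mounted it.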
34. Right Weight Kernels Project (Phase I)
• Motivation
  • OS effect on applications: metric is based on OS interference on the FWQ & FTQ benchmarks
  • AIX/Linux has more capability than many apps need
  • LWK and CNK have less capability than apps want
• Approach
  • Customize the kernel to the application
• Ongoing Challenges
  • Need to balance capability with overhead
35. Why Blue Gene?
• Readily available large-scale cluster
  • Minimum allocation is 37 nodes
  • Easy to get 512 and 1024 node configurations
  • Up to 8192 nodes available upon request internally
  • FastOS will make a 64k-node configuration available
• DOE interest – Blue Gene was a specified target
• Variety of interconnects allows exploration of alternatives
• Embedded-core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware