Systems Support for Many Task Computing


A look at using aggregation as a first class construct within operating systems to enable scaling applications and services.

Systems Support for Many Task Computing: Holistic Aggregate Resource Environment (HARE)
Eric Van Hensbergen (IBM) and Ron Minnich (Sandia National Labs)
Motivation
Overview of Approach
  • Targeting Blue Gene/P
    ◦ Provide a complementary runtime environment
  • Using the Plan 9 research operating system
    ◦ "Right Weight Kernel": balances simplicity and function
    ◦ Built from the ground up as a distributed system
  • Leverage HPC interconnects for system services
  • Distribute system services among compute nodes
  • Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency
Related Work
  • Default Blue Gene runtime
    ◦ Linux on I/O nodes + CNK on compute nodes
  • High Throughput Computing (HTC) mode
  • Compute Node Linux
  • ZeptoOS
  • Kittyhawk
Foundation: Plan 9 Distributed System
  • Right Weight Kernel
    ◦ General-purpose, multi-threaded, multi-user environment
    ◦ Pleasantly portable
    ◦ Relatively lightweight (compared to Linux)
  • Core principles
    ◦ All resources are synthetic file hierarchies
    ◦ Local & remote resources are accessed via a simple API
    ◦ Each thread can dynamically organize local and remote resources via a dynamic private namespace
Everything Represented as File Systems
  [Diagram: synthetic file hierarchies spanning hardware devices (disk: /dev/hda1, /dev/hda2; network: /dev/eth0), system services (TCP/IP stack: /net/arp, /net/udp, /net/tcp/clone, /net/tcp/stats, per-connection /net/tcp/0, /net/tcp/1 with ctl, data, listen, local, remote, status files; DNS: /net/cs, /net/dns), application services (GUI: /win/clone, per-window ctl, data, refresh files), plus console, audio, wiki, authentication, service control, and process control/debug interfaces]
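The "everything is a file" model above can be sketched in miniature. The following Python toy is purely illustrative (real Plan 9 services export such trees over the 9P protocol, and the clone-file behavior mimicked here is modeled on /net/tcp/clone): services publish control and data "files", and clients interact only by reading and writing names.

```python
# Toy model of a Plan 9-style synthetic file hierarchy (illustrative
# sketch only; real Plan 9 services speak the 9P protocol).
class SynthFS:
    def __init__(self):
        self.files = {}              # path -> (read_fn, write_fn)

    def serve(self, path, read_fn, write_fn=None):
        self.files[path] = (read_fn, write_fn)

    def read(self, path):
        return self.files[path][0]()

    def write(self, path, data):
        write_fn = self.files[path][1]
        if write_fn is None:
            raise PermissionError(path + " is read-only")
        write_fn(data)

# A hypothetical network service: reading the clone file allocates a
# new connection and returns its id, mirroring /net/tcp/clone.
conns = []

def clone():
    conns.append({"state": "Closed"})
    return str(len(conns) - 1)

fs = SynthFS()
fs.serve("/net/tcp/clone", clone)
fs.serve("/net/tcp/stats", lambda: "conns %d" % len(conns))

cid = fs.read("/net/tcp/clone")      # allocates connection "0"
print(fs.read("/net/tcp/stats"))     # prints "conns 1"
```

Because the entire interface is names, reads, and writes, the same client code works whether the tree is served locally or mounted from a remote node's namespace.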
Plan 9 Networks
  [Diagram: a Plan 9 network spanning the Internet, a high bandwidth (10 GB/s) network, a LAN (1 GB/s) network, and wifi/edge and cable/DSL links, connecting content addressable storage, a file server, CPU servers, and terminals, PDAs, smartphones, set-top boxes, and screen phones]
An Issue of Scale
  [Diagram: Blue Gene/P packaging hierarchy: chip (4-way), compute card (2 chips), node card (4x4x2: 32 compute, 0-2 I/O cards), rack (32 node cards), system (72 racks)]
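The packaging figures on this slide multiply out to the full machine size. A back-of-the-envelope sketch (it assumes one 4-way chip per compute node and 32 compute nodes per node card, which the slide does not state explicitly):

```python
# Rough machine scale implied by the packaging hierarchy on this slide.
# Assumptions: one 4-way chip per compute node, 32 compute nodes per
# node card, 32 node cards per rack, 72 racks.
cores_per_node = 4
nodes_per_node_card = 32
node_cards_per_rack = 32
racks = 72

nodes = racks * node_cards_per_rack * nodes_per_node_card
cores = nodes * cores_per_node
print(nodes, cores)   # 73728 294912
```

Tens of thousands of nodes is the scale at which per-node system services stop being viable, which motivates the aggregation constructs on the next slide.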
Aggregation as a First Class Concept
  [Diagram: a local proxy service aggregates several remote services and presents them to local clients as a single aggregate service]
Issues of Topology
File Cache Example
  • Proxy service
    ◦ Monitors access to the remote file server & local resources
    ◦ Local cache mode
    ◦ Collaborative cache mode
    ◦ Designated cache server(s)
    ◦ Integrate replication and redundancy
    ◦ Explore write coherence via "territories" à la Envoy
  • Based on experiences with the Xget deployment model
  • Leverage the natural topology of the machine where possible
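The collaborative cache mode above can be sketched as follows. This is hypothetical logic, not the HARE code: on a miss, a node consults a peer cache before falling back to the remote file server, so the server sees each file roughly once per group of nodes rather than once per node.

```python
# Sketch of a collaborative file cache (illustrative only).
class RemoteServer:
    def __init__(self):
        self.reads = 0               # how often the server was hit
    def fetch(self, path):
        self.reads += 1
        return "data:" + path

class CacheNode:
    def __init__(self, server, peers=()):
        self.server = server
        self.peers = list(peers)
        self.cache = {}
    def lookup(self, path):
        return self.cache.get(path)
    def read(self, path):
        hit = self.lookup(path)
        if hit is None:
            for peer in self.peers:      # collaborative step
                hit = peer.lookup(path)
                if hit is not None:
                    break
        if hit is None:
            hit = self.server.fetch(path)  # last resort: remote server
        self.cache[path] = hit
        return hit

server = RemoteServer()
a = CacheNode(server)
b = CacheNode(server, peers=[a])
a.read("/app/config")
b.read("/app/config")
print(server.reads)   # 1: the second node was served by its peer
```

Choosing which peers to consult is exactly where the machine's natural topology helps: nearest torus neighbors or a designated cache server per node card keep the extra hop cheap.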
Monitoring Example
  • Distribute monitoring throughout the system
    ◦ Use it for system health monitoring and load balancing
    ◦ Allow for application-specific monitoring agents
  • Distribute filtering & control agents at key points in the topology
  • Allow for localized monitoring and control as well as high-level global reporting and control
  • Explore both push and pull monitoring models
  • Based on experiences with the supermon system
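The pull model above can be sketched as a tree of agents. This is an illustrative toy, not supermon itself (supermon exchanges s-expressions over sockets): interior agents sample their children on demand and report a filtered summary upward, so the root sees one aggregate instead of N raw streams.

```python
# Sketch of tree-structured pull monitoring (illustrative only).
class Agent:
    def __init__(self, load=0.0, children=()):
        self.load = load             # this node's local metric
        self.children = list(children)

    def sample(self):
        """Pull model: aggregate on demand from the leaves up,
        reporting only a summary (node count and peak load)."""
        samples = [c.sample() for c in self.children]
        return {
            "nodes": 1 + sum(s["nodes"] for s in samples),
            "max_load": max([self.load] +
                            [s["max_load"] for s in samples]),
        }

# A 5-node monitoring tree: root -> (interior -> 2 leaves, 1 leaf).
root = Agent(0.1, [Agent(0.7, [Agent(0.3), Agent(0.9)]), Agent(0.2)])
print(root.sample())   # {'nodes': 5, 'max_load': 0.9}
```

A push variant would invert the flow (children send samples upward on a timer or threshold), trading query latency for steady background traffic; filtering at interior agents works the same way in both.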
Workload Management Example
  • Provide a file system interface to job execution and scheduling
  • Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls
  • Can allow for more organic growth of workloads as well as top-down and bottom-up models
  • Can be extended to allow direct access from end-user workstations
  • Based on experiences with the Xcpu mechanism
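A file system interface to job execution can be sketched like this. The paths and ctl-file semantics below are hypothetical (Xcpu-flavored, not its actual protocol): a client schedules work by writing to a job's ctl file and collects results from an output file, so any tool that can open files can drive the scheduler.

```python
# Sketch of a file-interface job launcher (hypothetical semantics).
import os
import subprocess
import tempfile

jobdir = tempfile.mkdtemp(prefix="job0-")   # stands in for /proc-style job dir

# "exec" request: the client writes the command line to the ctl file...
with open(os.path.join(jobdir, "ctl"), "w") as ctl:
    ctl.write("exec echo hello from the cluster")

# ...a stand-in scheduler reads the request and runs it on some node...
with open(os.path.join(jobdir, "ctl")) as ctl:
    cmd = ctl.read().split(" ", 1)[1]
out = subprocess.run(cmd.split(), capture_output=True, text=True).stdout
with open(os.path.join(jobdir, "stdout"), "w") as f:
    f.write(out)

# ...and the client collects the result through the same namespace.
with open(os.path.join(jobdir, "stdout")) as f:
    print(f.read().strip())   # hello from the cluster
```

Because jobs are just file trees, they can be mounted anywhere: a compute node can schedule new work by writing into a sibling's job directory (bottom-up), and an end-user workstation can mount the same tree for top-down control.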
Status
  • Initial port to BG/P is 90% complete
  • Applications
    ◦ Linux emulation environment
    ◦ CNK emulation environment
    ◦ Native ports of applications
  • Also have a port of the Inferno virtual machine to BG/P
    ◦ Runs on Kittyhawk as well as natively
  • Baseline boot & runtime infrastructure complete
HARE Team
  • David Eckhardt (Carnegie Mellon University)
  • Charles Forsyth (Vita Nuova)
  • Jim McKie (Bell Labs)
  • Ron Minnich (Sandia National Labs)
  • Eric Van Hensbergen (IBM Research)
Thanks
  • Funding: This material is based upon work supported by the Department of Energy under Award Number DE-FG02-08ER25851
  • Resources: This work is being conducted on resources provided by the Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program
  • Information: The authors would also like to thank the IBM Research Blue Gene team and the IBM Research Kittyhawk team for their assistance
Questions? Discussion?
Links
  • FastOS web site
  • Phase II CFP
  • BlueGene
  • Plan 9
  • LibraryOS
Plan 9 Characteristics
  • Kernel breakdown, lines of code
    ◦ Architecture-specific code (BG/L): ~10,000 lines
    ◦ Portable code: ~25,000 lines, plus ~14,000 lines for the TCP/IP stack
  • Binary size: 415k text + 140k data + 107k BSS
  • Runtime memory footprint: ~4 MB for compute node kernels; could be smaller or larger depending on application-specific tuning
Why not Linux?
  • Not a distributed system
  • Core systems are inflexible
    ◦ VM subsystem is based on the x86 MMU
    ◦ Networking is tightly tied to sockets & TCP/IP, with long call paths
    ◦ Typical installations are extremely overweight and noisy
    ◦ The benefits of modularity and open source are overcome by complexity, dependencies, and a rapid rate of change
  • The community has become conservative
    ◦ Support for alternative interfaces is waning
    ◦ Support for large systems that hurts small systems is not acceptable
  • Ultimately a customer constraint
    ◦ FastOS was created to prevent an OS monoculture in HPC
    ◦ Few Linux projects were even invited to submit final proposals
[Graph: FTQ benchmark on a BG/L I/O node running Linux]
[Graph: FTQ benchmark on a BG/L I/O node running Plan 9]
Right Weight Kernels Project (Phase I)
  • Motivation
    ◦ OS effect on applications: metric is based on OS interference in the FWQ & FTQ benchmarks
    ◦ AIX/Linux have more capability than many apps need
    ◦ LWK and CNK have less capability than apps want
  • Approach: customize the kernel to the application
  • Ongoing challenge: balancing capability with overhead
Why Blue Gene?
  • Readily available large-scale cluster
    ◦ Minimum allocation is 37 nodes
    ◦ Easy to get 512 and 1024 node configurations
    ◦ Up to 8192 nodes available upon request internally
    ◦ FastOS will make a 64k configuration available
  • DOE interest: Blue Gene was a specified target
  • Variety of interconnects allows exploration of alternatives
  • Embedded core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems software management, device drivers, or firmware
Department of Energy FastOS CFP, aka Operating and Runtime Systems for Extreme Scale Scientific Computation (DE-PS02-07ER07-23)
  • Goal: stimulate R&D related to operating and runtime systems for petascale systems in the 2010 to 2015 time frame
  • Expected output: a unified operating and runtime system that can fully support and exploit petascale and beyond systems
  • Near-term hardware targets: Blue Gene, Cray XD3, and HPCS machines
Blue Gene Interconnects
  • 3-dimensional torus
    ◦ Interconnects all compute nodes (65,536)
    ◦ Virtual cut-through hardware routing
    ◦ 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
    ◦ 1 µs latency between nearest neighbors, 5 µs to the farthest
    ◦ 4 µs latency for one hop with MPI, 10 µs to the farthest
    ◦ Communications backbone for computations
    ◦ 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
  • Global tree
    ◦ One-to-all broadcast functionality
    ◦ Reduction operations functionality
    ◦ 2.8 Gb/s of bandwidth per link
    ◦ 2.5 µs latency for a one-way tree traversal
    ◦ ~23 TB/s total binary tree bandwidth (64k machine)
    ◦ Interconnects all compute and I/O nodes (1024)
  • Ethernet
    ◦ Incorporated into every node ASIC
    ◦ Active in the I/O nodes (1:64)
    ◦ Carries all external communication (file I/O, control, user interaction, etc.)
  • Low latency global barrier and interrupt
    ◦ 1.3 µs round-trip latency
  • Control network
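As a sanity check on the torus latency figures, hop counts in a wraparound torus are easy to compute. A sketch (the 32 x 32 x 64 dimensions are an assumption about how the 64k-node machine is laid out, not stated on the slide):

```python
# With wraparound links, the farthest node in an X x Y x Z torus is
# X/2 + Y/2 + Z/2 hops away (minimal routing).
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

dims = (32, 32, 64)                  # assumed layout of a 64k-node machine
farthest = torus_hops((0, 0, 0), (16, 16, 32), dims)
print(farthest)   # 64 hops
```

Sixty-four hops against the slide's 5 µs farthest-node figure implies well under 100 ns per hop of hardware routing, which is why cut-through routing rather than per-hop software forwarding is essential at this scale.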