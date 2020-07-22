
The HPC middleware FUJITSU Software Technical Computing Suite (or simply Technical Computing Suite) developed by Fujitsu provides an exascale system operation and application environment for the K computer(*1) and other supercomputers. The CPUs mounted in the FUJITSU Supercomputer PRIMEHPC FX1000 (or simply PRIMEHPC FX1000) have the ARM(*2) architecture and the versatility to support a wide range of software, including Technical Computing Suite. Fujitsu shares the experience and technology gained from the development of this HPC middleware with the community so that we can improve HPC usability together.
The structure of the HPC middleware in the PRIMEHPC FX1000 is shown in the following figure, with an overview provided below.

https://www.fujitsu.com/downloads/SUPER/primehpc-fx1000-soft-en.pdf

  1. Technical Computing Suite Job Management Software for the PRIMERGY x86 cluster and Supercomputer PRIMEHPC FX10. Toshiaki Mikamo, Fujitsu Limited. Copyright 2012 FUJITSU LIMITED
  2. Outline
     • System Configuration and Software Stack
     • Features
     • The major functions of the job scheduler: Efficient Resource Usage, Fair Share Scheduling, System-optimal Resource Assignment
     • Summary and Future
  3. Hybrid System Configuration
     • Login nodes (User): login, compilation, job submission
     • Management nodes (Administrator): system operations management, job operations management
     • File management nodes, job management nodes, control nodes
     • Global file system (data storage area)
     • Local file system (temporary area occupied by jobs)
     • Supercomputer PRIMEHPC FX10: 6D mesh/torus interconnect (Tofu)
     • PRIMERGY x86 cluster: fat-tree interconnect (InfiniBand)
     • IO network (IB), management network (GbE)
  4. System Software Stack (Technical Computing Suite)
     • User/ISV Applications
     • HPC Portal / System Management Portal
     • System operations management: system configuration management, system control, system monitoring, system installation & operation
     • Job operations management: job manager, job scheduler, resource management, parallel execution environment
     • High-performance file system: Lustre-based distributed file system, high scalability, IO bandwidth guarantee, high reliability & availability
     • VISIMPACT(TM): shared L2 cache on a chip, hardware intra-processor synchronization
     • Compilers: hybrid parallel programming, sector cache support, SIMD / register file extensions
     • Support Tools: IDE, profiler & tuning tools, interactive debugger
     • MPI Library: Scalability of High-Func., Barrier Comm.
     • Operating system: Linux-based enhanced Operating System (Supercomputer PRIMEHPC FX10), Red Hat Enterprise Linux (PRIMERGY x86 cluster)
  5. Features
     • Same job operations on FX10 and PRIMERGY
     • Efficient, fair and system-optimal job scheduling
        • See the following slides for details
     • Resource / access control
        • Elapsed time limit / CPU time limit / physical memory limit
        • Enable / disable execute permission of job operation commands
        • Reduce OS jitter / power saving control
     • Job statistical information
        • The amount of CPU time / memory / IO
        • SIMD rate / MIPS / MFLOPS
  6. Job Scheduler
     We renewed our job scheduler for large-scale systems. Its features:
     • Multi-process: multiple schedulers can coexist in a cluster.
     • Multi-thread: the scheduling load can be balanced.
  7. Efficient Resource Usage
     • Backfill scheduling for keeping the resources busy
        • Our scheduler manages both space (compute nodes) and time.
        • It backfills low-priority jobs in a way that does not delay high-priority jobs.
     [Figure: "Not backfilled" and "Backfilled" schedules of a running job and Jobs A to D on a time axis marked Now, t1, t2, t3, with Job D and Job C labeled as backfilled]
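The backfill idea on slide 7 can be sketched in a few lines of Python. This is a minimal illustration of conservative backfilling, not Fujitsu's scheduler: the `Job` class, the `(start, end, nodes)` reservation tuples, and the function names are all assumptions made for the example. Jobs are placed in priority order, each at the earliest time where enough nodes are free for its whole elapsed time limit, so a lower-priority job can slip into a gap without moving any reservation already made.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int   # compute nodes requested (the "space" dimension)
    limit: int   # elapsed time limit, in arbitrary time units


def free_nodes(reservations, total, t):
    """Nodes not reserved at time t; reservations are (start, end, nodes)."""
    return total - sum(n for s, e, n in reservations if s <= t < e)


def earliest_start(reservations, total, job, now=0):
    """Earliest time >= now at which the job fits for its whole time limit."""
    if job.nodes > total:
        raise ValueError(f"{job.name} requests more nodes than exist")
    # Capacity only grows when a reservation ends, so only 'now' and
    # reservation end times need to be tried as start times.
    candidates = sorted({now} | {e for _, e, _ in reservations if e > now})
    for start in candidates:
        end = start + job.limit
        # Capacity only shrinks when a reservation starts, so checking the
        # window start and every reservation start inside it is sufficient.
        points = {start} | {s for s, _, _ in reservations if start < s < end}
        if all(free_nodes(reservations, total, t) >= job.nodes for t in points):
            return start
    raise RuntimeError("unreachable: the last candidate always fits")


def schedule(jobs, total, running):
    """Place jobs (highest priority first) around the running reservations."""
    reservations = list(running)
    plan = {}
    for job in jobs:
        start = earliest_start(reservations, total, job)
        reservations.append((start, start + job.limit, job.nodes))
        plan[job.name] = start
    return plan


# Hypothetical 8-node system: a running job holds 6 nodes until t=3.
queue = [Job("A", 8, 2), Job("B", 4, 4), Job("C", 2, 3), Job("D", 2, 5)]
plan = schedule(queue, total=8, running=[(0, 3, 6)])
print(plan)
```

In this made-up queue, Job C fits into the two nodes left idle by the running job and starts immediately, while the higher-priority Job A still starts at t=3, exactly when it would have started without backfilling.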
  8. Fair Share Scheduling
     • Fairly share resources between users/groups based on past usage.
        ① A fair share value is issued in advance for each user/group.
        ② The value changes according to the resulting resource usage.
        ③ The job execution priority is determined dynamically according to the value.
     • The fair share value is like money.
        Payment [P] = (#Nodes allocated) x (Elapsed time limit of job)
        Deposit [D] = (Elapsed time) x (Recovery rate)
        Return of overpaid [R] = P - ((#Nodes allocated) x (Actual elapsed time of job))
     [Figure: fair share value (money) plotted against time, with Payment, Return of overpaid, and Deposit marked]
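A worked example of the three fair share formulas on slide 8, written as a small Python sketch. The function names, the starting balance, and the sample numbers are assumptions chosen for illustration; the slide does not state units, when exactly each amount is applied, or how the value maps to a priority.

```python
def payment(nodes, time_limit):
    """P = (#nodes allocated) x (elapsed time limit of job)."""
    return nodes * time_limit


def return_of_overpaid(nodes, time_limit, actual_elapsed):
    """R = P - (#nodes allocated) x (actual elapsed time of job)."""
    return payment(nodes, time_limit) - nodes * actual_elapsed


def deposit(elapsed, recovery_rate):
    """D = (elapsed time) x (recovery rate)."""
    return elapsed * recovery_rate


# Hypothetical user: starts with 1000, runs a 10-node job with a time
# limit of 60 that actually finishes after 40 time units.
value = 1000
value -= payment(10, 60)                  # charged up front: 600
value += return_of_overpaid(10, 60, 40)   # unused allocation returned: 200
value += deposit(40, 2)                   # recovered over 40 time units: 80
print(value)  # 680
```

The net cost of the job is the nodes actually used times the actual elapsed time (400 here), and the deposit then restores the value gradually, so a user who stops submitting jobs regains priority over time.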
  9. Optimal Job Scheduling for FX10
     • Interconnect topology-aware resource assignment
        • One interconnect unit: 12 nodes (2 x 3 x 2)
     • Job assignment rule: rectangular solid shape
        • Guarantees neighbor communication
        • Avoids interfering with other jobs
        • Rotates the rectangular solid of interconnect units to reduce fragmentation
     [Figure: in-use and unoccupied regions on x, y, z axes, showing a rectangular solid with sides 4, 6 and 8 in several rotations]
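The rotation rule on slide 9 can be illustrated with a brute-force Python sketch. It is only a toy under stated assumptions: the grid is a plain 3D box without torus wrap-around, cells stand for whatever unit the scheduler allocates, and `orientations` and `find_placement` are hypothetical names, not part of Technical Computing Suite. The request is tried in every axis permutation until one fits entirely inside unoccupied cells.

```python
from itertools import permutations, product


def orientations(shape):
    """Distinct axis-aligned rotations of a rectangular solid (x, y, z)."""
    return sorted(set(permutations(shape)))


def find_placement(dims, occupied, shape):
    """Return (origin, rotated_shape) for the first free fit, else None."""
    for sx, sy, sz in orientations(shape):
        origins = product(
            range(dims[0] - sx + 1),
            range(dims[1] - sy + 1),
            range(dims[2] - sz + 1),
        )
        for ox, oy, oz in origins:
            cells = {
                (ox + i, oy + j, oz + k)
                for i, j, k in product(range(sx), range(sy), range(sz))
            }
            if not cells & occupied:
                return (ox, oy, oz), (sx, sy, sz)
    return None


# Hypothetical 4 x 4 x 2 grid whose x = 0..1 half is already in use.
dims = (4, 4, 2)
occupied = {(x, y, z) for x in range(2) for y in range(4) for z in range(2)}

# A 4 x 2 x 2 request does not fit as given, but fits when rotated.
placement = find_placement(dims, occupied, (4, 2, 2))
print(placement)
```

Without rotation the 4 x 2 x 2 request would have to wait even though 16 cells are free; rotated to 2 x 4 x 2 it fills the free half exactly, which is the fragmentation effect the slide refers to.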
  10. Optimal Job Scheduling for FX10: Asynchronous file staging
     • Stage IN: asynchronously transfer files from the Global FS to the Local FS before the job starts.
     • Stage OUT: asynchronously transfer files from the Local FS to the Global FS after the job ends.
     • Co-scheduling of computation and file transfer.
     [Figure: login nodes, global file system (data storage area), IO nodes, compute nodes, and local file system connected by the interconnect, IO network (IB), and management network (GbE); a timeline marked Now, t1, t2, t3 shows Stage IN and Stage OUT overlapping the running job and Jobs A to C]
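To show why overlapping staging with computation helps, here is a rough Python model of slide 10. It rests on simplifying assumptions that are not in the slides: jobs run one after another on the same compute nodes, stage-in transfers are queued back to back starting at time 0, stage-out transfers never contend with anything, and local file system capacity is ignored. The function names and job tuples are made up for the example.

```python
def makespan_sync(jobs):
    """Stage in, run, and stage out strictly one after another."""
    return sum(stage_in + run + stage_out for stage_in, run, stage_out in jobs)


def makespan_async(jobs):
    """Overlap staging with the computation of neighboring jobs."""
    stage_in_done = 0   # when the most recent stage-in finished
    run_done = 0        # when the compute nodes become free again
    finish = 0          # when the last stage-out finished
    for stage_in, run, stage_out in jobs:
        stage_in_done += stage_in              # stage-ins run back to back
        start = max(run_done, stage_in_done)   # need nodes and input files
        run_done = start + run
        finish = max(finish, run_done + stage_out)
    return finish


# Hypothetical (stage_in, run, stage_out) durations for three jobs.
jobs = [(2, 5, 1), (3, 4, 2), (1, 6, 1)]
print(makespan_sync(jobs), makespan_async(jobs))  # 25 18
```

In this toy case only the first stage-in and the last stage-out remain on the critical path; every other transfer is hidden behind a running job.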
  11. Optimal Job Scheduling for PRIMERGY
     • Fine-grained node assignment
        • Node selection method: balancing / concentration
        • Rank placement policy: pack / unpack
        • Priority control of allocated nodes
        • Execution mode: whether or not a node is occupied exclusively by a job
     • Strict core assignment
        • Processes are bound to cores in the job's territory.
        • No process can move to cores in another job's territory.
     [Figure: node concentration of Jobs A to D on Node#0 to Node#2; rank pack versus rank unpack placement of ranks R0 and R1; cores 0 to 7 of one node divided between Job A and Job B]
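The pack and unpack rank placement policies on slide 11 can be sketched as follows. The interpretation is inferred from the figure labels (R0 and R1 on one node for pack, on different nodes for unpack) and is an assumption, as are the function name and parameters; the real policies may differ in detail.

```python
def place_ranks(n_ranks, nodes, cores_per_node, policy):
    """Map each MPI rank to a node under a 'pack' or 'unpack' policy."""
    if n_ranks > len(nodes) * cores_per_node:
        raise ValueError("not enough cores for the requested ranks")
    if policy == "pack":
        # Fill every core of a node before moving on to the next node.
        return {r: nodes[r // cores_per_node] for r in range(n_ranks)}
    if policy == "unpack":
        # Spread consecutive ranks across nodes in round-robin order.
        return {r: nodes[r % len(nodes)] for r in range(n_ranks)}
    raise ValueError(f"unknown policy: {policy}")


packed = place_ranks(4, nodes=[0, 1], cores_per_node=2, policy="pack")
unpacked = place_ranks(4, nodes=[0, 1], cores_per_node=2, policy="unpack")
print(packed)    # {0: 0, 1: 0, 2: 1, 3: 1}
print(unpacked)  # {0: 0, 1: 1, 2: 0, 3: 1}
```

Packing keeps neighboring ranks on the same node, which favors shared-memory communication between them, while unpacking spreads neighboring ranks over nodes, which can give each rank more memory bandwidth.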
  12. Summary and Future
     • We developed the job management software.
        • Unified operability on PRIMEHPC FX10 and PRIMERGY
        • New job scheduler: efficiency, fairness and system optimization
        • Practical resource control and job statistical information
     • Future work
        • Operation simulator: administrators will be able to simulate how operation changes after operation parameters are modified.

