pMatlab on BlueGene

  • Title slide for the presentation, given at HPEC 2010, September 15-16, 2010.
  • Outline of the presentation. The introduction covers the motivation for porting pMatlab to mega-core systems such as the IBM Blue Gene/P and how the port was done. Performance details of the math kernel (Octave) for pMatlab on BG/P are then given. We also present the technical details of, and the performance improvements from, the aggregation work required for large-scale computation.
  • pMatlab is the library layer; it abstracts away the parallel distribution details so that the user calls pMatlab functions, which provide mechanisms for easily distributing and parallelizing Matlab code. pMatlab can use either MATLAB or Octave as the mathematics kernel, and the messaging can be accomplished with any library that can send and receive MATLAB messages. In our current implementation, MatlabMPI is the communication kernel: pMatlab calls MatlabMPI functions to communicate data between processors.
  • IBM Blue Gene/P system design diagram. Each compute node has four cores running at 850 MHz and 2 GB or 4 GB of memory. The Surveyor system at Argonne National Laboratory, which was used for this study, has 2 GB of memory per node. One full rack has 1024 compute nodes, or 4096 compute cores.
  • The IBM Blue Gene/P system supports two major computing environments. The High Throughput Computing (HTC) environment is ideal for serial and pleasantly parallel applications. The High Performance Computing environment targets highly scalable message-passing applications that use the MPI library. The HTC environment is a good fit for porting pMatlab to BG/P.
  • In Symmetric Multiprocessing (SMP) mode, each physical compute node executes a single process with a default maximum of four threads, and that process can access the full physical memory. In Dual and VN mode for HTC, only 1/2 and 1/4 of the node memory, respectively, is accessible to each process. For example, the Surveyor system has 2 GB of memory per compute node, so an SMP-mode process has approximately 2 GB minus the space occupied by the OS, a Dual-mode process has about 1 GB, and a VN-mode process has about 512 MB.
  • Porting pMatlab to the BG/P system comes down to integrating the BG/P job launch mechanism into the pMatlab job launch mechanism. Running pMatlab jobs on BG/P involves two steps. First, a BG partition is requested and booted in HTC mode using the qsub command on Surveyor. Then, once the partition has booted, the submit command is executed to submit a job to the partition. By combining the two steps under the pRUN command, pMatlab jobs can be executed on BG/P.
  • Outline slide for the performance studies. We cover single-process performance, point-to-point communication performance, and scalability results using the parallel Stream benchmark.
  • Examples used for the performance studies, all of which are available in the pMatlab package.
  • The performance of the example codes running on a single processor core is compared between the Intel Xeon and IBM PowerPC 450 chips.
  • Using an optimized BLAS library (ATLAS), the Octave 3.2.4 binary ran faster than Matlab 2009a on the DGEMM and HPL benchmark examples.
  • Stream benchmark performance is compared between the LLGrid and Surveyor (IBM BG/P) systems.
  • Point-to-point communication performance on BG/P is measured with pSpeed, a pMatlab example code.
  • Because messages are files in pMatlab, an efficient file system is essential to scale to a large number of processes. With a single NFS-shared file system, there is serious resource contention as the number of processes increases. By using a group of cross-mounted, distributed file systems, we can mitigate this issue. The next slide demonstrates the file system resource contention.
  • The pSpeed performance with MATLAB was measured with all messages written to a single NFS-shared disk. Resource contention starts to appear with four processes, and with eight processes the issue is obvious.
  • When crossmounts are used for pMatlab message communication, the jittery behavior is resolved. The overall performance of Octave matches that of Matlab well with four and eight processes.
  • Point-to-point communication (pSpeed) on the BG/P system performed as expected when using IBM's proprietary parallel file system, GPFS.
  • Now we look at scalability from a different perspective: the problem size is increased proportionally with the number of processes, so each process does the same amount of work. In SMP mode, the initial global array size is set to 2^25 for Np=1, the largest size that fits on a single process. As the problem size scales with the number of processes, the bandwidth scales linearly: at Np=1024 (1024 nodes with 1 process per node), the Triad bandwidth is 563 GB/sec with 100% parallel efficiency.
  • The pStream benchmark was able to scale up to 16,384 processes on an IBM-internal BG/P machine.
  • Outline slide for the aggregation optimization done for large-scale computation.
  • The current aggregation architecture is easy to implement, but it is inherently sequential, so the collection step slows down significantly as the number of processes grows.
  • Binary-tree based aggregation can significantly reduce the number of messages a process sends and receives, although the messages themselves become larger. For example, when Np = 8, the leader process receives three messages with BAGG() while it receives seven messages with AGG().
  • On an MIT Lincoln Laboratory cluster system, the performance of AGG() and BAGG() is compared for a 2-D data and process distribution on two different file systems. At large process counts, BAGG() performs significantly better than AGG(): with 128 processes, BAGG() achieved a 10x improvement on the IBRIX file system and an 8x improvement on LUSTRE. The numbers depend strongly on the file system load, since they were obtained on a production cluster.
  • On the IBM Blue Gene/P system, the performance comparison was done over a wide range of process counts, from 8 to 1024, using a 4-D data and process distribution. With the GPFS file system, BAGG() runs 2.5x faster than AGG().
  • BAGG() only works when Np is a power of two. We generalized BAGG() so that it works with any number of processes. The generalization is based on an extended binary tree, so the binary-tree aggregation architecture can still be used. There is some additional cost associated with the fictitious processes; message communication with them is simply skipped.
  • The cost of the additional bookkeeping appears to be minimal. On a production MIT Lincoln Laboratory cluster, the performance of AGG(), BAGG(), and HAGG() was compared; at Np = 128, BAGG() and HAGG() run about 3x faster than AGG().
  • On an IBM Blue Gene/P system, the cost of the additional bookkeeping was studied over a wide range of process counts using a 4-D data and process distribution on the GPFS file system. BAGG() and HAGG() perform almost identically throughout the entire range, from 8 to 512 processes.
  • On the LLGrid system, using a hybrid configuration (the central file system for the lead process and crossmounts for the compute nodes), the new agg() function reduces resource contention on the file system and thus improves agg() performance significantly.
  • Summary slide

    1. Toward Mega-Scale Computing with pMatlab
       Chansup Byun and Jeremy Kepner, MIT Lincoln Laboratory
       Vipin Sachdeva and Kirk E. Jordan, IBM T.J. Watson Research Center
       HPEC 2010
       This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

    2. Outline
       • Introduction
         - What is Parallel Matlab (pMatlab)
         - IBM Blue Gene/P System
         - BG/P Application Paths
         - Porting pMatlab to BG/P
       • Performance Studies
       • Optimization for Large Scale Computation
       • Summary

    3. Parallel Matlab (pMatlab)
       • Layered architecture for parallel computing
       • Kernel layer does single-node math & parallel messaging
       • Library layer provides a parallel data and computation toolbox to Matlab users
       [Layer diagram: user application on top; Library Layer (pMatlab) providing vector/matrix, computation, task, and conduit abstractions; Kernel Layer with math (MATLAB/Octave) and messaging (MatlabMPI); parallel hardware at the bottom]
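A minimal sketch (not from the deck) of the user-level pattern the library layer enables, assuming the standard pMatlab map/local/put_local/agg interface and the Np variable supplied by the pMatlab launch:

    % Minimal pMatlab usage sketch (illustrative; assumes the standard pMatlab API).
    N = 1024;
    P = map([Np 1], {}, 0:Np-1);   % distribute rows across all Np processes
    A = zeros(N, N, P);            % distributed array created by the library layer
    Aloc = local(A);               % this process's block
    Aloc = Aloc + 1;               % purely local math, run by the MATLAB/Octave kernel
    A = put_local(A, Aloc);        % write the block back into the distributed array
    B = agg(A);                    % gather the full array on the leader process

Such a script would be launched with the pRUN call shown on slide 7.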
    4. IBM Blue Gene/P System
       • Core speed: 850 MHz
       • LLGrid core count: ~1K
       • Blue Gene/P core count: ~300K

    5. Blue Gene Application Paths
       [Diagram: in the Blue Gene environment, serial and pleasantly parallel apps follow the High Throughput Computing (HTC) path, while highly scalable message passing apps follow the High Performance Computing (MPI) path]
       • High Throughput Computing (HTC)
         - Enables a BG partition for many single-node jobs
         - Ideal for "pleasantly parallel" applications

    6. HTC Node Modes on BG/P
       • Symmetrical Multiprocessing (SMP) mode
         - One process per compute node
         - Full node memory available to the process
       • Dual mode
         - Two processes per compute node
         - Half of the node memory per process
       • Virtual Node (VN) mode
         - Four processes per compute node (one per core)
         - 1/4 of the node memory per process

    7. Porting pMatlab to BG/P System
       • Requesting and booting a BG partition in HTC mode
         - Execute the qsub command: define the number of processes, the runtime, and the HTC boot script
           (htcpartition --trace 7 --boot --mode dual --partition $COBALT_PARTNAME)
         - Wait until the partition boot completes
       • Running jobs
         - Create and execute a Unix shell script that runs a series of submit commands, such as:
           submit -mode dual -pool ANL-R00-M1-512 -cwd /path/to/working/dir -exe /path/to/octave -env LD_LIBRARY_PATH=/home/cbyun/lib -args "--traditional MatMPI/MatMPIdefs523.m"
       • Combine the two steps (see the sketch below):
           eval(pRUN('m_file', Nprocs, 'bluegene-smp'))
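A purely illustrative sketch of how the two steps above could be wrapped from MATLAB/Octave. This is not the actual pRUN implementation; the function name launch_bgp_job, the boot.sh wrapper script, and the qsub options are assumptions, while the htcpartition and submit command lines are the ones shown on this slide.

    % Hypothetical launcher sketch; not the actual pRUN implementation.
    function launch_bgp_job(workdir)
      % Step 1: request and boot a partition in HTC dual mode.  boot.sh is a
      % placeholder script that runs:
      %   htcpartition --trace 7 --boot --mode dual --partition $COBALT_PARTNAME
      system('qsub -n 512 -t 60 boot.sh');   % Cobalt options here are illustrative
      % Step 2: once the partition is booted, submit one Octave process per rank.
      cmd = ['submit -mode dual -pool ANL-R00-M1-512 ' ...
             '-cwd ' workdir ' -exe /path/to/octave ' ...
             '-env LD_LIBRARY_PATH=/home/cbyun/lib ' ...
             '-args "--traditional MatMPI/MatMPIdefs523.m"'];
      system(cmd);
    end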
    8. Outline
       • Introduction
       • Performance Studies
         - Single Process Performance
         - Point-to-Point Communication
         - Scalability
       • Optimization for Large Scale Computation
       • Summary

    9. Performance Studies
       • Single Processor Performance
         - MandelBrot
         - ZoomImage
         - Beamformer
         - Blurimage
         - Fast Fourier Transform (FFT)
         - High Performance LINPACK (HPL)
       • Point-to-Point Communication
         - pSpeed
       • Scalability
         - Parallel Stream Benchmark: pStream

    10. Single Process Performance: Intel Xeon vs. IBM PowerPC 450
        * The conv2() performance issue in Octave has been improved in a subsequent release.
        [Chart (lower is better): runtimes for HPL, MandelBrot, ZoomImage*, Beamformer, Blurimage*, and FFT, with labeled values of 26.4 s, 11.9 s, 18.8 s, 6.1 s, 1.5 s, and 5.2 s]
    11. Octave Performance with Optimized BLAS
        [Chart: DGEMM performance comparison; x-axis: matrix size (N x N), y-axis: MFLOPS]
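For reference, a DGEMM rate like the one charted here can be estimated with a few lines of Octave/MATLAB. The snippet below is a generic sketch, not the deck's benchmark script; the matrix size is illustrative, and 2*N^3 is the standard flop count for a dense matrix multiply.

    % Generic DGEMM timing sketch (illustrative).
    N = 2048;                    % matrix size (N x N)
    A = rand(N);  B = rand(N);   % dense double-precision operands
    t = tic;
    C = A * B;                   % dispatched to the BLAS dgemm routine
    elapsed = toc(t);
    mflops = 2 * N^3 / elapsed / 1e6;
    fprintf('DGEMM: N = %d, %.1f MFLOPS\n', N, mflops);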
    12. Single Process Performance: Stream Benchmark
        [Chart (higher is better): performance relative to Matlab for the Stream kernels Scale (b = q*c), Add (c = a + c), and Triad (a = b + q*c), with labeled rates of 944 MB/s, 1208 MB/s, and 996 MB/s]
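The Stream kernels named in the chart are one-line array operations. The sketch below (illustrative, not the deck's benchmark code) estimates the Triad rate in MB/s, counting three 8-byte arrays of memory traffic per element; the array size is an assumption.

    % Stream Triad timing sketch (illustrative).
    N = 2^25;  q = 3.14;
    b = rand(N, 1);  c = rand(N, 1);
    t = tic;
    a = b + q * c;                          % Triad kernel: a = b + q*c
    elapsed = toc(t);
    mb_per_s = 3 * 8 * N / elapsed / 1e6;   % two reads + one write, 8 bytes each
    fprintf('Triad: %.0f MB/s\n', mb_per_s);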
    13. Point-to-Point Communication
        • pMatlab example: pSpeed
          - Send/receive messages to/from the neighbor process (see the sketch below)
          - Messages are files in pMatlab
        [Diagram: processes Pid = 0, 1, 2, 3, ..., Np-1 exchanging messages with their neighbors]
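A minimal sketch of a pSpeed-style neighbor exchange (not the actual benchmark source). MPI_Send and MPI_Recv are written with assumed MatlabMPI-style signatures, and Pid, Np, and comm are assumed to be provided by the pMatlab/MatlabMPI environment; in MatlabMPI each send becomes a file that the receiver reads back.

    % Neighbor exchange sketch (illustrative; MatlabMPI call signatures are assumed).
    msg   = rand(2^20, 1);                 % ~8 MB payload
    right = mod(Pid + 1, Np);              % neighbor to send to
    left  = mod(Pid - 1 + Np, Np);         % neighbor to receive from
    tag   = 1;
    t = tic;
    MPI_Send(right, tag, comm, msg);       % message is written out as a file
    recvd = MPI_Recv(left, tag, comm);     % neighbor's message file is read back
    elapsed = toc(t);
    fprintf('Pid %d: %.1f MB/s\n', Pid, 8 * numel(msg) / elapsed / 1e6);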
    14. Filesystem Consideration
        • A single NFS-shared disk (Mode S)
        • A group of cross-mounted, NFS-shared disks to distribute messages (Mode M)
        [Diagram: the same processes Pid = 0 ... Np-1 writing their messages to one shared disk (Mode S) versus to distributed, cross-mounted disks (Mode M)]

    15. pSpeed Performance on LLGrid: Mode S
        [Chart: higher is better]

    16. pSpeed Performance on LLGrid: Mode M
        [Charts: one panel where lower is better, one where higher is better]

    17. pSpeed Performance on BG/P
        • BG/P filesystem: GPFS

    18. pStream Results with Scaled Size
        • SMP mode: initial global array size of 2^25 for Np=1
          - The global array size scales proportionally as the number of processes increases (1024x1)
        • 563 GB/sec with 100% parallel efficiency at Np = 1024
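The 100% figure follows from the usual weak-scaling definition, efficiency = BW(Np) / (Np x BW(1)). In the arithmetic below, the single-process Triad rate is back-computed from the slide's numbers and is an assumption, not a value reported in the deck.

    % Weak-scaling efficiency bookkeeping (illustrative).
    bw_1    = 0.55;                  % GB/s per process; assumed, back-computed value
    bw_1024 = 563;                   % GB/s aggregate at Np = 1024 (from the slide)
    eff = bw_1024 / (1024 * bw_1);   % ~1.0, i.e. ~100% parallel efficiency
    fprintf('Weak-scaling efficiency at Np = 1024: %.0f%%\n', 100 * eff);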
    19. pStream Results with Fixed Size
        • Global array size of 2^30
          - The number of processes scaled up to 16384 (4096x4)
        • VN: 8.832 TB/sec, 101% efficiency at Np=16384
        • DUAL: 4.333 TB/sec, 96% efficiency at Np=8192
        • SMP: 2.208 TB/sec, 98% efficiency at Np=4096

    20. Outline
        • Introduction
        • Performance Studies
        • Optimization for Large Scale Computation
          - Aggregation
        • Summary

    21. Current Aggregation Architecture
        • The leader process receives all the distributed data from the other processes.
        • All other processes send their portion of the distributed data to the leader process.
        • The process is inherently sequential (see the sketch below).
          - The leader receives Np-1 messages.
        [Diagram: Np = 8 (Np: total number of processes); every non-leader process sends its message directly to the leader]
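A sketch of the sequential collection pattern described above; it is not the actual agg() source. The MPI_Send/MPI_Recv calls use assumed MatlabMPI-style signatures, and local_part stands in for this process's block of the distributed array.

    % Sequential leader-collects-all sketch (illustrative).
    tag = 100;
    if Pid == 0
      pieces = cell(Np, 1);
      pieces{1} = local_part;                       % leader's own block
      for src = 1:Np-1
        pieces{src+1} = MPI_Recv(src, tag, comm);   % Np-1 receives, one at a time
      end
    else
      MPI_Send(0, tag, comm, local_part);           % every other process sends once
    end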
    22. Binary-Tree Based Aggregation
        • BAGG: distributed message collection using a binary tree
          - At each level, half of the remaining processes send their message to a partner process, which receives and merges it; the pattern repeats among the receivers (see the sketch below).
        • The maximum number of messages a process may send or receive is N, where Np = 2^N.
        [Diagram: for Np = 8, processes 0-7 reduce pairwise to 0, 2, 4, 6, then to 0 and 4, and finally to the leader, 0]
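A minimal sketch of the binary-tree collection (BAGG-style), assuming Np = 2^N, the same assumed MatlabMPI-style calls and variables as in the previous sketch, and simple concatenation standing in for the real block bookkeeping.

    % Binary-tree aggregation sketch (illustrative).  At each level half of the
    % remaining processes send their block to a partner and drop out, so a
    % process sends or receives at most log2(Np) messages.
    tag  = 200;
    data = local_part;
    step = 1;
    while step < Np
      if mod(Pid, 2*step) == 0
        data = [data, MPI_Recv(Pid + step, tag, comm)];   % receive and merge
      else
        MPI_Send(Pid - step, tag, comm, data);            % send, then drop out
        break;
      end
      step = 2 * step;
    end
    % After the loop, Pid 0 holds the aggregated data.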
    23. BAGG() Performance
        • Two-dimensional data and process distribution
        • Two different file systems are used for the performance comparison
          - IBRIX: file system for users' home directories
          - LUSTRE: parallel file system for all computation
        • IBRIX: 10x faster at Np=128; LUSTRE: 8x faster at Np=128

    24. BAGG() Performance, 2
        • Four-dimensional data and process distribution
        • With the GPFS file system on the IBM Blue Gene/P system (ANL's Surveyor)
          - From 8 processes to 1024 processes
        • 2.5x faster at Np=1024

    25. Generalizing Binary-Tree Based Aggregation
        • HAGG: extend the binary tree to the next power of two
          - Suppose that Np = 6; the next power of two is Np* = 8
          - Skip any messages from/to the fictitious Pid's (see the sketch below)
        [Diagram: the Np* = 8 binary tree (0-7 reducing through 0, 2, 4, 6 and 0, 4 to the leader, 0), with the Pids at or beyond Np treated as fictitious]
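A sketch of the extended-tree idea on this slide: pad Np up to the next power of two, run the same binary-tree schedule, and skip any exchange whose partner is a fictitious Pid (one at or beyond Np). The variables and call signatures are the same assumptions as in the BAGG sketch above.

    % Extended binary-tree (HAGG-style) sketch for arbitrary Np (illustrative).
    NpStar = 2^ceil(log2(Np));            % e.g. Np = 6  ->  NpStar = 8
    tag  = 300;
    data = local_part;
    step = 1;
    while step < NpStar
      if mod(Pid, 2*step) == 0
        partner = Pid + step;
        if partner < Np                   % fictitious partner: skip the message
          data = [data, MPI_Recv(partner, tag, comm)];
        end
      else
        MPI_Send(Pid - step, tag, comm, data);
        break;
      end
      step = 2 * step;
    end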
    26. BAGG() vs. HAGG()
        • HAGG() generalizes BAGG()
          - Removes the restriction (Np = 2^N) in BAGG()
          - Adds some bookkeeping cost
        • Performance comparison on a two-dimensional data and process distribution
        • ~3x faster than AGG() at Np=128
    27. BAGG() vs. HAGG(), 2
        • Performance comparison on a four-dimensional data and process distribution
        • The performance difference is marginal in a dedicated environment
          - SMP mode on the IBM Blue Gene/P system

    28. BAGG() Performance with Crossmounts
        • Significant performance improvement by reducing resource contention on the file system
          - Performance is jittery because a production cluster was used for the tests
        [Chart: lower is better]

    29. Summary
        • pMatlab has been ported to the IBM Blue Gene/P system
        • Clock-normalized, single-process performance of Octave on BG/P is on par with Matlab
        • For pMatlab point-to-point communication (pSpeed), file system performance is important
          - Performance is as expected with GPFS on BG/P
        • The Parallel Stream Benchmark scaled to 16384 processes
        • Developed a new pMatlab aggregation function that uses a binary tree to scale beyond 1024 processes
