This document discusses implementing heterogeneous MPI and hybrid parallel computation in climate modeling components. It describes creating multiple MPI communicators to allow some tasks to perform MPI computations while others perform hybrid OpenMP computations simultaneously. Initial experiments on AIX showed idle MPI tasks consuming resources, but workarounds using AIX settings like MP_WAIT_MODE improved performance. Simple tests integrating MPI and hybrid modes for pi calculation showed good strong scaling up to 4 nodes with little difference in execution time between modes. Future work involves more substantial computations, dynamic interrupt settings, and portability testing on other platforms.
Beyond Task Geometry (Mike Page, ScicomP 14)
1. Beyond
Task Geometry
Mike Page
ScicomP 14
Poughkeepsie, New York
May 22, 2008
NCAR/CISL/HSS
Consulting Services Group
mpage@ucar.edu
2. NCAR CCSM with Task
Geometry Support in LSF
Mike Page
ScicomP 11 Conference
Edinburgh, Scotland
June 1, 2005
NCAR/CISL/SCD
Consulting Services Group
mpage@ucar.edu
3. Description of CCSM3
Concurrent Model with Version 6 coupler
Requires TASK_GEOMETRY support in the batch management subsystem if any of the components run in hybrid mode
4. Coupler
CCSM3 with Cpl6: Concurrency of Components

Coupler:
do i=1,ndays          ! days to run
  do j=1,24           ! hours
    if (j.eq.1) call ocn_send()
    call lnd_send()
    call ice_send()
    call ice_recv()
    call lnd_recv()
    call atm_send()
    if (j.eq.24) call ocn_recv()
    call atm_recv()
  enddo
enddo

General Physical Component:
do i=1,ndays
  do j=1,24
    call compute_stuff_1()
    call cpl_recv()
    call compute_stuff_2()
    call cpl_send()
    call compute_stuff_3()
  enddo
enddo

[Figure: busy/idle timeline of the OCN, ATM, LND, ICE and CPL components over one simulated day]
Courtesy Jon Wolfe
5. Coupler
CCSM3 with Cpl6: Concurrency of Components
(repeats slide 4)
6. Features and Issues of Concurrent Applications
• Features
  • Plug-in/plug-out components
  • Good paradigm for multiphysics, multiscale models
  • Not just climate models
• Issues
  • Load balancing/efficiency
    • Performance depends on the slowest individual component
    • Matching resource allocation to the computational domains of components can aggravate load-balance issues
    • Compounded by increasing processor counts in new and future systems?
  • Portability
    • Task Geometry is not supported by all systems
    • Other vendor-specific functionality
7. Working Around the Issues, Retaining the Features of Concurrent Applications
• Load balancing
  • Refactor the way the coupler coordinates communications and component execution
    • Concurrent execution (cpl6)
    • Hybrid sequential/concurrent (cpl7): may still face load-balance issues
    • Sequential execution of components (cpl7): depends on uniformity of scaling
• Portability
  • Eliminate the need for Task Geometry
    • Everything MPI?
    • Everything hybrid?
    • Are other methods possible?
  • Avoid vendor-specific features
8. Refactoring the Coupler
It Helps to Look at the Problem Sideways
[Figure: PE Sets 1 through 5 laid out across the page, with Time running down the vertical axis]
9. Rethinking the CCSM3 Coupler
CPL6 -> CPL7 + DRIVER

Current single-executable concurrent CCSM:
  CAM | CLM | CICE | POP | CPL (components side by side on disjoint processor sets)

Sequential CCSM:
  DRIVER runs CPL, CAM, CLM, CICE and POP in turn.
  No Task Geometry required if all components are pure MPI.

Hybrid Sequential/Concurrent CCSM:
  DRIVER runs CPL, CAM, POP, CLM and CICE.
  Vary the task configuration if scalability is uneven, to improve load balance.

Courtesy John Dennis
10. Is it possible, in this application model, to get around the all-hybrid/all-mpi/Task Geometry requirement(s)?
How about using both full-mpi and hybrid in a single component?
I.e., is it possible to switch between mpi and hybrid computational modes across or within the same program module?
11. To rephrase and augment the question:
Can code like this
• run across multiple SMP nodes?
• exhibit good performance, efficiency and portability?

Some_Main_or_Subroutine
  .
  .
  Loop
    .
    .
    call compute_something_by_mpi
    .
    .
    call compute_something_by_hyb
    .
    .
  End Loop

Experiments so far are encouraging
12. Implementation of heterogeneous full-mpi/hybrid computation in a sequential system
1) Create multiple MPI communicators
  • Default communicator
  • Communicator for MPI computations
    • Same task count as the default communicator
  • Communicator for hybrid computations
    • num_hyb_threads = OMP_NUM_THREADS (from the environment)
    • Include every OMP_NUM_THREADSth task from the default communicator
2) Loop
  2a) MPI computations
    • Set OMP_NUM_THREADS=1
    • All tasks call compute_something_by_mpi
    • MPI_BARRIER (default communicator)
  2b) Hybrid computations
    • Set OMP_NUM_THREADS = num_hyb_threads
    • If the task is a member of the hybrid communicator, call compute_something_by_hyb
    • MPI_BARRIER (default communicator)
  End loop (extra points: make 2a and 2b call same ...)
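The membership rule for the hybrid communicator in step 1 can be sketched as the "color" each rank would pass to an MPI_Comm_split-style call. A minimal sketch in Python, since the slide shows no actual code: the function names and the color convention here are assumptions, not the deck's implementation.

```python
def hybrid_split_color(rank, omp_num_threads):
    """Color a rank would pass to MPI_Comm_split (hypothetical helper):
    0 joins the hybrid communicator; None stands in for MPI_UNDEFINED,
    i.e. the rank gets no hybrid communicator and idles in step 2b."""
    return 0 if rank % omp_num_threads == 0 else None

def hybrid_members(ntasks, omp_num_threads):
    """Ranks of the default communicator that call compute_something_by_hyb:
    every OMP_NUM_THREADSth task, per the slide."""
    return [r for r in range(ntasks) if hybrid_split_color(r, omp_num_threads) == 0]
```

For example, with 8 tasks on an 8-way SMP node and 4 OMP threads per hybrid task, only ranks 0 and 4 join the hybrid communicator; the other six block at the MPI_BARRIER until the hybrid phase finishes.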
13. Experiment in heterogeneous full-mpi/hybrid computation on AIX
- Findings -
It is critical to force unused MPI tasks to idle at the mpi_barrier and wait for the OMP computations to complete. Initial runs showed MPI tasks at the mpi_barrier in the hybrid computation consuming about 20% of the CPU cycles needed by the active OMP threads, which seriously degraded the performance of the hybrid computations.
Early attempts at the implementation used mp_flush and/or sleep to force unused MPI tasks to fully idle. mp_flush is non-portable. sleep is also non-portable, and it is not easy to predict how long an idle MPI task needs to sleep.
14. Experiment in heterogeneous full-mpi/hybrid computation on AIX
Workarounds
(Many thanks to Robert Blackmore, IBM)
• Required AIX environment settings
  • MP_WAIT_MODE=NOPOLL
  • MP_CSS_INTERRUPT=YES
• NCAR requirements (bluevista)
  • xlf 11.1 (?)
  • Updated MPI library
Now the idle MPI tasks use 0.2% or less of the available cycles.
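In a job script these settings would simply be exported before launching the run. A sketch: the variable names and values are from the slide; the script shape itself is an assumption, not taken from the deck.

```shell
# AIX POE settings from the slide (script shape is an assumption):
export MP_WAIT_MODE=NOPOLL      # idle tasks block at the barrier instead of spin-polling
export MP_CSS_INTERRUPT=YES     # arriving messages interrupt (wake) the blocked tasks
echo "MP_WAIT_MODE=$MP_WAIT_MODE MP_CSS_INTERRUPT=$MP_CSS_INTERRUPT"
```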
15. Test Results
(Simple and limited, so far)
Compute an integral representation of pi (2,147,483,647 terms), in pure mpi and hybrid (4 omp threads/task) modes.

Execution time (sec), four runs per configuration:

8-way SMP Nodes    MPI      Hybrid
      1           35.40     35.54
                  35.40     35.53
                  35.35     35.50
                  35.49     35.53
      2           18.09     17.85
                  18.01     17.85
                  18.27     17.87
                  18.14     17.89
      3           12.20     11.95
                  12.77     12.04
                  12.05     12.22
                  11.96     11.95
      4            9.83      9.02
                   9.07      9.11
                   9.66      9.18
                   9.70      9.27
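The benchmark is the classic integral representation pi = ∫₀¹ 4/(1+x²) dx, evaluated by a midpoint rule with the terms divided among tasks. A serial Python sketch of that decomposition: the deck shows no code, so the round-robin split and function name are assumptions, and a far smaller term count than the 2,147,483,647 used in the runs keeps it quick.

```python
import math

def partial_pi(rank, ntasks, n):
    """One task's midpoint-rule contribution to pi = integral of 4/(1+x^2) on [0,1].
    Terms are dealt to tasks round-robin (an assumed decomposition)."""
    h = 1.0 / n
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(rank, n, ntasks))

# Emulate 4 MPI tasks plus the final sum-reduction in a single process.
n = 1_000_000
approx = sum(partial_pi(r, 4, n) for r in range(4))
```

In the real code each task would compute its own partial sum and an MPI_REDUCE (or MPI_ALLREDUCE) would combine them; the arithmetic per task is identical, which is why MPI and hybrid modes scale so similarly in the table above.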
16. Future Work
• Integrate more substantial computations into this method
• Make MP_CSS_INTERRUPT dynamic
• Explore other platforms for portability
  • Counterparts to
    • MP_WAIT_MODE=NOPOLL
    • MP_CSS_INTERRUPT=YES
• More and more testing
Three years ago I gave this presentation describing how NCAR had requested support for Task Geometry in LSF so that we could continue to run CCSM as an MPMD implementation.
This shows how the 5 components of CCSM3 collaborate by passing data through a central coupler, hub-and-spoke style. In MPMD codes, component applications can be pulled out and replaced with a different application as long as they obey the interface rules. This implies that the programming model used for a component application in an MPMD ensemble can be chosen without impacting the other components. If at least one component is itself a hybrid application, then Task Geometry is required, because the number of MPI tasks no longer matches the number of processors required. Task Geometry can be used with pure MPI models without any ill effects; it just reinforces the specification of the number of processors, and processors per node, that will be in use. In CCSM the atm and lnd components are usually hybrid models while the others are pure MPI. The hybrid implementation is preferred for performance reasons.
This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources, or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
Issues and benefits of the MPMD model.
I think that the MPMD versus SPMD separation is confusing and really is not the main point. The main point is that in cpl6 we are running all the components on disjoint processor sets and this will only perform well if there is a great deal of concurrency that the science permits. This was somewhat the case for ccsm3 and is really not the case for land/cice/atm for ccsm4. Therefore the cpl6 architecture is very limiting. The cpl7 architecture gives you a lot more flexibility. Any MPMD application can be transformed into an SPMD application.
I would like to redo this slide, since the cpl should really be labeled the driver, and the cpl itself is just one more component. This has not been important in talks to scientists, but it might be more important in a computer-science-oriented talk. This is what is being investigated for a new implementation of the CCSM coupler. Two ideas have surfaced: a sequential coupler and a hybrid sequential/concurrent coupler.
MV - again the goal is not MPMD to SPMD transformation. The goal is to leverage heterogeneous hybrid/pure-mpi transitions just using communicators (and not task geometry) for those components that are running sequentially. We can still leverage task geometry to split pure-mpi versus hybrid in the current cpl7 system for those components that are running concurrently. I don’t think that this point is coming out clearly in your talk. The answer to 2 waits for some experimentation but it’s probably “yes”. I’ve been able to change the answer to 1 from “no” to “well, maybe not”.
MV - should include a driver in the above. This is the most restrictive mode of running cp6. Furthermore, I do not think that the important issue here is MPMD->SPMD transformation but rather full concurrent (on disjoint processors) -> full sequential on same processor performance considerations. How do you leverage optimal performance in a sequential system, when some components run better in hybrid mode and others run better in full-mpi mode? That is the main question. I call this “layered” programming. The example I’ve worked up is still under test.
I would call this title - Implementation of heterogeneous hybrid/full-mpi in a sequential system. Goal is to do this without task geometry. Again MPMD to SPMD is confusing and really not important in this case. Step1a: Create Multiple MPI Communicators (one for each component) Is task_count_wrld the number of components you are using? I call this “layered” programming. The example I’ve worked up is still under test.