Beyond Task Geometry

Mike Page
ScicomP 14
Poughkeepsie, New York
May 22, 2008
NCAR/CISL/HSS
Consulting Services Group
mpage@ucar.edu
NCAR CCSM with Task Geometry Support in LSF

Mike Page
ScicomP 11 Conference
Edinburgh, Scotland
June 1, 2005
NCAR/CISL/SCD
Consulting Services Group
mpage@ucar.edu
Description of CCSM3

Concurrent model with the version 6 coupler (cpl6)

Requires TASK_GEOMETRY support in the batch management
subsystem if any of the components run in hybrid mode
CCSM3 with Cpl6: Concurrency of Components

Coupler:
  do i=1,ndays      ! days to run
    do j=1,24       ! hours
      if (j.eq.1) call ocn_send()
      call lnd_send()
      call ice_send()
      call ice_recv()
      call lnd_recv()
      call atm_send()
      if (j.eq.24) call ocn_recv()
      call atm_recv()
    enddo
  enddo

General Physical Component:
  do i=1,ndays
    do j=1,24
      call compute_stuff_1()
      call cpl_recv()
      call compute_stuff_2()
      call cpl_send()
      call compute_stuff_3()
    enddo
  enddo

[Timeline figure: busy/idle intervals for the OCN, ATM, LND, ICE, and CPL components over one simulated day]

                                              Courtesy Jon Wolfe
Features and Issues of
       Concurrent Applications
• Features
  • Plug-in/Plug-out components
  • Good paradigm for multiphysics, multiscale models
      • Not just climate models
• Issues
  • Load Balancing/Efficiency
      • Performance depends on the slowest individual component
      • Matching resource allocation to the computational domains
        of components can aggravate load balance issues
      • Compounded by increasing processor count in new and
        future systems?
  • Portability
      • Task Geometry not supported by all systems
      • Other vendor-specific functionality
Working Around the Issues, Retaining
the Features of Concurrent Applications
• Load Balancing
   • Refactor the way that the coupler coordinates
       communications and component execution
       • Concurrent execution (cpl6)
       • Hybrid sequential/concurrent (cpl7)
            • May still face load balance issues
       • Sequential execution of components (cpl7)
            • Depends on uniformity of scaling
• Portability
   • Eliminate need for Task Geometry
       • Everything MPI?
       • Everything Hybrid?
       • Are other methods possible?
   • Avoid vendor-specific features
Refactoring the Coupler
            It Helps to Look at the Problem Sideways
[Figure: the components laid out "sideways" as PE Sets 1 through 5 across the page, with time running down the vertical axis]
Rethinking the CCSM3 Coupler
   CPL6 -> CPL7 + DRIVER

Current Single Executable Concurrent CCSM:
  CAM, CLM, CICE, POP, and CPL run concurrently on disjoint processor sets.

Sequential CCSM:
  A DRIVER calls CPL, CAM, CLM, CICE, and POP in turn on the full processor set.
  No Task Geometry is required if all components are pure MPI.

Hybrid Sequential/Concurrent CCSM:
  The DRIVER runs CPL, CAM, CLM, and CICE sequentially while POP runs
  concurrently on its own processor set.
  Vary the task configuration if scalability is uneven to improve load balance.

                                                      Courtesy John Dennis
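
To make the sequential-driver idea concrete, here is a hypothetical sketch (not CCSM source): a single executable whose driver calls each component in turn on the same processor set every coupling step. The *_run routine names and the hourly step count are illustrative stand-ins.

    ! Hypothetical CPL7-style sequential driver: one executable, every
    ! component invoked in turn on the full processor set each step.
    program sequential_driver
      implicit none
      integer :: step, nsteps
      nsteps = 24                     ! e.g., hourly coupling over one day
      do step = 1, nsteps
        call cpl_run(step)            ! coupler: merge/regrid fields
        call cam_run(step)            ! atmosphere
        call clm_run(step)            ! land
        call cice_run(step)           ! sea ice
        call pop_run(step)            ! ocean
      end do
    contains
      subroutine cpl_run(step)        ! empty stand-ins for component work
        integer, intent(in) :: step
      end subroutine
      subroutine cam_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine clm_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine cice_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine pop_run(step)
        integer, intent(in) :: step
      end subroutine
    end program sequential_driver
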
Is it possible, in this application model, to get around the all-hybrid / all-MPI / Task Geometry requirement(s)?

How about using both full-MPI and hybrid in a single component?

i.e., is it possible to switch between MPI and hybrid computational modes across or within the same program module?
To rephrase and augment the question: can code like this
• run across multiple SMP nodes?
• exhibit good performance, efficiency and portability?

          Some_Main_or_Subroutine
            ...
            Loop
              ...
              call compute_something_by_mpi
              ...
              call compute_something_by_hyb
              ...
            End Loop

               Experiments so far are encouraging
Implementation of heterogeneous full-MPI/hybrid
computation in a sequential system

    1) Create multiple MPI communicators
        • Default communicator
        • Communicator for MPI computations
            • Same task count as the default communicator
        • Communicator for hybrid computations
            • num_hyb_threads = OMP_NUM_THREADS (from environment)
            • Include every OMP_NUM_THREADSth task from the default communicator
    2) Loop
        a) MPI computations
            • Set OMP_NUM_THREADS=1
            • All tasks call compute_something_by_mpi
            • MPI_BARRIER (default communicator)
        b) Hybrid computations
            • Set OMP_NUM_THREADS = num_hyb_threads
            • If the task is a member of the hybrid communicator,
              call compute_something_by_hyb
            • MPI_BARRIER (default communicator)
    3) End loop        (extra points: make 2a and 2b call the same routine)
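
A minimal Fortran sketch of this recipe, not taken from the original experiment: it assumes OMP_NUM_THREADS is set in the environment, and compute_something_by_mpi / compute_something_by_hyb are placeholder stubs standing in for real component work. One pass of the loop is shown: build the hybrid communicator from every OMP_NUM_THREADSth task, run a pure-MPI phase on all tasks with one thread each, then run a threaded phase on the hybrid-communicator members only, with barriers on the default communicator separating the phases.

    ! Sketch only: step 1 builds the communicators, steps 2a/2b are one pass
    ! of the MPI / hybrid loop described above.
    program hetero_mpi_hybrid
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, rank, color, hyb_comm, num_hyb_threads
      character(len=16) :: envval

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      ! 1) Hybrid communicator: every OMP_NUM_THREADSth task of the default one
      call get_environment_variable('OMP_NUM_THREADS', envval)
      num_hyb_threads = 1
      if (len_trim(envval) > 0) read(envval, *) num_hyb_threads
      color = MPI_UNDEFINED
      if (mod(rank, num_hyb_threads) == 0) color = 1
      call MPI_Comm_split(MPI_COMM_WORLD, color, rank, hyb_comm, ierr)

      ! 2a) Pure-MPI phase: one thread per task, all tasks participate
      call omp_set_num_threads(1)
      call compute_something_by_mpi(MPI_COMM_WORLD)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      ! 2b) Hybrid phase: only hybrid-communicator members compute, fully threaded
      call omp_set_num_threads(num_hyb_threads)
      if (hyb_comm /= MPI_COMM_NULL) call compute_something_by_hyb(hyb_comm)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)

    contains

      subroutine compute_something_by_mpi(comm)   ! placeholder pure-MPI work
        integer, intent(in) :: comm
        integer :: me, ierr2
        real(8) :: part, total
        call MPI_Comm_rank(comm, me, ierr2)
        part = real(me + 1, 8)
        call MPI_Allreduce(part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr2)
      end subroutine

      subroutine compute_something_by_hyb(comm)   ! placeholder threaded work
        integer, intent(in) :: comm
        integer(8) :: i
        integer :: ierr2
        real(8) :: s
        s = 0.0d0
        !$omp parallel do reduction(+:s)
        do i = 1, 1000000_8
          s = s + 1.0d0 / real(i, 8)
        end do
        !$omp end parallel do
        call MPI_Barrier(comm, ierr2)
      end subroutine

    end program hetero_mpi_hybrid

The remaining requirement, discussed on the next slides, is that tasks waiting in MPI_BARRIER actually sleep rather than poll; that is what the AIX settings below address.
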
Experiment in heterogeneous
full-MPI/hybrid computation on AIX
            - Findings -

It is critical to force unused MPI tasks to idle at the
MPI_BARRIER and wait for the OMP computations to
complete. Initial runs showed MPI tasks at the
MPI_BARRIER in the hybrid computation consuming
about 20% of the CPU cycles needed by the active
OMP threads. This seriously degraded performance of
the hybrid computations.

Early attempts at the implementation used mp_flush
and/or sleep to force unused MPI tasks to fully idle.

mp_flush is non-portable.

sleep is non-portable, and it is also not easy to predict
how long an idle MPI task needs to sleep.
Experiment in heterogeneous
full-MPI/hybrid computation on AIX

           Workarounds
        (Many thanks to Robert Blackmore, IBM)



    • Required AIX environment settings
       • MP_WAIT_MODE=NOPOLL
       • MP_CSS_INTERRUPT=YES
    • NCAR requirements (bluevista)
       • xlf 11.1 (?)
       • Updated MPI library

         Now the idle MPI tasks use
       0.2% or less of available cycles
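
As a side note (not from the slides), a run could sanity-check these settings at startup with standard Fortran environment queries, so that a misconfigured job warns instead of silently wasting cycles at the barriers; the subroutine below is a hypothetical illustration.

    ! Hypothetical startup check for the POE settings listed above.
    subroutine check_poe_settings(rank)
      implicit none
      integer, intent(in) :: rank
      character(len=32) :: wait_mode, css_interrupt
      call get_environment_variable('MP_WAIT_MODE', wait_mode)
      call get_environment_variable('MP_CSS_INTERRUPT', css_interrupt)
      if (rank == 0) then
        if (trim(wait_mode) /= 'NOPOLL' .or. trim(css_interrupt) /= 'YES') then
          print *, 'WARNING: expected MP_WAIT_MODE=NOPOLL and MP_CSS_INTERRUPT=YES;'
          print *, '         idle MPI tasks may poll at barriers and steal cycles.'
        end if
      end if
    end subroutine check_poe_settings
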
Test Results
          (Simple and limited, so far)

    Compute an integral representation of pi (2,147,483,647 terms)
      in pure MPI and hybrid (4 OpenMP threads/task) modes
                         Execution time (sec)
       8-way SMP Nodes    MPI          Hybrid
               1          35.40          35.54
                          35.40          35.53
                          35.35          35.50
                          35.49          35.53

              2           18.09          17.85
                          18.01          17.85
                          18.27          17.87
                          18.14          17.89

              3           12.20          11.95
                          12.77          12.04
                          12.05          12.22
                          11.96          11.95

              4            9.83           9.02
                           9.07           9.11
                           9.66           9.18
                           9.70           9.27
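
For context, a sketch of the sort of kernel behind these numbers, assuming the usual midpoint-rule form pi = integral of 4/(1+x^2) over [0,1] with the term count quoted above; running it with OMP_NUM_THREADS=1 gives the pure-MPI case, and with 4 threads per task the hybrid case. This is an illustration, not the exact benchmark code.

    ! Illustrative pi kernel: each MPI task sums a strided share of the terms,
    ! optionally threading that share with OpenMP; an Allreduce combines them.
    program pi_integral
      use mpi
      implicit none
      integer :: ierr, rank, nranks
      integer(8) :: i, n
      real(8) :: h, x, local_sum, pi_est

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      n = 2147483647_8                 ! term count used in the test above
      h = 1.0d0 / real(n, 8)
      local_sum = 0.0d0

      !$omp parallel do private(x) reduction(+:local_sum)
      do i = rank + 1, n, nranks       ! this task's strided share of the terms
        x = h * (real(i, 8) - 0.5d0)   ! midpoint of subinterval i
        local_sum = local_sum + 4.0d0 / (1.0d0 + x*x)
      end do
      !$omp end parallel do

      call MPI_Allreduce(local_sum, pi_est, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'pi ~', pi_est * h

      call MPI_Finalize(ierr)
    end program pi_integral
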
Future Work
• Integrate more substantial computations into this method
• Make MP_CSS_INTERRUPT dynamic
• Explore other platforms for portability
   • Counterparts to
      • MP_WAIT_MODE=NOPOLL
      • MP_CSS_INTERRUPT=YES
• More and more testing
?

Editor's Notes

  1. Three years ago I gave this presentation describing how NCAR had requested support for Task Geometry in LSF so that we could continue to run CCSM as an MPMD implementation.
  2. This shows how the 5 components of CCSM3 collaborate by passing data through a central coupler, hub-and-spoke style. In MPMD codes, component applications can be pulled out and replaced with a different application as long as they obey interface rules. This implies that the programming model used for a component application in an MPMD ensemble can be chosen without impacting the other components. If at least one component is itself a hybrid application, then task geometry is required because the number of MPI tasks no longer matches the number of processors required. Task geometry can be used with pure MPI models without any ill effects; it just reinforces the specification of the number of processors and the number of processors per node that will be in use. In CCSM the atm and lnd components are usually hybrid models while the others are pure MPI. The hybrid implementation is preferred for performance reasons.
  3. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  4. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  5. Issues and benefits of the MPMD model.
  6. I think that the MPMD versus SPMD separation is confusing and really is not the main point. The main point is that in cpl6 we are running all the components on disjoint processor sets, and this only performs well if the science permits a great deal of concurrency. That was somewhat the case for CCSM3 and is really not the case for land/cice/atm in CCSM4. Therefore the cpl6 architecture is very limiting. The cpl7 architecture gives you a lot more flexibility. Any MPMD application can be transformed into an SPMD application.
  7. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  8. I would like to redo this slide, since the CPL should really be labeled the driver and the CPL itself is just one more component. This has not been important in talks to scientists, but it might be more important in a computer-science-oriented talk. This is what is being investigated for a new implementation of the CCSM coupler. Two ideas have surfaced: a sequential coupler and a hybrid sequential/concurrent coupler.
  9. MV - again the goal is not MPMD to SPMD transformation. The goal is to leverage heterogeneous hybrid/pure-mpi transitions just using communicators (and not task geometry) for those components that are running sequentially. We can still leverage task geometry to split pure-mpi versus hybrid in the current cpl7 system for those components that are running concurrently. I don’t think that this point is coming out clearly in your talk. The answer to 2 waits for some experimentation but it’s probably “yes”. I’ve been able to change the answer to 1 from “no” to “well, maybe not”.
  10. MV - should include a driver in the above. This is the most restrictive mode of running cpl6. Furthermore, I do not think that the important issue here is MPMD->SPMD transformation, but rather the performance trade-off between running the components fully concurrently on disjoint processors and fully sequentially on the same processors. How do you get optimal performance in a sequential system when some components run better in hybrid mode and others run better in full-MPI mode? That is the main question. I call this "layered" programming. The example I've worked up is still under test.
  11. I would call this title: Implementation of heterogeneous hybrid/full-MPI in a sequential system. The goal is to do this without task geometry. Again, MPMD to SPMD is confusing and really not important in this case. Step 1a: Create multiple MPI communicators (one for each component). Is task_count_wrld the number of components you are using? I call this "layered" programming. The example I've worked up is still under test.