Beyond Task Geometry

Mike Page
ScicomP 14
Poughkeepsie, New York
May 22, 2008
NCAR/CISL/HSS
Consulting Services Group
mpage@ucar.edu
NCAR CCSM with Task Geometry Support in LSF

Mike Page
ScicomP 11 Conference
Edinburgh, Scotland
June 1, 2005
NCAR/CISL/SCD
Consulting Services Group
mpage@ucar.edu
Description of CCSM3

Concurrent model with the version 6 coupler (cpl6)

Requires TASK_GEOMETRY support in the batch management
subsystem if any of the components run in hybrid mode
CCSM3 with Cpl6: Concurrency of Components

Coupler:
  do i=1,ndays      ! days to run
    do j=1,24       ! hours
      if (j.eq.1) call ocn_send()
      call lnd_send()
      call ice_send()
      call ice_recv()
      call lnd_recv()
      call atm_send()
      if (j.eq.24) call ocn_recv()
      call atm_recv()
    enddo
  enddo

General Physical Component:
  do i=1,ndays
    do j=1,24
      call compute_stuff_1()
      call cpl_recv()
      call compute_stuff_2()
      call cpl_send()
      call compute_stuff_3()
    enddo
  enddo

[Timeline figure: busy/idle intervals for the OCN, ATM, LND, ICE, and CPL components over one simulated day]

                                              Courtesy Jon Wolfe
Features and Issues of
       Concurrent Applications
• Features
  • Plug-in/Plug-out components
  • Good paradigm for multiphysics, multiscale models
      • Not just climate models
• Issues
  • Load Balancing/Efficiency
      • Performance depends on the slowest individual component
      • Matching resource allocation to the computational domains
        of components can aggravate load balance issues
      • Compounded by increasing processor count in new and
        future systems?
  • Portability
      • Task Geometry not supported by all systems
      • Other vendor-specific functionality
Working Around the Issues, Retaining
the Features of Concurrent Applications
• Load Balancing
   • Refactor the way that the coupler coordinates
       communications and component execution
       • Concurrent execution (cpl6)
       • Hybrid sequential/concurrent (cpl7)
            • May still face load balance issues
       • Sequential execution of components (cpl7)
            • Depends on uniformity of scaling
• Portability
   • Eliminate need for Task Geometry
       • Everything MPI?
       • Everything Hybrid?
       • Are other methods possible?
   • Avoid vendor-specific features
Refactoring the Coupler
            It Helps to Look at the Problem Sideways
[Figure: the components laid out "sideways" as PE Sets 1 through 5 across the page, with time running down the vertical axis]
Rethinking the CCSM3 Coupler
   CPL6 -> CPL7 + DRIVER

Current Single Executable Concurrent CCSM:
  CAM, CLM, CICE, POP, and CPL run concurrently on disjoint processor sets.

Sequential CCSM:
  A DRIVER calls CPL, CAM, CLM, CICE, and POP in turn on the full processor set.
  No Task Geometry is required if all components are pure MPI.

Hybrid Sequential/Concurrent CCSM:
  The DRIVER runs CPL, CAM, CLM, and CICE sequentially while POP runs
  concurrently on its own processor set.
  Vary the task configuration if scalability is uneven to improve load balance.

                                                      Courtesy John Dennis
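
To make the sequential-driver idea concrete, here is a hypothetical sketch (not CCSM source): a single executable whose driver calls each component in turn on the same processor set every coupling step. The *_run routine names and the hourly step count are illustrative stand-ins.

    ! Hypothetical CPL7-style sequential driver: one executable, every
    ! component invoked in turn on the full processor set each step.
    program sequential_driver
      implicit none
      integer :: step, nsteps
      nsteps = 24                     ! e.g., hourly coupling over one day
      do step = 1, nsteps
        call cpl_run(step)            ! coupler: merge/regrid fields
        call cam_run(step)            ! atmosphere
        call clm_run(step)            ! land
        call cice_run(step)           ! sea ice
        call pop_run(step)            ! ocean
      end do
    contains
      subroutine cpl_run(step)        ! empty stand-ins for component work
        integer, intent(in) :: step
      end subroutine
      subroutine cam_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine clm_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine cice_run(step)
        integer, intent(in) :: step
      end subroutine
      subroutine pop_run(step)
        integer, intent(in) :: step
      end subroutine
    end program sequential_driver
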
Is it possible, in this application model, to get around the all-hybrid / all-MPI / Task Geometry requirement(s)?

How about using both full-MPI and hybrid in a single component?

i.e., is it possible to switch between MPI and hybrid computational modes across or within the same program module?
To rephrase and augment the question: can code like this
• run across multiple SMP nodes?
• exhibit good performance, efficiency and portability?

          Some_Main_or_Subroutine
            ...
            Loop
              ...
              call compute_something_by_mpi
              ...
              call compute_something_by_hyb
              ...
            End Loop

               Experiments so far are encouraging
Implementation of heterogeneous full-MPI/hybrid
computation in a sequential system

    1) Create multiple MPI communicators
        • Default communicator
        • Communicator for MPI computations
            • Same task count as the default communicator
        • Communicator for hybrid computations
            • num_hyb_threads = OMP_NUM_THREADS (from environment)
            • Include every OMP_NUM_THREADSth task from the default communicator
    2) Loop
        a) MPI computations
            • Set OMP_NUM_THREADS=1
            • All tasks call compute_something_by_mpi
            • MPI_BARRIER (default communicator)
        b) Hybrid computations
            • Set OMP_NUM_THREADS = num_hyb_threads
            • If the task is a member of the hybrid communicator,
              call compute_something_by_hyb
            • MPI_BARRIER (default communicator)
    3) End loop        (extra points: make 2a and 2b call the same routine)
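
A minimal Fortran sketch of this recipe, not taken from the original experiment: it assumes OMP_NUM_THREADS is set in the environment, and compute_something_by_mpi / compute_something_by_hyb are placeholder stubs standing in for real component work. One pass of the loop is shown: build the hybrid communicator from every OMP_NUM_THREADSth task, run a pure-MPI phase on all tasks with one thread each, then run a threaded phase on the hybrid-communicator members only, with barriers on the default communicator separating the phases.

    ! Sketch only: step 1 builds the communicators, steps 2a/2b are one pass
    ! of the MPI / hybrid loop described above.
    program hetero_mpi_hybrid
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, rank, color, hyb_comm, num_hyb_threads
      character(len=16) :: envval

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      ! 1) Hybrid communicator: every OMP_NUM_THREADSth task of the default one
      call get_environment_variable('OMP_NUM_THREADS', envval)
      num_hyb_threads = 1
      if (len_trim(envval) > 0) read(envval, *) num_hyb_threads
      color = MPI_UNDEFINED
      if (mod(rank, num_hyb_threads) == 0) color = 1
      call MPI_Comm_split(MPI_COMM_WORLD, color, rank, hyb_comm, ierr)

      ! 2a) Pure-MPI phase: one thread per task, all tasks participate
      call omp_set_num_threads(1)
      call compute_something_by_mpi(MPI_COMM_WORLD)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      ! 2b) Hybrid phase: only hybrid-communicator members compute, fully threaded
      call omp_set_num_threads(num_hyb_threads)
      if (hyb_comm /= MPI_COMM_NULL) call compute_something_by_hyb(hyb_comm)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)

    contains

      subroutine compute_something_by_mpi(comm)   ! placeholder pure-MPI work
        integer, intent(in) :: comm
        integer :: me, ierr2
        real(8) :: part, total
        call MPI_Comm_rank(comm, me, ierr2)
        part = real(me + 1, 8)
        call MPI_Allreduce(part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr2)
      end subroutine

      subroutine compute_something_by_hyb(comm)   ! placeholder threaded work
        integer, intent(in) :: comm
        integer(8) :: i
        integer :: ierr2
        real(8) :: s
        s = 0.0d0
        !$omp parallel do reduction(+:s)
        do i = 1, 1000000_8
          s = s + 1.0d0 / real(i, 8)
        end do
        !$omp end parallel do
        call MPI_Barrier(comm, ierr2)
      end subroutine

    end program hetero_mpi_hybrid

The remaining requirement, discussed on the next slides, is that tasks waiting in MPI_BARRIER actually sleep rather than poll; that is what the AIX settings below address.
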
Experiment in heterogeneous
full-MPI/hybrid computation on AIX
            - Findings -

It is critical to force unused MPI tasks to idle at the
MPI_BARRIER and wait for the OMP computations to
complete. Initial runs showed MPI tasks at the
MPI_BARRIER in the hybrid computation consuming
about 20% of the CPU cycles needed by the active
OMP threads. This seriously degraded performance of
the hybrid computations.

Early attempts at the implementation used mp_flush
and/or sleep to force unused MPI tasks to fully idle.

mp_flush is non-portable.

sleep is non-portable, and it is also not easy to predict
how long an idle MPI task needs to sleep.
Experiment in heterogeneous
full-MPI/hybrid computation on AIX

           Workarounds
        (Many thanks to Robert Blackmore, IBM)



    • Required AIX environment settings
       • MP_WAIT_MODE=NOPOLL
       • MP_CSS_INTERRUPT=YES
    • NCAR requirements (bluevista)
       • xlf 11.1 (?)
       • Updated MPI library

         Now the idle MPI tasks use
       0.2% or less of available cycles
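
As a side note (not from the slides), a run could sanity-check these settings at startup with standard Fortran environment queries, so that a misconfigured job warns instead of silently wasting cycles at the barriers; the subroutine below is a hypothetical illustration.

    ! Hypothetical startup check for the POE settings listed above.
    subroutine check_poe_settings(rank)
      implicit none
      integer, intent(in) :: rank
      character(len=32) :: wait_mode, css_interrupt
      call get_environment_variable('MP_WAIT_MODE', wait_mode)
      call get_environment_variable('MP_CSS_INTERRUPT', css_interrupt)
      if (rank == 0) then
        if (trim(wait_mode) /= 'NOPOLL' .or. trim(css_interrupt) /= 'YES') then
          print *, 'WARNING: expected MP_WAIT_MODE=NOPOLL and MP_CSS_INTERRUPT=YES;'
          print *, '         idle MPI tasks may poll at barriers and steal cycles.'
        end if
      end if
    end subroutine check_poe_settings
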
Test Results
          (Simple and limited, so far)

    Compute an integral representation of pi (2,147,483,647 terms)
      in pure MPI and hybrid (4 OpenMP threads/task) modes
                         Execution time (sec)
       8-way SMP Nodes    MPI          Hybrid
               1          35.40          35.54
                          35.40          35.53
                          35.35          35.50
                          35.49          35.53

              2           18.09          17.85
                          18.01          17.85
                          18.27          17.87
                          18.14          17.89

              3           12.20          11.95
                          12.77          12.04
                          12.05          12.22
                          11.96          11.95

              4            9.83           9.02
                           9.07           9.11
                           9.66           9.18
                           9.70           9.27
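
For context, a sketch of the sort of kernel behind these numbers, assuming the usual midpoint-rule form pi = integral of 4/(1+x^2) over [0,1] with the term count quoted above; running it with OMP_NUM_THREADS=1 gives the pure-MPI case, and with 4 threads per task the hybrid case. This is an illustration, not the exact benchmark code.

    ! Illustrative pi kernel: each MPI task sums a strided share of the terms,
    ! optionally threading that share with OpenMP; an Allreduce combines them.
    program pi_integral
      use mpi
      implicit none
      integer :: ierr, rank, nranks
      integer(8) :: i, n
      real(8) :: h, x, local_sum, pi_est

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      n = 2147483647_8                 ! term count used in the test above
      h = 1.0d0 / real(n, 8)
      local_sum = 0.0d0

      !$omp parallel do private(x) reduction(+:local_sum)
      do i = rank + 1, n, nranks       ! this task's strided share of the terms
        x = h * (real(i, 8) - 0.5d0)   ! midpoint of subinterval i
        local_sum = local_sum + 4.0d0 / (1.0d0 + x*x)
      end do
      !$omp end parallel do

      call MPI_Allreduce(local_sum, pi_est, 1, MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'pi ~', pi_est * h

      call MPI_Finalize(ierr)
    end program pi_integral
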
Future Work
• Integrate more substantial computations into this method
• Make MP_CSS_INTERRUPT dynamic
• Explore other platforms for portability
   • Counterparts to
      • MP_WAIT_MODE=NOPOLL
      • MP_CSS_INTERRUPT=YES
• More and more testing
?

Editor's Notes

  1. Three years ago I gave this presentation describing how NCAR had requested support for Task Geometry in LSF so that we could continue to run CCSM as an MPMD implementation.
  2. This shows how the 5 components of CCSM3 collaborate by passing data through a central coupler, hub-and-spoke style. In MPMD codes, component applications can be pulled out and replaced with a different application as long as they obey interface rules. This implies that the programming model used for a component application in an MPMD ensemble can be chosen without impacting the other components. If at least one component is itself a hybrid application, then task geometry is required because the number of MPI tasks no longer matches the number of processors required. Task geometry can be used with pure MPI models without any ill effects; it just reinforces the specification of the number of processors and the number of processors per node that will be in use. In CCSM the atm and lnd components are usually hybrid models while the others are pure MPI. The hybrid implementation is preferred for performance reasons.
  3. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  4. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  5. Issues and benefits of the MPMD model.
  6. I think that the MPMD versus SPMD separation is confusing and really is not the main point. The main point is that in cpl6 we are running all the components on disjoint processor sets, and this only performs well if the science permits a great deal of concurrency. That was somewhat the case for CCSM3 and is really not the case for land/cice/atm in CCSM4. Therefore the cpl6 architecture is very limiting. The cpl7 architecture gives you a lot more flexibility. Any MPMD application can be transformed into an SPMD application.
  7. This slide shows how the component applications interact by passing data during the course of one day of simulated time. Note the amount of idle time incurred by some components while waiting for results from another component to be passed through the coupler. This is a case in which the computational load is unbalanced. The imbalance can be the result of poor allocation of computational resources or of requirements that the computational grid of one or more of the components imposes on the allocation of computational resources. Finding an efficient processor layout (fed into Task Geometry) can require some experimentation.
  8. I would like to redo this slide, since the CPL should really be labeled the driver and the CPL itself is just one more component. This has not been important in talks to scientists, but it might be more important in a computer-science-oriented talk. This is what is being investigated for a new implementation of the CCSM coupler. Two ideas have surfaced: a sequential coupler and a hybrid sequential/concurrent coupler.
  9. MV - again the goal is not MPMD to SPMD transformation. The goal is to leverage heterogeneous hybrid/pure-mpi transitions just using communicators (and not task geometry) for those components that are running sequentially. We can still leverage task geometry to split pure-mpi versus hybrid in the current cpl7 system for those components that are running concurrently. I don’t think that this point is coming out clearly in your talk. The answer to 2 waits for some experimentation but it’s probably “yes”. I’ve been able to change the answer to 1 from “no” to “well, maybe not”.
  10. MV - should include a driver in the above. This is the most restrictive mode of running cpl6. Furthermore, I do not think that the important issue here is MPMD->SPMD transformation, but rather the performance trade-off between running the components fully concurrently on disjoint processors and fully sequentially on the same processors. How do you get optimal performance in a sequential system when some components run better in hybrid mode and others run better in full-MPI mode? That is the main question. I call this "layered" programming. The example I've worked up is still under test.
  11. I would call this title: Implementation of heterogeneous hybrid/full-MPI in a sequential system. The goal is to do this without task geometry. Again, MPMD to SPMD is confusing and really not important in this case. Step 1a: Create multiple MPI communicators (one for each component). Is task_count_wrld the number of components you are using? I call this "layered" programming. The example I've worked up is still under test.