Progress Toward Accelerating CAM-SE

  • Added outlines to show how these bars relate to our kernels.
    Edge Packing & Unpacking are part of the “Boundary Exchange”. It was designed for maximum MPI scaling with one element per task and one task per core, and needs to be redesigned for a smaller number of more powerful nodes and a lower surface/volume ratio.
    Verremap2 is the “Vertical Remap”.
    “Euler Step” consists of euler_step, divergence_sphere, and limiter2d_zero.
    NOTE: In the application, a boundary exchange occurs inside of euler_step.
    “Laplace Sphere Weak” is a call to divergence_sphere_wk and gradient_sphere.

Transcript

  • 1. Progress Toward Accelerating CAM-SE.
    Jeff Larkin <larkin@cray.com>
    Along with:
    Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
  • 2. Background
    In 2009 ORNL asked many of their top users: What sort of science would you do on a 20 Petaflops machine in 2012?
    Answer to come on next slide
    Center for Accelerated Application Research (CAAR) established to determine:
    Can a set of codes from various disciplines be made to effectively use GPU accelerators with the combined efforts of domain scientists and vendors?
    Each team has a science lead, code lead, members from ORNL, Cray, Nvidia, and elsewhere
  • 3. CAM-SE Target Problem
    1/8 degree CAM, using CAM-SE dynamical core and Mozart tropospheric chemistry.
    Why is acceleration needed to “do” the problem?
    When including all the tracers associated with Mozart atmospheric chemistry, the simulation is too expensive to run at high resolution on today’s systems.
    What unrealized parallelism needs to be exposed?
    In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).
  • 4. Profile of Runtime
    [Chart: % of runtime by routine]
  • 5. Next Steps
    Once the dominant routines were identified, standalone kernels were created for each.
    Early efforts tested PGI and HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL
    Directives-based compilers were too immature at the time
    Poor support for Fortran modules and derived types
    Did not allow implementation at a high enough level
    CUDA Fortran provided good performance while allowing us to remain in Fortran
  • 6. Identifying Parallelism
    HOMME parallelizes both MPI and OpenMP over elements
    Most of the tracer advection can also parallelize over tracers (q) and levels (k)
    Vertical remap is the exception, due to vertical dependence in levels.
    Parallelizing over tracers and sometimes levels while threading over quadrature points (nv) provides ample parallelism within each element to utilize GPU effectively.
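The decomposition described on this slide can be sketched as an index mapping: one block of work per (element, tracer, level) triple, and one thread per quadrature point inside the element. This is only an illustration in Python, not the CUDA Fortran used in the project; the sizes and the function names `block_to_work`, `thread_to_point`, and `enumerate_work` are hypothetical.

```python
# Hypothetical sizes; none of these values come from the slides.
NELEM = 4   # elements owned by this node
QSIZE = 3   # tracers (q)
NLEV = 5    # vertical levels (k)
NV = 4      # quadrature points per element edge (nv)


def block_to_work(block_id):
    """Map a flat block index to an (element, tracer, level) triple."""
    ie, rest = divmod(block_id, QSIZE * NLEV)
    q, k = divmod(rest, NLEV)
    return ie, q, k


def thread_to_point(thread_id):
    """Map a flat thread index to an (i, j) quadrature point."""
    j, i = divmod(thread_id, NV)
    return i, j


def enumerate_work():
    """List every (ie, q, k, i, j) item that such a launch would cover."""
    items = []
    for block_id in range(NELEM * QSIZE * NLEV):
        ie, q, k = block_to_work(block_id)
        for thread_id in range(NV * NV):
            i, j = thread_to_point(thread_id)
            items.append((ie, q, k, i, j))
    return items
```

Every work item appears exactly once, so the amount of independent work is the product of all five extents. For vertical remap the level index would have to stay inside a sequential loop instead of being part of the block index, because of the vertical dependence noted above.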
  • 7. Status
    Euler_step & laplace_sphere_wk were straightforward to rewrite in CUDA Fortran
    Vertical Remap was rewritten to be more amenable to GPU (made it vectorize)
    Resulting code is 2X faster on CPU than original code and has been given back to the community
    Edge Packing/Unpacking for boundary exchange needs to be rewritten (Ilene talked about this already)
    Designed for 1 element per MPI rank, but we plan to run with more
    Once this is node-aware, it can also be device-aware and greatly reduce PCIe transfers
    Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
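The quoted rule of thumb can be made concrete with a back-of-the-envelope calculation. The sketch below compares the time a kernel spends computing with the time spent moving its data across the bus. The function names and all numbers in the usage note are placeholders chosen for illustration, not measurements from CAM-SE.

```python
def transfer_bound(flops, bytes_moved, device_flops_per_s, link_bytes_per_s):
    """Return (compute_s, transfer_s, flops_per_byte) for one kernel call."""
    compute_s = flops / device_flops_per_s
    transfer_s = bytes_moved / link_bytes_per_s
    return compute_s, transfer_s, flops / bytes_moved


def break_even_intensity(device_flops_per_s, link_bytes_per_s):
    """FLOPs per transferred byte at which compute time equals transfer time."""
    return device_flops_per_s / link_bytes_per_s
```

With a placeholder device rate of 500e9 FLOP/s and a placeholder link rate of 5e9 bytes/s, the break-even point is 100 FLOPs per byte transferred. A kernel below that ratio spends more time on the transfer than on the computation, which is why keeping data resident on the device matters.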
  • 8. Status (cont.)
    Kernels were put back into HOMME and validation tests were run and passed
    This version did nothing to reduce data movement, only tested kernel accuracy
    In the process of porting forward to the current trunk and making data movement more intelligent
    Currently reevaluating directives now that compilers have matured
    Directives-based vertical remap now slightly outperforms hand-tuned CUDA
    Still working around derived_type issues
  • 9. Challenges
    Data Structures (Object-Oriented Fortran)
    Every node has an array of element derived types, which contains more arrays
    We only care about some of these arrays, so data movement isn’t very natural
    We must essentially change many non-contiguous CPU arrays into a contiguous GPU array
    Parallelism occurs at various levels of the calltree, not just leaf routines, so compiler must be able to inline leaves in order to use directives
    Cray compiler handles this via whole program analysis; PGI compiler may support this via inline library
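The data-structure problem above, gathering one array out of each element derived type into a single contiguous buffer, can be sketched as a pack/unpack pair. This is a Python stand-in for illustration only; the `Element` class, its `tracer` and `metadata` fields, and the `pack`/`unpack` helpers are hypothetical and do not mirror the actual HOMME types.

```python
class Element:
    """Stand-in for an element derived type holding several arrays."""

    def __init__(self, ie, npts):
        # The one array we want on the device (hypothetical field name).
        self.tracer = [float(ie * npts + p) for p in range(npts)]
        # Data that should stay on the host.
        self.metadata = ["unused"] * npts


def pack(elements, field):
    """Copy one field from every element into a single flat list."""
    flat = []
    for elem in elements:
        flat.extend(getattr(elem, field))
    return flat


def unpack(flat, elements, field):
    """Scatter a flat list back into the per-element arrays."""
    offset = 0
    for elem in elements:
        n = len(getattr(elem, field))
        setattr(elem, field, flat[offset:offset + n])
        offset += n
```

Usage would look like `flat = pack(elems, "tracer")`, one transfer of `flat` to the device, and `unpack(result, elems, "tracer")` afterwards. The packing itself costs host time, which is one reason the slides call this data movement unnatural.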
  • 10. Challenges (cont.)
    CUDA Fortran requires everything live in the same module
    Must duplicate some routines and data structures from several modules in our “cuda_mod”
    Insert ifdefs that hijack CPU routine calls and forward the request to matching cuda_mod routines
    Simple for user, but developer must maintain duplicate routines
    Hey Dave, when will this get changed? ;)
  • 11. Until the Boundary Exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps, but may not be realistic for the full code.
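Why streaming over elements helps can be seen from an idealized pipeline model: if copy-in, kernel, and copy-out for different chunks of elements can fully overlap, only the slowest stage is paid once per chunk. The sketch below assumes constant stage times and perfect overlap, which a real run would not achieve exactly; the function names and the stage times in the usage note are hypothetical.

```python
def serialized_time(n_chunks, t_in, t_kernel, t_out):
    """Total time when each chunk is copied in, computed, and copied out in turn."""
    return n_chunks * (t_in + t_kernel + t_out)


def streamed_time(n_chunks, t_in, t_kernel, t_out):
    """Idealized time when the three stages overlap across chunks."""
    if n_chunks == 0:
        return 0.0
    slowest = max(t_in, t_kernel, t_out)
    # Fill the pipeline once, then one slowest-stage step per remaining chunk.
    return (t_in + t_kernel + t_out) + (n_chunks - 1) * slowest
```

For four chunks with placeholder stage times of 1, 2, and 1 units, the serialized model gives 16 units and the streamed model gives 10. The gain shrinks when one stage, such as the PCIe transfer, dominates the others.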
  • 12. With data transfer, laplace_sphere_wk is a wash, but since all necessary data is already resident from euler_step, kernel-only time is realistic.
  • 13. Vertical remap rewrite is 2X faster on the CPU and still faster on GPU. All data already resident on device from euler_step, so kernel-only time is realistic.
  • 14. Future Work
    Use CUDA 4.0 dynamic pinning of memory to allow overlapping & better PCIe performance
    Move forward to CAM5/CESM1
    No chance of our work being used otherwise
    Some additional, small kernels are needed to allow data to remain resident
    Cheaper to run these on the GPU than to copy the data
    Reprofile with accelerated application to identify next most important routines
    Chemistry implicit solver is expected to be next
    Physics is expected to require mature, directives-based compiler
    Rinse, repeat
  • 15. Conclusions
    Much has been done, much remains
    For a fairly new, cleanly written code, CUDA Fortran was tractable.
    HOMME has very similar loop nests throughout; that was key to making this possible
    Still results in multiple code paths to maintain, so we’d prefer to move to directives for the long-run
    We believe GPU accelerators will be beneficial for the selected problem
    We hope that it will also benefit a wider audience (CAM5 should help this)