Progress Toward Accelerating CAM-SE



  1. Progress Toward Accelerating CAM-SE
     Jeff Larkin <>
     Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
  2. Background
     In 2009 ORNL asked many of their top users: what sort of science would you do on a 20-petaflops machine in 2012? (Answer on the next slide.)
     The Center for Accelerated Application Research (CAAR) was established to determine whether a set of codes from various disciplines can be made to use GPU accelerators effectively through the combined efforts of domain scientists and vendors.
     Each team has a science lead, a code lead, and members from ORNL, Cray, NVIDIA, and elsewhere.
  3. CAM-SE Target Problem
     1/8-degree CAM, using the CAM-SE dynamical core and Mozart tropospheric chemistry.
     Why is acceleration needed to “do” the problem? When all the tracers associated with Mozart atmospheric chemistry are included, the simulation is too expensive to run at high resolution on today’s systems.
     What unrealized parallelism needs to be exposed? In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).
  4. Profile of Runtime
     [Chart: % of runtime by routine]
  5. Next Steps
     Once the dominant routines were identified, standalone kernels were created for each.
     Early efforts tested PGI and HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL.
     Directives-based compilers were too immature at the time: poor support for Fortran modules and derived types, and they did not allow implementation at a high enough level.
     CUDA Fortran provided good performance while allowing us to remain in Fortran.
  6. Identifying Parallelism
     HOMME parallelizes with both MPI and OpenMP over elements.
     Most of the tracer advection can also parallelize over tracers (q) and levels (k); vertical remap is the exception, due to the dependence across vertical levels.
     Parallelizing over tracers (and sometimes levels) while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively.
  7. Status
     euler_step and laplace_sphere_wk were straightforward to rewrite in CUDA Fortran.
     Vertical remap was rewritten to be more amenable to the GPU (made to vectorize); the resulting code is 2X faster on the CPU than the original and has been given back to the community.
     Edge packing/unpacking for the boundary exchange needs to be rewritten (Ilene talked about this already). It was designed for one element per MPI rank, but we plan to run with more; once it is node-aware, it can also be device-aware and greatly reduce PCIe transfers.
     Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
  8. Status (cont.)
     The kernels were put back into HOMME; validation tests were run and passed.
     This version did nothing to reduce data movement; it only tested kernel accuracy.
     We are in the process of porting forward to the current trunk and doing more intelligent data movement.
     We are currently reevaluating directives now that the compilers have matured: a directives-based vertical remap now slightly outperforms the hand-tuned CUDA, though we are still working around derived-type issues.
  9. Challenges
     Data structures (object-oriented Fortran): every node has an array of element derived types, each of which contains more arrays. We only care about some of these arrays, so data movement isn’t very natural; we must essentially gather many non-contiguous CPU arrays into one contiguous GPU array.
     Parallelism occurs at various levels of the call tree, not just in leaf routines, so the compiler must be able to inline leaves in order to use directives. The Cray compiler handles this via whole-program analysis; the PGI compiler may support it via an inline library.
  10. Challenges (cont.)
      CUDA Fortran requires everything to live in the same module.
      We must duplicate some routines and data structures from several modules in our “cuda_mod”, and insert ifdefs that hijack CPU routine calls and forward them to the matching cuda_mod routines.
      This is simple for the user, but the developer must maintain duplicate routines. Hey Dave, when will this get changed? ;)
  11. Until the boundary exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps, but may not be realistic for the full code.
  12. With data transfer included, laplace_sphere_wk is a wash, but since all necessary data is already resident from euler_step, the kernel-only time is realistic.
  13. The vertical remap rewrite is 2X faster on the CPU and still faster on the GPU. All data is already resident on the device from euler_step, so the kernel-only time is realistic.
  14. Future Work
      Use CUDA 4.0’s dynamic pinning of memory to allow overlapping and better PCIe performance.
      Move forward to CAM5/CESM1; there is no chance of our work being used otherwise.
      Some additional small kernels are needed to allow data to remain resident; it is cheaper to run these on the GPU than to copy the data.
      Reprofile the accelerated application to identify the next most important routines. The chemistry implicit solver is expected to be next; the physics is expected to require a mature, directives-based compiler.
      Rinse, repeat.
  15. Conclusions
      Much has been done; much remains.
      For a fairly new, cleanly written code, CUDA Fortran was tractable. HOMME has very similar loop nests throughout, which was key to making this possible. It still results in multiple code paths to maintain, so we’d prefer to move to directives in the long run.
      We believe GPU accelerators will be beneficial for the selected problem, and we hope they will also benefit a wider audience (CAM5 should help with this).