
First Experiences with Parallel Application Development in Fortran 2018

In this deck from the Stanford HPC Conference, Damian Rouson from the Sourcery Institute presents: First Experiences with Parallel Application Development in Fortran 2018.

"The Fortran standards body recently voted to adopt the informal name "Fortran 2018" for the standard that is expected to be submitted for publication this year and was previously known informally as Fortran 2015. At the 2017 HPC Advisory Council meeting at Stanford, we outlined several challenges facing HPC as we approach the exascale era and several Fortran 2018 features that address those challenges. Since then, we have published two brief papers on our first applications of these features in weather models developed at the National Center for Atmospheric Research (NCAR). One is a coarray Fortran mini-app developed to capture the dominant algorithms of NCAR’s Intermediate Complexity Atmospheric Research model. This talk will present performance and scalability results of the mini-app running on several platforms using up to 98,000 cores. A second application involves the use of teams of images (processes) that execute independently for ensembles of computational hydrology simulations using WRF-Hydro, the hydrological component of the Weather Research and Forecasting model, also developed at NCAR. Early experiences with portability and programmability of Fortran 2018 will also be discussed."

Watch the video: https://youtu.be/01-ez4v4YPc

Learn more: http://www.sourceryinstitute.org/
and
http://hpcadvisorycouncil.com

Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter


First Experiences with Parallel Application Development in Fortran 2018

  1. First Experiences with Parallel Application Development in Fortran 2018 — Damian Rouson
  2. Overview: Fortran 2018 in a Nutshell; ICAR & Coarray ICAR; WRF-Hydro; Results; Conclusions
  3. Fortran 2018 in a Nutshell — Motivation: exascale challenges — Background: SPMD & PGAS in Fortran 2018 — Beyond CAF
  4. Exascale Challenges & the Fortran 2018 Response. Billion-way concurrency with high levels of on-chip parallelism: events, collective subroutines, a richer set of atomic subroutines, teams. Expensive data movement: one-sided communication, teams (locality control). Higher failure rates: failed-image detection. Heterogeneous hardware on processor: events. Source: Ang, J.A., Barrett, R.F., Benner, R.E., Burke, D., Chan, C., Cook, J., Donofrio, D., Hammond, S.D., Hemmert, K.S., Kelly, S.M. and Le, H., 2014, November. Abstract machine models and proxy architectures for exascale computing. In Hardware-Software Co-Design for High Performance Computing (Co-HPC), 2014 (pp. 25-32). IEEE.
  5.–11. Single Program Multiple Data (SPMD): Image 1, Image 2, …, Image N ({processes | threads}) execute asynchronously up to a programmer-specified synchronization: sync all, sync images, allocate/deallocate.
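The SPMD execution model and barrier on these slides can be sketched in a few lines; this is a minimal illustration, not code from the deck:

```fortran
! Minimal SPMD sketch: every image executes this same program.
! this_image() and num_images() are intrinsics; "sync all" is a barrier.
program spmd_hello
  implicit none
  print '(a,i0,a,i0)', 'Hello from image ', this_image(), ' of ', num_images()
  sync all   ! images run asynchronously until this programmer-specified barrier
end program spmd_hello
```

Compiled with a coarray-capable compiler (e.g. gfortran with OpenCoarrays, or the Cray compiler) and launched with N images, each image prints its own index.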
  12.–14. Partitioned Global Address Space (PGAS): coarrays integrate with other language features to communicate objects, and Fortran 90 array syntax works on local data. The ability to drop the square brackets has two important implications: easily determine where communication occurs, and parallelize legacy code with minimal revisions.
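A minimal sketch of the coarray notation described above; the variable names are illustrative, not from the deck. Square brackets select an image; dropping them makes the reference purely local:

```fortran
program pgas_sketch
  implicit none
  real :: a(4)[*]       ! each image holds its own a(1:4); [*] declares a coarray
  integer :: me
  me = this_image()
  a = real(me)          ! no brackets: local work, no communication
  sync all              ! ensure every image has defined its a
  if (me == 1) print *, a(1)[num_images()]   ! brackets: one-sided get from the last image
end program pgas_sketch
```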
  15.–19. Image 1, Image 2, Image 3: if (me<n) a(1:2) = a(2:3)[me+1]
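Slides 15–19 animate the one-sided "get" above; a self-contained version of that statement might look like this (array sizes as on the slides, surrounding scaffolding assumed):

```fortran
! Each image except the last copies its right neighbor's a(2:3)
! into its own a(1:2): a halo exchange expressed as a one-sided get.
program halo_get
  implicit none
  real :: a(3)[*]
  integer :: me, n
  me = this_image()
  n  = num_images()
  a  = real(me)                       ! local initialization
  sync all                            ! neighbors must have defined a first
  if (me < n) a(1:2) = a(2:3)[me+1]   ! the statement from the slide
  sync all
end program halo_get
```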
  20.–21. Image 1, Image 2, Image 3: if (me==1) a(5)[2] = a(5)[3]
  22. Segment Ordering: Events. An intrinsic module provides the derived type event_type, which encapsulates an atomic_int_kind integer component default-initialized to zero. An image increments the event count on a remote image by executing event post. The remote image obtains the post count by executing event_query. Image control statement and side effect: event post, atomic_add +1; event_query, defines count; event wait, atomic_add -1.
  23.–31. Events: a "Hello, world!" exchange built up step by step with post, query, and wait on greeting_ready(2:n)[1] and ok_to_overwrite[3].
  32.–33. Performance-oriented constraints: query and wait must be local; post and wait are disallowed in do concurrent constructs. Pro tips: overlap communication and computation; wherever safety permits, query without waiting; write a spin-query-work loop and build a logical mask describing the remaining work.
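A compilable sketch of the post/wait pattern behind the "Hello, world!" slides, assuming at least two images; variable names follow the slides where possible:

```fortran
! Image 2 writes a greeting into image 1's buffer, then posts an event;
! image 1 waits (wait must be local) before reading the buffer.
program event_hello
  use iso_fortran_env, only : event_type
  implicit none
  type(event_type) :: greeting_ready[*]
  character(len=32) :: greeting[*], buffer
  if (this_image() == 2) then
    write(buffer, '(a,i0)') 'Hello from image ', this_image()
    greeting[1] = buffer               ! one-sided put of the greeting
    event post (greeting_ready[1])     ! signal image 1 that the data is ready
  else if (this_image() == 1) then
    event wait (greeting_ready)        ! blocks until the post arrives
    print *, trim(greeting)
  end if
end program event_hello
```

The event post establishes segment ordering, so image 1 is guaranteed to see the greeting that image 2 wrote before posting.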
  34. Teams: a set of images that can readily execute independently of other images. Team 1: Image 1, Image 2, Image 3 with a(1:4)[1], a(1:4)[2], a(1:4)[3]. Team 2: Image 4, Image 5, Image 6 with a(1:4)[4], a(1:4)[5], a(1:4)[6].
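The team partitioning on this slide can be sketched with form team / change team; splitting the images into two halves is an assumption for illustration:

```fortran
! Split the images into two teams; inside "change team" the image
! indices and num_images() are relative to the new team.
program teams_sketch
  use iso_fortran_env, only : team_type
  implicit none
  type(team_type) :: team
  integer :: my_team
  my_team = merge(1, 2, this_image() <= num_images()/2)  ! first half -> team 1
  form team (my_team, team)
  change team (team)
    print '(a,i0,a,i0)', 'team ', team_number(), ', image ', this_image()
  end team
end program teams_sketch
```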
  35. Collective Subroutines. Each non-failed image of the current team must invoke the collective. All collectives have an intent(inout) argument A holding the input data; A may hold the result on return, depending on result_image. Optional arguments: stat, errmsg, result_image, source_image. After the parallel calculation/communication, the result is placed on one or all images. There is no implicit synchronization at the beginning/end, which allows overlap with other actions. No image's data is accessed before that image invokes the collective subroutine.
  36. Extensible Set of Collective Subroutines
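As a concrete instance of the collective interface described above, here is a minimal co_sum call; with result_image specified, only that image is guaranteed to receive the result:

```fortran
! Sum each image's contribution; with result_image=1 only image 1
! is guaranteed to hold the sum afterward.
program co_sum_sketch
  implicit none
  integer :: x
  x = this_image()                    ! each image contributes its index
  call co_sum(x, result_image=1)      ! intent(inout): x is overwritten
  if (this_image() == 1) print *, 'sum of image indices = ', x
end program co_sum_sketch
```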
  37. Performance: User-Defined vs Intrinsic Collectives. Log-scale execution time (sec) vs. number of images (32–512) for co_sum_int32, sum_bin_tree_int32, sum_bin_tree_real64, sum_rec_doubling_int32, sum_rec_doubling_real64, and sum_alpha_tree_real64. Platforms: NERSC Hopper (Xeon nodes on a Cray XE6) and KNL nodes on a Cray XC.
  38.–40. Failure Detection, with Image 1 … Image n holding a(1:4)[1] … a(1:4)[n]:
  use iso_fortran_env, only : STAT_FAILED_IMAGE
  integer :: status
  sync all (stat=status)
  if (status==STAT_FAILED_IMAGE) call fault_tolerant_algorithm()
  41. Fortran 2018 Failed-Image Detection: fail image (simulates a failure); image_status (checks the status of a specific image); failed_images (provides the list of failed images). Coarray operations may fail; the stat= specifier is used to check for correct behavior.
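The stat= pattern on the preceding slides, extended with failed_images(); this is a sketch only, since exercising it requires a compiler that supports image failure:

```fortran
! Detect failed images at a barrier instead of aborting.
program failure_sketch
  use iso_fortran_env, only : STAT_FAILED_IMAGE
  implicit none
  integer :: status
  integer, allocatable :: failed(:)
  sync all (stat=status)                  ! stat= turns failure into a status code
  if (status == STAT_FAILED_IMAGE) then
    failed = failed_images()              ! indices of the failed images
    print *, 'failed images:', failed
    ! a fault-tolerant algorithm would redistribute their work here
  end if
end program failure_sketch
```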
  42. Overview: Fortran 2018 in a Nutshell; ICAR & Coarray ICAR; Results; Conclusions
  43. The Climate Downscaling Problem: precipitation and topography from a climate model (top row) and application needs (bottom row). Computational cost: a high-resolution regional model requires >10 billion core-hours (CONUS at 4 km, 150 years, 40 scenarios).
  44. The Intermediate Complexity Atmospheric Research Model (ICAR): ICAR wind field over topography; analytical solution for flow over topography (right). 90% of the information for 1% of the computational cost. Core numerics: 90% of cost is cloud physics (grid cells independent); 5% of cost is advection (fully explicit, requires local neighbor communication). Gutmann et al. (2016) J. of Hydrometeorology, 17, p. 957. doi:10.1175/JHM-D-15-0155.1
  45.–46. ICAR (Intermediate Complexity Atmospheric Research): animation of water vapor (blue) and precipitation (green to red) over the contiguous United States. Output timestep: 1 hr. Integration timestep: variable (~30 s). Boundary conditions: ERA-Interim (historical). Run on 16 cores with OpenMP; limited scalability.
  47. ICAR comparison to a "full-physics" atmospheric model (WRF). Ideal hill case: ICAR (red) and WRF (blue) precipitation over an idealized hill (green). Real simulation: WRF (left) and ICAR (right) season-total precipitation over the Colorado Rockies.
  48. ICAR Users and Applications: http://github.com/NCAR/icar.git
  49. Coarray ICAR Mini-App: object-oriented design; overlaps communication & computation via one-sided "puts"; coarray halo exchanges with 3D, 1st-order upwind advection (~2000 lines of new code); collective broadcast of initial data; cloud microphysics (~5000 lines of pre-existing code).
  50. Overview: Fortran 2018 in a Nutshell; ICAR & Coarray ICAR; WRF-Hydro; Results; Conclusions
  51. WRF-Hydro: the hydrological model of the Weather Research and Forecasting (WRF) system. Currently, ensemble runs involve redundant initialization calculations that occupy approximately 30% of the runtime and use only the processes allocated to one ensemble member. Our use of Fortran 2018 teams will eliminate the redundancy and amortize that effort across all of the processes allocated for the full ensemble of simulations.
  52. ICAR comparison to a "full-physics" atmospheric model (WRF), revisited. Ideal hill case: ICAR (red) and WRF (blue) precipitation over an idealized hill (green). Real simulation: WRF (left) and ICAR (right) season-total precipitation over the Colorado Rockies.
  53. Mapping Fortran Teams onto MPI: a timeline pairing each Fortran statement or procedure with an MPI procedure or variable. form team(…) maps onto MPI_Comm_split(…) (with the team numbers 1, 2, 3 serving as the color, plus an MPI_Barrier(…)); end team maps onto MPI_Barrier(…); team_number(…) maps onto the color.
  54.–56. Communicator Hand-off (CAF to MPI). CAF in the driver seat: a language extension exposes the MPI communicator, and the CAF compiler embeds MPI_Init and MPI_Finalize. MPI under the hood: only trivial modification of the MPI application. Future work: amortize & speed up initialization.
  57. Overview: Fortran 2018 in a Nutshell; ICAR & Coarray ICAR; WRF-Hydro; Results; Conclusions
  58. Coarray ICAR Simulation Time: execution time vs. number of processes for 500 x 500 x 20 and 2000 x 2000 x 20 grids; OpenSHMEM vs MPI; puts vs gets. GCC 6.3 on the NCAR "Cheyenne" SGI ICE XA cluster: 4032 nodes, 2 x 18-core Xeon Broadwell @ 2.3 GHz.
  59. Coarray ICAR Speedup: speedup vs. number of processes for 500 x 500 x 20 and 2000 x 2000 x 20 grids. GCC 6.3 on the NCAR "Cheyenne" SGI ICE XA cluster: 4032 nodes, 2 x 18-core Xeon Broadwell @ 2.3 GHz.
  60. Coarray ICAR, Xeon vs KNL: performance vs. number of processes. Compilers & platforms: GNU on Cheyenne; Cray compiler on Edison (Broadwell) and Cori (KNL).
  61. Coarray ICAR At Scale: performance vs. number of processes. Cray compiler on Edison.
  62. Conclusions: Fortran 2018 is a PGAS language supporting SPMD programming at scale. High productivity pays off: from shared-memory parallelism to 100,000 cores in ~100 person-hours. Programming-model agnosticism is a life-saver: CAF interoperates seamlessly with MPI.
  63. Acknowledgements & References: Ethan Gutmann, Alessandro Fanfarillo, James McCreight, Brian Friesen. Rouson, D., Gutmann, E. D., Fanfarillo, A., & Friesen, B. (2017, November). Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran. In Proceedings of the Second Annual PGAS Applications Workshop (p. 4). ACM. Rouson, D., McCreight, J. L., & Fanfarillo, A. (2017, November). Incremental caffeination of a terrestrial hydrological modeling framework using Fortran 2018 teams. In Proceedings of the Second Annual PGAS Applications Workshop (p. 6). ACM.
