In this deck from the Stanford HPC Conference, Damian Rouson from the Sourcery Institute presents: First Experiences with Parallel Application Development in Fortran 2018.
"The Fortran standards body recently voted to adopt the informal name "Fortran 2018" for the standard that is expected to be submitted for publication this year and was previously known informally as Fortran 2015. At the 2017 HPC Advisory Council meeting at Stanford, we outlined several challenges facing HPC as we approach the exascale era and several Fortran 2018 features that address those challenges. Since then, we have published two brief papers on our first applications of these features in weather models developed at the National Center for Atmospheric Research (NCAR). One is a coarray Fortran mini-app developed to capture the dominant algorithms of NCAR’s Intermediate Complexity Atmospheric Research model. This talk will present performance and scalability results of the mini-app running on several platforms using up to 98,000 cores. A second application involves the use of teams of images (processes) that execute indecently for ensembles of computational hydrology simulations using WRF-Hyrdro, the hydrological component of the Weather Research Forecasting model also developed at NCAR. Early experiences with portability and programmability of Fortran 2018 will also be discussed."
Watch the video: https://youtu.be/01-ez4v4YPc
Learn more: http://www.sourceryinstitute.org/ and http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
4. Exascale Challenges & Fortran 2018 Response
Billion-way concurrency with high levels of on-chip parallelism
— events
— collective subroutines
— richer set of atomic subroutines
— teams
Expensive data movement
— one-sided communication
— teams (locality control)
Higher failure rates
— failed-image detection
Heterogeneous on-processor hardware
— events
Source: Ang, J.A., Barrett, R.F., Benner, R.E., Burke, D., Chan, C., Cook, J., Donofrio, D., Hammond, S.D., Hemmert, K.S., Kelly, S.M., and Le, H. (2014, November). Abstract machine models and proxy architectures for exascale computing. In Hardware-Software Co-Design for High Performance Computing (Co-HPC), 2014 (pp. 25-32). IEEE.
5.–11. Single Program Multiple Data (SPMD)
[Figure, built up over slides 5–11: Image 1, Image 2, …, Image N run as images {processes | threads}; each image executes its own copy of the program asynchronously until it reaches a "sync all" statement.]
Images execute asynchronously up to a programmer-specified synchronization:
sync all
sync images
allocate/deallocate
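As a minimal sketch of this execution model (the program and variable names below are illustrative, not from the talk), each image works on its own data asynchronously and only pauses at the programmer-specified barrier:

  program spmd_sketch
    ! Each image runs the same program on its own data (SPMD). Work before
    ! "sync all" proceeds asynchronously; the barrier orders the segments so
    ! that image 1 may safely read the other images' results afterwards.
    implicit none
    integer :: my_result[*]   ! scalar coarray: one copy per image
    integer :: i

    my_result = this_image()**2   ! purely local, asynchronous work

    sync all                      ! programmer-specified synchronization

    if (this_image() == 1) then
      do i = 1, num_images()
        print *, 'image', i, 'computed', my_result[i]   ! remote reads after the barrier
      end do
    end if
  end program spmd_sketch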
13.–14. Partitioned Global Address Space (PGAS)
Coarrays communicate objects, Fortran 90 array syntax works on local data, and coarrays integrate with other language features. The ability to drop the square brackets harbors two important implications:
— Easily determine where communication occurs.
— Parallelize legacy code with minimal revisions.
20.–21. [Figure: images 1–3 each hold a coarray a; image 1 copies an element from image 3 to image 2 with one statement.]
if (me==1) a(5)[2] = a(5)[3]
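In context, that statement might appear in a complete program like the following sketch (illustrative, not from the deck); only the bracketed references communicate, while bracket-free references operate on local data with ordinary Fortran 90 array syntax:

  program pgas_sketch
    implicit none
    real :: a(5)[*]        ! every image holds a local a(1:5)
    integer :: me

    me = this_image()
    a = real(me)           ! no brackets: local work, plain Fortran 90 array syntax

    sync all
    if (me == 1) a(5)[2] = a(5)[3]   ! brackets: one-sided copy from image 3 to image 2
    sync all
  end program pgas_sketch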
22. Segment Ordering: Events
An intrinsic module provides the derived type event_type, which encapsulates an atomic_int_kind integer component default-initialized to zero. An image increments the event count on a remote image by executing event post. The remote image obtains the post count by executing event_query.

Statement      Image Control   Side Effect
event post     x               atomic_add 1
event_query                    defines count
event wait     x               atomic_add -1
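A hedged sketch of these statements (the event name is illustrative): image 2 posts to an event on image 1, which queries and then waits, decrementing the count:

  program event_sketch
    use iso_fortran_env, only : event_type
    implicit none
    type(event_type) :: greeting_ready[*]
    integer :: count

    if (this_image() == 2) then
      event post (greeting_ready[1])           ! atomic_add 1 to the count on image 1
    else if (this_image() == 1) then
      call event_query (greeting_ready, count) ! local, non-blocking: defines count
      event wait (greeting_ready)              ! blocks until count > 0, then atomic_add -1
    end if
  end program event_sketch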
32.–33. Events
Performance-oriented constraints:
— Query and wait must be local.
— Post and wait are disallowed in do concurrent constructs.
[Figure: a "Hello, world!" exchange using the event coarrays greeting_ready(2:n)[1] and ok_to_overwrite[3]; images post, query, and wait on these events.]
Pro tips:
— Overlap communication and computation.
— Wherever safety permits, query without waiting.
— Write a spin-query-work loop & build a logical mask describing the remaining work (sketched below).
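The following is a hedged sketch of that spin-query-work pattern; n, the event name greeting_ready, and the work itself are illustrative, and the program assumes at least n images:

  program spin_query_work
    use iso_fortran_env, only : event_type
    implicit none
    integer, parameter :: n = 4
    type(event_type) :: greeting_ready(2:n)[*]
    logical :: done(2:n)
    integer :: i, count

    if (this_image() >= 2 .and. this_image() <= n) then
      ! ... produce this image's contribution, then announce it to image 1 ...
      event post (greeting_ready(this_image())[1])
    else if (this_image() == 1) then
      done = .false.
      do while (.not. all(done))                       ! spin until all work is handled
        do i = 2, n
          if (done(i)) cycle
          call event_query (greeting_ready(i), count)  ! query without waiting
          if (count > 0) then
            event wait (greeting_ready(i))             ! consume the post
            ! ... process image i's contribution; overlaps with later arrivals ...
            done(i) = .true.
          end if
        end do
      end do
    end if
  end program spin_query_work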
34. Teams
A team is a set of images that can readily execute independently of other images.
[Figure: six images, each holding a coarray a(1:4), partitioned into Team 1 and Team 2.]
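A minimal sketch of forming and entering a team (the team variable and split rule are illustrative):

  program team_sketch
    use iso_fortran_env, only : team_type
    implicit none
    type(team_type) :: half
    integer :: my_team_number

    ! Split the current team in two; each new team executes independently.
    my_team_number = merge(1, 2, this_image() <= num_images()/2)
    form team (my_team_number, half)

    change team (half)
      ! Inside this construct, this_image(), num_images(), cosubscripts, and
      ! "sync all" refer only to the images of the new, smaller team.
      sync all
    end team
  end program team_sketch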
35. Collective Subroutines
— Each non-failed image of the current team must invoke the collective.
— All collectives have an intent(inout) argument A holding the input data; A may hold the result on return, depending on result_image.
— Optional arguments: stat, errmsg, result_image, source_image.
— After the parallel calculation/communication, the result is placed on one or all images.
— No implicit synchronization at the beginning or end, which allows overlap with other actions.
— No image's data is accessed before that image invokes the collective subroutine.
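For example, a sketch using co_sum and co_broadcast (array sizes and argument values are illustrative):

  program collective_sketch
    implicit none
    real :: a(4)
    integer :: status
    character(len=128) :: msg

    a = real(this_image())

    ! Reduce a across all images of the current team; only image 1 needs the result.
    call co_sum (a, result_image=1, stat=status, errmsg=msg)

    ! Replicate image 1's data onto every image of the current team.
    call co_broadcast (a, source_image=1)
  end program collective_sketch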
41. Fortran 2018 Failed-Image Detection
— FAIL IMAGE (simulates a failure)
— IMAGE_STATUS (checks the status of a specific image)
— FAILED_IMAGES (provides the list of failed images)
Coarray operations may fail. The STAT= specifier is used to check for correct behavior.
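A hedged sketch of how these features combine; it requires a compiler and runtime with failed-image support, and the failure scenario is illustrative:

  program failure_sketch
    use iso_fortran_env, only : stat_failed_image
    implicit none
    integer :: status
    integer, allocatable :: down(:)

    if (this_image() == 2) fail image    ! simulate a failure on image 2

    sync all (stat=status)               ! survivors receive a status instead of hanging
    if (status == stat_failed_image) then
      down = failed_images()             ! list of failed image indices
      print *, 'image', this_image(), 'sees failed images:', down
      print *, 'status of image 2:', image_status(2)
    end if
  end program failure_sketch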
43. The Climate Downscaling Problem
[Figure: precipitation and topography from a climate model (top row) versus application needs (bottom row).]
Computational cost of a high-resolution regional model: >10 billion core hours (CONUS, 4 km, 150 yrs, 40 scenarios)
44. The Intermediate Complexity Atmospheric Research Model (ICAR)
[Figure: ICAR wind field over topography; analytical solution for flow over topography (right).]
90% of the information for 1% of the computational cost.
Core numerics:
— 90% of cost = cloud physics (grid cells independent)
— 5% of cost = advection (fully explicit, requires local neighbor communication)
Gutmann et al. (2016), J. of Hydrometeorology, 17, p. 957. doi:10.1175/JHM-D-15-0155.1
45. ICAR: Intermediate Complexity Atmospheric Research
[Animation: water vapor (blue) and precipitation (green to red) over the contiguous United States.]
Output timestep: 1 hr
Integration timestep: variable (~30 s)
Boundary conditions: ERA-Interim (historical)
Run on 16 cores with OpenMP; limited scalability.
47. ICAR comparison to "full-physics" atmospheric model (WRF)
Ideal hill case: ICAR (red) and WRF (blue) precipitation over an idealized hill (green).
Real simulation: WRF (left) and ICAR (right) season-total precipitation over the Colorado Rockies.
48. ICAR Users and Applications
http://github.com/NCAR/icar.git
49. Coarray ICAR Mini-App
— Object-oriented design
— Overlaps communication & computation via one-sided "puts"
— Coarray halo exchanges with 3D, 1st-order upwind advection (~2,000 lines of new code)
— Collective broadcast of initial data
— Cloud microphysics (~5,000 lines of pre-existing code)
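The halo exchange at the heart of the advection step can be sketched as follows. This is not the actual ICAR code: the one-dimensional decomposition, array name, and halo width are assumptions for illustration:

  module halo_exchange_sketch
    implicit none
  contains
    subroutine exchange_halo (q, nx, ny, halo)
      ! One-sided "puts": each image writes its edge columns into its
      ! neighbors' halo cells, then synchronizes only with those neighbors,
      ! so interior computation can overlap with the communication.
      integer, intent(in) :: nx, ny, halo
      real, intent(inout) :: q(1-halo:nx+halo, ny)[*]
      integer :: me, left, right

      me = this_image();  left = me - 1;  right = me + 1

      if (left >= 1)             q(nx+1:nx+halo, :)[left] = q(1:halo, :)        ! put west edge
      if (right <= num_images()) q(1-halo:0, :)[right]    = q(nx-halo+1:nx, :)  ! put east edge

      sync images (pack([left, right], [left >= 1, right <= num_images()]))
    end subroutine exchange_halo
  end module halo_exchange_sketch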
51. WRF-Hydro: Weather Research and Forecasting (WRF) Hydrological Model
Currently, ensemble runs involve redundant initialization calculations that occupy approximately 30% of the runtime and use only the processes allocated to one ensemble member. Our use of Fortran 2018 teams will eliminate the redundancy by amortizing that effort across all of the processes allocated for the full ensemble of simulations.
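A hedged sketch of that plan (not WRF-Hydro code; the ensemble size, data, and split rule are placeholders): perform the shared initialization once in the initial team, broadcast it, then form one team per ensemble member:

  program ensemble_sketch
    use iso_fortran_env, only : team_type
    implicit none
    integer, parameter :: n_members = 4   ! illustrative ensemble size
    type(team_type) :: member
    real :: shared_config(1000)           ! stands in for the shared initialization data
    integer :: my_member

    ! One-time initialization, done once and shared, instead of once per member:
    if (this_image() == 1) shared_config = 0.0   ! placeholder for reading/computing shared data
    call co_broadcast (shared_config, source_image=1)

    ! Split into one team per ensemble member; each team runs independently.
    my_member = mod(this_image() - 1, n_members) + 1
    form team (my_member, member)
    change team (member)
      ! ... run this ensemble member using only this team's images ...
      sync all
    end team
  end program ensemble_sketch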
53. Mapping Fortran Teams onto MPI
[Figure: a timeline of three images in which the Fortran statements form team(…) and end team and the intrinsic team_number(…) are aligned with the analogous MPI procedures and variables: MPI_Comm_Split(…), MPI_Barrier(…), and the "color" argument.]
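On the Fortran side, the mapping in that figure corresponds roughly to the following sketch (names and split rule are illustrative); the comments note the analogous MPI calls:

  program team_mpi_mapping
    use iso_fortran_env, only : team_type
    implicit none
    type(team_type) :: subteam
    integer :: color

    color = mod(this_image() - 1, 3) + 1   ! plays the role of the MPI "color"

    form team (color, subteam)             ! analogous to MPI_Comm_split on the parent communicator

    change team (subteam)
      sync all                             ! analogous to MPI_Barrier on the new communicator
    end team
  end program team_mpi_mapping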
55.–56. Communicator Hand-off (CAF to MPI)
CAF in the driver seat:
— Language extension exposes the MPI communicator.
— CAF compiler embeds MPI_Init and MPI_Finalize.
MPI under the hood:
— Only trivial modification of the MPI application.
— Future work: amortize & speed up initialization.
58. Coarray ICAR Simulation Time
[Figure: simulation time vs. number of processes for 500 x 500 x 20 and 2000 x 2000 x 20 grids, comparing OpenSHMEM vs. MPI and puts vs. gets.]
GCC 6.3 on the NCAR "Cheyenne" SGI ICE XA cluster: 4032 nodes, 2 x 18-core Xeon Broadwell @ 2.3 GHz.
59. Coarray ICAR Speedup
[Figure: speedup vs. number of processes for 500 x 500 x 20 and 2000 x 2000 x 20 grids.]
GCC 6.3 on the NCAR "Cheyenne" SGI ICE XA cluster: 4032 nodes, 2 x 18-core Xeon Broadwell @ 2.3 GHz.
60. Coarray ICAR: Xeon vs. KNL
[Figure: performance vs. number of processes on Xeon and KNL.]
Compilers & platforms: GNU on Cheyenne; Cray compiler on Edison (Broadwell) and Cori (KNL).
62. Conclusions
— Fortran 2018 is a PGAS language supporting SPMD programming at scale.
— High productivity pays off: from shared-memory parallelism to 100,000 cores in ~100 person-hours.
— Programming-model agnosticism is a life-saver.
— CAF interoperates seamlessly with MPI.
63. Acknowledgements & References
Ethan Gutmann, Alessandro Fanfarillo, James McCreight, Brian Friesen
Rouson, D., Gutmann, E. D., Fanfarillo, A., & Friesen, B. (2017, November). Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran. In Proceedings of the Second Annual PGAS Applications Workshop (p. 4). ACM.
Rouson, D., McCreight, J. L., & Fanfarillo, A. (2017, November). Incremental caffeination of a terrestrial hydrological modeling framework using Fortran 2018 teams. In Proceedings of the Second Annual PGAS Applications Workshop (p. 6). ACM.