Reflecting on the Goal and Baseline of Exascale Computing

In this deck from the Blue Waters Summit, Thomas Schulthess from CSCS presents: Reflecting on the Goal and Baseline of Exascale Computing.

"Application performance is given much emphasis in discussions of exascale computing. A 50-fold increase in sustained performance over today's applications running on multi-petaflops supercomputing platforms should be the expected target for exascale systems deployed early next decade. In the present talk, we will reflect on what this means in practice and how much these exascale systems will advance the state of the art. Experience with today's platforms show that there can be an order of magnitude difference in performance within a given class of numerical methods, depending only on choice of architecture and implementation. This bears the questions on what our baseline is, over which the performance improvements of exascale systems will be measured. Furthermore, how close will these exascale systems bring us to deliver on scientific goals, such as convection resolving global climate simulations or throughput oriented computations for meteorology?"

Watch the video: https://wp.me/p3RLHQ-jbr

Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter

Reflecting on the Goal and Baseline of Exascale Computing

  1. Thomas C. Schulthess: Reflecting on the Goal and Baseline of Exascale Computing
  2-11. Tracking supercomputer performance over time? The Linpack benchmark solves Ax = b and shows a 1,000-fold performance improvement per decade. Application milestones on the chart: first application at > 1 TFLOP/s sustained; first application at > 1 PFLOP/s sustained; KKR-CPA (MST), LSMS (MST), WL-LSMS (MST). The 1,000x performance improvement per decade seems to hold for multiple-scattering-theory (MST) based electronic structure codes for materials science.
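
A 1,000x-per-decade trend corresponds to roughly a doubling of performance every year; a quick sketch of that arithmetic (illustrative numbers only, not from the deck):

    # Illustrative only: what a 1,000x-per-decade trend implies per year.
    decade_factor = 1000.0
    yearly_factor = decade_factor ** (1.0 / 10.0)  # ~2, i.e. roughly a doubling per year
    print(f"1,000x per decade is about {yearly_factor:.2f}x per year")
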
  12. “Only” 100-fold performance improvement in climate codes. Source: Peter Bauer, ECMWF
  13. Has the efficiency of weather & climate codes dropped 10-fold every decade?
  14-21. Floating-point efficiency dropped from 50% on the Cray Y-MP to 5% on today's Cray XC (10x in two decades). Machines on the chart: Cray Y-MP @ 300 kW, IBM P5 @ 400 kW, IBM P6 @ 1.3 MW, Cray XT5 @ 1.8 MW, Cray XT5 @ 7 MW; application milestones KKR-CPA (MST), LSMS (MST), WL-LSMS (MST). System size (in energy footprint) grew much faster on “Top500” systems. Source: Peter Bauer, ECMWF
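
Read together with the previous slides, these numbers are roughly self-consistent: if Linpack-class peak performance grows ~1,000x per decade but climate codes gain only ~100x, the implied efficiency loss would be 10x per decade, whereas the measured floating-point efficiency fell only 10x over two decades; the remainder is what the slide attributes to the much faster growth in system size (energy footprint) of Top500-class machines. A small sketch of that arithmetic, using only the figures quoted on the slides:

    # Consistency check using only numbers quoted on slides 12-21 (illustrative).
    linpack_growth_per_decade = 1000.0   # Top500/Linpack trend (slides 2-11)
    climate_growth_per_decade = 100.0    # "only" 100x for climate codes (slide 12)
    implied_loss = linpack_growth_per_decade / climate_growth_per_decade
    print(f"Implied efficiency loss if peak grew like Linpack: {implied_loss:.0f}x per decade")

    measured_loss_two_decades = 0.50 / 0.05   # 50% on the Cray Y-MP -> 5% on the Cray XC
    print(f"Measured floating-point efficiency loss: {measured_loss_two_decades:.0f}x over two decades")
    # The rest of the gap is attributed to system size, e.g. 300 kW (Cray Y-MP) vs. 7 MW (Cray XT5).
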
  22-24. Source: Christoph Schär, ETH Zurich, and Nils Wedi, ECMWF. Can the delivery of a 1 km-scale capability be pulled in by a decade?
  25-26. Leadership in weather and climate: the European model may be the best, but it is far from sufficient accuracy and reliability! Peter Bauer, ECMWF
  27. The impact of resolution: simulated tropical cyclones at 130 km, 60 km and 25 km versus observations (HADGEM3, PRACE UPSCALE; P.L. Vidale, NCAS, and M. Roberts, MO/HC)
  28. Resolving convective clouds (convergence?). Bulk convergence: area-averaged bulk effects upon the ambient flow, e.g. heating and moistening of the cloud layer. Structural convergence: statistics of the cloud ensemble, e.g. spacing and size of convective clouds. Source: Christoph Schär, ETH Zurich
  29. Structural and bulk convergence at grid spacings of 8 km, 4 km, 2 km, 1 km and 500 m. [Figure: relative-frequency histograms of cloud area (km^2) and of convective mass flux (kg m^-2 s^-1); grid-scale clouds 71/64/54/47/43%.] Statistics of cloud area: no structural convergence. Statistics of up- and downdrafts: bulk statistics of updrafts converge. Factor 4 (Panosetti et al. 2018). Source: Christoph Schär, ETH Zurich
  30. What resolution is needed?
     • There are threshold scales in the atmosphere and ocean: going from 100 km to 10 km is incremental; going from 10 km to 1 km is a leap. At 1 km:
       • it is no longer necessary to parametrise precipitating convection, ocean eddies, or orographic wave drag and its effect on extratropical storms;
       • ocean bathymetry, overflows and mixing, as well as regional orographic circulation in the atmosphere, become resolved;
       • the connection between the remaining parametrisations is now on a physical footing.
     • We have spent the last five decades in a paradigm of incremental advances, in which the resolution of models improved incrementally from 200 km to 20 km.
     • Exascale allows us to make the leap to 1 km. This fundamentally changes the structure of our models: we move from crude parametric representations to an explicit, physics-based description of essential processes.
     • The last such step change was fifty years ago, when, in the late 1960s, climate scientists first introduced global climate models, which were distinguished by their ability to explicitly represent extratropical storms, ocean gyres and boundary currents.
     Bjorn Stevens, MPI-M
  31-32. Simulation throughput in simulated years per day (SYPD). Simulation length: NWP 10 d, climate in production 100 y, climate spin-up 5,000 y. Desired wall-clock time: 0.1 d, 0.1 y, 0.5 y. Ratio: 100, 1,000, 10,000. SYPD: 0.27, 2.7, 27. Minimal throughput 1 SYPD, preferred 5 SYPD.
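
The SYPD row follows directly from the ratio of simulated time to wall-clock time; a minimal sketch of that conversion, using the values from the table above:

    # Simulated years per wall-clock day, from the slide's simulation lengths
    # and desired wall-clock times.
    def sypd(simulated_days, wallclock_days):
        return (simulated_days / 365.0) / wallclock_days

    print(round(sypd(10, 0.1), 2))                 # NWP: 10 d in 0.1 d -> ~0.27 SYPD
    print(round(sypd(100 * 365, 0.1 * 365), 1))    # production: 100 y in 0.1 y -> ~2.7 SYPD
    print(round(sypd(5000 * 365, 0.5 * 365), 1))   # spin-up: 5,000 y in 0.5 y -> ~27 SYPD
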
  33. Summary of intermediate goal (reach by 2021?): horizontal resolution 1 km (globally quasi-uniform); vertical resolution 180 levels (surface to ~100 km); time resolution less than 1 minute; coupled land-surface/ocean/ocean-waves/sea-ice; non-hydrostatic atmosphere; single (32-bit) or mixed precision; compute rate 1 SYPD (simulated years per wall-clock day).
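
To make the compute-rate target concrete: at a time resolution of under one minute, 1 SYPD means advancing millions of time steps per simulated year at several steps per wall-clock second. A small sketch, assuming 60 s and 6 s step sizes (the latter is the COSMO baseline step quoted later in the deck):

    # Steps per simulated year and required wall-clock stepping rate for 1 SYPD.
    SECONDS_PER_YEAR = 365 * 24 * 3600
    SECONDS_PER_DAY = 24 * 3600

    for dt in (60.0, 6.0):  # time step in seconds (assumed values)
        steps_per_sim_year = SECONDS_PER_YEAR / dt
        steps_per_wallclock_second = steps_per_sim_year / SECONDS_PER_DAY
        print(f"dt = {dt:4.0f} s: {steps_per_sim_year:.2e} steps per simulated year, "
              f"~{steps_per_wallclock_second:.0f} steps per wall-clock second for 1 SYPD")
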
  34. Running COSMO 5.0 at global scale on Piz Daint. Scaling to full system size: ~5,300 GPU-accelerated nodes available. Running a near-global (±80°, covering 97% of Earth's surface) COSMO 5.0 simulation and IFS, either on the host processors (Intel Xeon E5-2690 v3, Haswell, 12 cores) or on the GPU accelerators (PCIe version of the NVIDIA GP100 (Pascal) GPU).
  35. September 15, 2015: “Today's Outlook: GPU-accelerated Weather Forecasting” by John Russell. “Piz Kesch”
  36-46. Where the factor-40 improvement came from. Requirements from MeteoSwiss, at a constant budget for investments and operations: a finer grid (2.2 km → 1.1 km), ensembles with multiple forecasts, and data assimilation; the chart marks factors of 6x, 10x and 24x. We needed a 40x improvement between 2012 and 2015 at constant cost. Investment in software allowed mathematical improvements and a change in architecture: 1.7x from software refactoring (old vs. new implementation on x86); 2.8x from mathematical improvements (resource utilisation, precision); 2.3x from the change in architecture (CPU → GPU); 2.8x from Moore's law and architectural improvements on x86; 1.3x from additional processors. Bonus: a reduction in power! There is no silver bullet.
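
The supply-side contributions listed on this slide multiply to approximately the required factor 40; a quick check with the quoted numbers:

    # The contributions quoted on the slide multiply to roughly the required
    # factor 40 (the small mismatch comes from rounding of the individual factors).
    factors = {
        "software refactoring (x86, old vs. new)": 1.7,
        "mathematical improvements": 2.8,
        "architecture change (CPU -> GPU)": 2.3,
        "Moore's law & x86 architecture improvements": 2.8,
        "additional processors": 1.3,
    }
    total = 1.0
    for name, factor in factors.items():
        total *= factor
    print(f"combined improvement: ~{total:.0f}x")  # ~40x
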
  47-49. Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4,888 GPUs with COSMO 5.0. Metric: simulated years per wall-clock day (SYPD). [Figure: weak and strong scalability on Piz Daint, P100 GPUs (filled symbols) vs. Haswell CPUs with 12 MPI ranks per node (empty symbols), for Δx = 19 km, 3.7 km, 1.9 km and 930 m.] Time compression and energy cost for three moist simulations: Δx = 930 m on 4,888 nodes, Δt = 6 s, 0.043 SYPD, 596 MWh/SY, 3.46×10^10 grid points (full 10-day simulation); Δx = 1.9 km on 4,888 nodes, Δt = 12 s, 0.23 SYPD, 97.8 MWh/SY, 8.64×10^9 grid points (from 1,000 steps); Δx = 47 km on 18 nodes, Δt = 300 s, 9.6 SYPD, 0.099 MWh/SY, 1.39×10^7 grid points (from 100 steps). This is 2.5x faster than Yang et al.'s 2016 Gordon Bell winning run on TaihuLight! Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, 2018; SC'17, November 2017, Denver, CO.
  50. The baseline for COSMO-global and IFS (value and shortfall per requirement; the per-row shortfalls multiply to roughly the totals, see the check after this item).
     • Near-global COSMO [Fuh2018]: horizontal resolution 0.93 km, non-uniform (0.9x); vertical resolution 60 levels, surface to 25 km (3x); time resolution 6 s, split-explicit with sub-stepping; not coupled (1.2x); non-hydrostatic atmosphere; double precision (0.6x); compute rate 0.043 SYPD (23x); other: limited I/O, only microphysics (1.5x). Total shortfall: 65x.
     • Global IFS [Wed2009]: horizontal resolution 1.25 km (1.25x); vertical resolution 62 levels, surface to 40 km (3x); time resolution 120 s, semi-implicit (4x); not coupled (1.2x); non-hydrostatic atmosphere; single precision; compute rate 0.088 SYPD (11x); other: full physics, no I/O. Total shortfall: 198x.
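
As a sanity check, the per-row shortfall factors from the table multiply to roughly the quoted totals (the small difference for COSMO comes from rounding of the individual factors):

    from math import prod

    # Shortfall factors from the table, in row order: horizontal, vertical,
    # coupling, precision, compute rate, other (COSMO); horizontal, vertical,
    # time step, coupling, compute rate (IFS).
    cosmo = [0.9, 3, 1.2, 0.6, 23, 1.5]
    ifs = [1.25, 3, 4, 1.2, 11]
    print(f"COSMO-global total shortfall: ~{prod(cosmo):.0f}x (quoted: 65x)")
    print(f"IFS total shortfall: ~{prod(ifs):.0f}x (quoted: 198x)")
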
  51-60. Memory use efficiency: MUE = I/O efficiency × bandwidth efficiency = (Q / D) × (B / B̂), where Q is the necessary data transfer, D the actual data transfer, B the achieved memory bandwidth and B̂ the maximum achievable bandwidth. For the COSMO baseline run: I/O efficiency Q/D = 0.88, bandwidth efficiency B/B̂ = 0.76, giving MUE = 0.67. [Figure: measured memory bandwidth (GB/s) vs. data size (MB) on the P100 for GPU STREAM, COPY in double and float (a[i] = b[i]), AVG i-stride in float (a[i] = b[i-1] + b[i+1]) and a 5-point stencil in float (a[i] = b[i] + b[i+1] + b[i-1] + b[i+jstride] + b[i-jstride]); the maximum achievable bandwidth for such access patterns is about 2x lower than the peak bandwidth.] Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, 2018
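
A trivial sketch of the MUE calculation with the values quoted on the slide (the underlying quantities Q, D, B and B̂ are the measurements reported in Fuhrer et al. 2018):

    # Memory use efficiency, MUE = (Q / D) * (B / B_hat):
    #   Q = necessary data transfers, D = actual data transfers,
    #   B = achieved bandwidth, B_hat = maximum achievable bandwidth.
    io_efficiency = 0.88  # Q / D, as quoted on the slide
    bw_efficiency = 0.76  # B / B_hat, as quoted on the slide
    print(f"MUE = {io_efficiency * bw_efficiency:.2f}")  # ~0.67
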
  61-68. How realistic is it to overcome the 65-fold shortfall of a grid-based implementation like COSMO-global? (The factors below multiply out as shown in the sketch after this item.)
     1. Icosahedral grid (ICON) vs. lat-long/Cartesian grid (COSMO): 2x fewer grid columns, and a time step of 10 ms instead of 5 ms: 4x.
     2. Improving bandwidth efficiency: improve bandwidth efficiency and peak bandwidth by 2x (results on Volta show this is realistic): 2x.
     3. Weak scaling: 4x is possible in COSMO, but we reduced the available parallelism by a factor 2: 2x.
     4. Remaining reduction in shortfall, from numerical algorithms (larger time steps) and further improved processors/memory: 4x.
     But we don't want to increase the footprint of the 2021 system beyond “Piz Daint”.
  69-70. The importance of ensembles. Peter Bauer, ECMWF
  71. Remaining goals beyond 2021 (by 2024?): 1. improve the throughput to 5 SYPD; 2. reduce the footprint of a single simulation by up to a factor 10.
  72. Data challenge: what if all models run at 1 km scale? With the current workflow used by the climate community, an IPCC exercise at 1 km scale would on average produce 50 exabytes of data! The only way out is to analyse the model while the simulation is running. New workflow: 1. a first set of runs generates model trajectories and checkpoints; 2. model trajectories are reconstructed from the checkpoints and analysed (as often as necessary).
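
A minimal sketch of the two-phase checkpoint-and-replay workflow described here; the function names (run_model, analyse) are hypothetical placeholders, not an actual model or I/O API:

    # Phase 1: run once, storing sparse checkpoints instead of full output.
    # Phase 2: replay trajectory segments from checkpoints and analyse them
    #          on the fly, as often as necessary.
    def run_model(state, n_steps):
        """Hypothetical stand-in for advancing the model by n_steps."""
        return {"step": state["step"] + n_steps}

    def analyse(state):
        """Hypothetical stand-in for in-situ analysis of a trajectory segment."""
        print(f"analysing trajectory segment ending at step {state['step']}")

    checkpoints = []
    state = {"step": 0}
    for _ in range(4):  # 4 segments, purely illustrative
        state = run_model(state, n_steps=1000)
        checkpoints.append(state)

    for checkpoint in checkpoints:  # replay and analyse; repeatable at will
        segment_end = run_model(checkpoint, n_steps=1000)
        analyse(segment_end)
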
  73-77. Summary and conclusions.
     • While flop/s may be good for comparing performance across the history of computing, it is not a good metric for designing the systems of the future.
     • Given today's challenges, use memory use efficiency (MUE) instead: MUE = “I/O efficiency” × “bandwidth efficiency”.
     • Convection-resolving weather and climate simulations at 1 km horizontal resolution:
       • represent a big leap in the quality of weather and climate simulations;
       • the aim is to pull the milestone in by a decade, to the early 2020s rather than the 2030s;
       • desired throughput of 1 SYPD by 2021 and 5 SYPD by the mid-2020s;
       • a 50-200x performance improvement is needed to reach the 2021 goal of 1 SYPD.
     • System improvements needed: improve “bandwidth efficiency” for regular but non-stride-1 memory access.
     • Application software improvements needed: further improve scalability; improve algorithms (time integration has potential).
  78. Collaborators: Tim Palmer (U. of Oxford), Christoph Schär (ETH Zurich), Oliver Fuhrer (MeteoSwiss), Peter Bauer (ECMWF), Bjorn Stevens (MPI-M), Torsten Hoefler (ETH Zurich), Nils Wedi (ECMWF)
