Do theoretical FLOPs matter for real application’s performance?
By Joshua.Mora@amd.com
Saudi Arabia HPC, KAUST, Thuwal, 2012
Abstract: The most intelligent answer to this question is “it depends on the application”. To prove it, we show a few examples from both a theoretical and a practical point of view. To validate it experimentally, a modified AMD processor code-named “Fangio” (AMD Opteron 6275 Processor) is used, whose floating point capability is limited to 2 FLOPs/clk per Bulldozer (BD) unit; it delivers less (-8% on average) but close to the performance of the AMD Opteron 6276 Processor, which has 4 times the floating point capability, i.e. 8 FLOPs/clk/BD unit.
The intention of this work is threefold: i) to demonstrate that the FLOPs/clk/core of a microprocessor architecture is not necessarily a good performance indicator, despite being heavily used by the industry (e.g. HPL); ii) to show that compiler code-vectorization technology is fundamental to extracting as much real application performance as possible, but still has a long way to go; iii) to note that it would not be fair to blame compiler technology exclusively: algorithms are often not designed and written in a way that lets compilers exploit vector instructions (i.e. SSE, AVX and FMA).
Agenda
• Concepts
   – Kinds of FLOPs/clk
   – Single Instruction Single Data , Single Instruction Multiple Data
• AMD Interlagos processor FPU
   – FPU, see Understanding Interlagos arch through HPL (HPC
     Advisory Council workshop, ISC 2012)
   – Roofline model
• AMD Fangio processor
   – FPU capping, roofline model
• Results/Conclusions within roofline model for Interlagos and
  Fangio.
   – Benchmarks: HPL, stream, CFD apps, SPEC fp benchmarks.


Concepts: kinds of FLOPs/clk
   • A brief list of floating point instructions and examples
     supported by AMD Interlagos
   • Scalar, Packed
   • SP: Single precision, DP: Double precision
Ins. Type   Examples                                           FLOPs/clk (per BD unit)   Reg. size
X87         FADD, FMUL                                         1                         32, 64 bits
SSE (SP)    Scalar: ADDSS, MULSS    Packed: ADDPS, MULPS       8                         128 bits
SSE2 (DP)   Scalar: ADDSD, MULSD    Packed: ADDPD, MULPD       4                         128 bits
AVX (SP)    Scalar: VADDSS, VMULSS  Packed: VADDPS, VMULPS     4, 8                      128, 256 bits
AVX (DP)    Scalar: VADDSD, VMULSD  Packed: VADDPD, VMULPD     2, 4                      128, 256 bits
FMA4 (SP)   Scalar: VFMADDSS        Packed: VFMADDPS           8, 16                     128, 256 bits
FMA4 (DP)   Scalar: VFMADDSD        Packed: VFMADDPD           4, 8                      128, 256 bits
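
To make the scalar vs. packed distinction concrete, here is a minimal sketch (not from the original deck) using SSE2 intrinsics; the scalar call maps to ADDSD (one DP result per instruction) and the packed call to ADDPD (two DP results per instruction). File and compiler flags are only examples.

    /* Minimal sketch of scalar vs. packed DP adds (ADDSD vs. ADDPD).
       Compile with e.g.: gcc -O2 -msse2 scalar_vs_packed.c */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void) {
        double a[2] = {1.0, 2.0}, b[2] = {10.0, 20.0}, r[2];

        /* Scalar: one double-precision result per instruction (ADDSD). */
        __m128d s = _mm_add_sd(_mm_set_sd(a[0]), _mm_set_sd(b[0]));
        _mm_store_sd(&r[0], s);
        printf("scalar add: %f\n", r[0]);

        /* Packed: two double-precision results per instruction (ADDPD);
           the data must first be packed into the 128-bit register. */
        __m128d p = _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
        _mm_storeu_pd(r, p);
        printf("packed add: %f  %f\n", r[0], r[1]);
        return 0;
    }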

SISD: Single Instruction Single Data
SIMD: Single Instruction Multiple Data
Current CPU cores can crunch up to 8 DP numbers at a time; GPU streaming cores crunch 2-4 DP numbers each, but there are several thousand streaming cores per GPU.

SISD: a single data input and result is processed each clock, stored in scalar format.
SIMD: streams of input data and results are stored in vectors (packed format) and processed with SSE, AVX and FMA instructions, several elements per clock; vector slots without useful data are bubbles (no work).

SIMD allows processing of more data, but the data needs to be formatted/packed to fit the vectors. THAT IS THE CHALLENGE.
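
As an illustration of that challenge (a sketch, not taken from the deck): the first loop below has independent iterations and can be packed into SSE/AVX registers by an optimizing compiler, while the second carries a dependency from one iteration to the next and stays scalar. File name and flags are examples only.

    /* Vectorization sketch: independent iterations can be packed (SIMD),
       a loop-carried dependency forces scalar (SISD) code.
       Try e.g.: gcc -O3 -ftree-vectorize -fopt-info-vec -c vec_demo.c */
    #include <stddef.h>

    /* Independent elements: the compiler can emit packed adds/multiplies here. */
    void axpy(double *restrict y, const double *restrict x, double a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Each iteration needs the previous result: compiled as scalar MULSD/ADDSD. */
    double recurrence(const double *x, double a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = a * s + x[i];
        return s;
    }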
A few slides follow from the presentation “Understanding Interlagos arch through HPL” (HPC Advisory Council workshop, ISC 2012); the slide images are not reproduced in this transcript.
Roofline for AMD Interlagos
System: 2P, 2.3 GHz, 16 compute units (32 cores), 1600 MHz DDR3.

Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3 GHz x 8 DP FLOP/clk/core-pair x 0.85 eff ≈ 250 DP GF/s
Very high arithmetic intensity, i.e. (FLOP/s) / (Byte/s):
          - use of the AMD Core Math Library (FMA4 instructions)
          - cache friendly
          - reuse of data
          - DGEMM is level-3 BLAS with arithmetic intensity of order N (the problem size)

Real GB/s: 72 GB/s (STREAM benchmark)
Low arithmetic intensity:
          - use of non-temporal stores (write-combining buffers instead of evicting data through L2 -> L3 -> RAM, to speed up the writes to RAM)
          - not cache friendly
          - no reuse of data
          - cores wait for data most of the time (low FLOP/clk despite using SSE2, FMA4)
          - STREAM is level-1 BLAS with arithmetic intensity of order 1 (independent of problem size)
AMD Fangio, FPU capping
• Fangio (model 6275) is the Interlagos 6276 processor with the FPU capped from 8 DP FLOP/clk to 2 DP FLOP/clk per compute unit, by slowing down the retirement of FPU instructions.
• It keeps the same instruction set architecture as Interlagos.
• System: 2P, 2.3 GHz, 16 compute units (32 cores), 1600 MHz DDR3.

Performance impact depends on the workload:
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3 GHz x 2 DP FLOP/clk/core-pair = 73.6 DP GF/s nominal peak;
HPL delivers roughly 75 DP GF/s (slightly above nominal peak thanks to boost; see the runs below).
Real GB/s: 72 GB/s (STREAM benchmark):
memory throughput performance is unchanged!
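
The roofline model below makes this expectation concrete: attainable GF/s = min(peak GF/s, arithmetic intensity x memory bandwidth). A small sketch using the deck’s numbers (nominal peaks, 72 GB/s STREAM bandwidth; the AI values for HPL and triad come from the summary table later in the deck). This is illustrative only, not code from the presentation.

    /* Roofline sketch: attainable GF/s = min(peak, AI * bandwidth). */
    #include <stdio.h>

    static double attainable(double peak_gfs, double bw_gbs, double ai) {
        double mem_limit = ai * bw_gbs;
        return mem_limit < peak_gfs ? mem_limit : peak_gfs;
    }

    int main(void) {
        double bw = 72.0;                   /* GB/s, same for both processors          */
        double peak6276 = 2 * 8 * 2.3 * 8;  /* 294.4 DP GF/s nominal (6276)            */
        double peak6275 = 2 * 8 * 2.3 * 2;  /*  73.6 DP GF/s nominal (6275, capped FPU)*/

        printf("HPL   (AI 10.6): 6276 %.0f GF/s vs Fangio %.0f GF/s -> large drop\n",
               attainable(peak6276, bw, 10.6), attainable(peak6275, bw, 10.6));
        printf("Triad (AI 0.08): 6276 %.1f GF/s vs Fangio %.1f GF/s -> no drop\n",
               attainable(peak6276, bw, 0.08), attainable(peak6275, bw, 0.08));
        return 0;
    }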
HPL runs to confirm FPU capping
2P 16cores @ 2.3GHz (6276 Interlagos)
==============================================================================
T/V                    N        NB P Q                 Time                Gflops
--------------------------------------------------------------------------------
WR01R2L4            86400 100 4 8                     1774.55             2.423e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=                                0.0022068 ...... PASSED
==============================================================================


2P 16cores @ 2.3GHz (6275 Interlagos) Fangio (about 3x longer run time)
==============================================================================
T/V                    N       NB P Q                 Time               Gflops
--------------------------------------------------------------------------------
WR01R2L4            86400     100 4 8               5494.39            7.826e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=                0.0022316 ...... PASSED
==============================================================================
Measured: 78.26 GF/s against a theoretical peak of 16 CU x 2 FLOP/clk/CU x 2.3 GHz = 73.6 GF/s,
i.e. 106% HPL efficiency (!), possible because the cores boost above the nominal 2.3 GHz.
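
As a cross-check (not in the original deck), the reported Gflops can be reproduced from N and the wall time, since HPL performs roughly 2/3*N^3 + 2*N^2 floating-point operations. A small sketch reproducing both runs above and the 106% efficiency figure:

    /* Cross-check of the HPL figures above. */
    #include <stdio.h>

    static double hpl_gflops(double n, double seconds) {
        return (2.0 / 3.0 * n * n * n + 2.0 * n * n) / seconds / 1e9;
    }

    int main(void) {
        double n = 86400.0;
        double peak_6276 = 2 * 8 * 2.3 * 8;  /* 294.4 GF/s nominal              */
        double peak_6275 = 2 * 8 * 2.3 * 2;  /*  73.6 GF/s nominal (capped FPU) */

        double g6276 = hpl_gflops(n, 1774.55);  /* ~242 GF/s, ~82% of nominal peak in this run */
        double g6275 = hpl_gflops(n, 5494.39);  /* ~78 GF/s, ~106% of nominal peak (boost)     */
        printf("6276: %.1f GF/s (%.0f%% of nominal peak)\n", g6276, 100 * g6276 / peak_6276);
        printf("6275: %.1f GF/s (%.0f%% of nominal peak)\n", g6275, 100 * g6275 / peak_6275);
        return 0;
    }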
STREAM runs on 6275 confirm no drop in memory throughput
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          73089.4045     0.0443     0.0438     0.0449
Scale:         68952.3038     0.0469     0.0464     0.0472
Add:           66289.3072     0.0729     0.0724     0.0734
Triad:         66301.0957     0.0730     0.0724     0.0734
-------------------------------------------------------------
Scale, Add and Triad perform double-precision FLOPs. Triad is the one plotted in the
roofline model since it has the most FLOPs per element: an add and a multiply, fused with FMA4.
#pragma omp parallel for
      for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];
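
The triad kernel above performs 2 FLOPs per element (a multiply and an add, fused by FMA4) and, as counted by STREAM, moves 24 bytes per element (read b[j] and c[j], write a[j]), giving an arithmetic intensity of 2/24 ≈ 0.08 FLOP/byte, the value used in the summary table on the next slide. Below is a self-contained sketch of the kernel (illustrative only, not the official STREAM source) that reports bandwidth, GF/s and AI; compile with e.g. gcc -O2 -fopenmp triad.c.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double scalar = 3.0;

        /* Parallel initialization so first-touch spreads pages across NUMA nodes. */
        #pragma omp parallel for
        for (long j = 0; j < N; j++) { a[j] = 0.0; b[j] = 1.0; c[j] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];
        double t = omp_get_wtime() - t0;

        double bytes = 3.0 * N * sizeof(double);  /* read b, read c, write a (STREAM counting) */
        double flops = 2.0 * N;                   /* one add + one multiply per element        */
        printf("Triad: %.1f GB/s, %.2f GF/s, AI = %.3f FLOP/byte (a[N/2]=%.1f)\n",
               bytes / t / 1e9, flops / t / 1e9, flops / bytes, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }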
Summary of measurements per node, plotted in the roofline model

Note: 2.3 GHz is the nominal frequency, not the effective frequency under boost.
HPL efficiency per compute unit: Interlagos 6276 (2 x 7.8) / (2.3 GHz x 8) = 85%; Fangio 6275 (2 x 2.3) / (2.3 GHz x 2) = 100% (!)

Workload       | Interlagos 6276                   | Interlagos 6275 “Fangio”
               | GB/s      DP GF/s      AI = F/B   | GB/s       DP GF/s      AI = F/B
HPL            | 6*4=24    7.8*32=250   10.6       | 1.8*4=6.8  2.3*32=75    11.7
STREAM TRIAD   | 17*4=68   0.5*32=16    0.08       | 17*4=68    0.5*32=16    0.08
OPENFOAM       | 15*4=60   0.8*32=25    0.41       | 14*4=56    0.7*32=22    0.39

• 1 compute node has 2 processors, with a total of 4 NUMA nodes and 32 cores in 16 compute units.
• 1 NUMA node has a total of 4 compute units.
• Memory bandwidth in GB/s is measured per NUMA node (hence the x4 factor in the table).
• Double-precision floating point throughput is measured per core (hence the x32 factor in the table).

Roofline for Interlagos and Fangio (measured and plotted)
Axes: GF/s on a log2 scale versus arithmetic intensity, (GF/s)/(GB/s), from 0.125 to 32 (log2 scale).

• Both processors have the same memory bandwidth, i.e. the same bandwidth slope.
• Compute ceilings: AMD Interlagos at 250 GF/s, AMD Fangio at 75 GF/s.
• HPL sits in the compute-bound region; L3 BLAS (e.g. DGEMM) benefits from vectorization (FMA, AVX, SSE): ~75% perf drop on Fangio.
• SPECfp: 3-20% perf drop, 8% on average.
• Sparse algebra such as CFD apps (OpenFOAM, FLUENT, STAR-CCM+, ...): ~6-8% perf drop.
• STREAM TRIAD sits on the bandwidth slope: 0% perf drop.
• The low-intensity region is dominated by data dependencies and scalar code, with no benefit from vectorization.
Performance impact on SPEC fp 2006 rate peak
(The original slide also includes a resource-utilization chart, not reproduced here.)

• SPEC website: www.spec.org
• Runs were done with the peak-flags configuration in order to make the best use of compiler technology.
• In this case the Open64 compiler was used.
• Runs were done with only 1 copy per Bulldozer compute unit, to allow each process/copy to fully utilize the available computing resources without constraints originating from the resources shared within the Bulldozer compute unit (e.g. L2 cache, FPU, instruction scheduler).



Performance impact on SPEC fp 2006 rate peak (cont.)

Benchmark  | Application area                  | Brief description                                                                   | % perf. drop
Bwaves     | Fluid Dynamics                    | 3D transonic transient laminar viscous flow.                                        | 0.09%
Gamess     | Quantum Chemistry                 | Self-consistent field calculations using the Restricted Hartree-Fock method.        | -10.51%
Milc       | Quantum Chromodynamics            | Gauge field generating program for lattice gauge theory with dynamical quarks.      | 0.10%
Zeusmp     | Fluid Dynamics                    | NCSA code, CFD simulation of astrophysical phenomena.                               | -7.47%
Gromacs    | Biochemistry / Molecular Dynamics | Newtonian equations of motion for hundreds to millions of particles. GPU candidate. | -32.17%
cactusADM  | General Relativity                | Solves the Einstein evolution equations using a staggered-leapfrog method.          | -2.01%
Leslie3d   | Fluid Dynamics                    | CFD, Large Eddy Simulation.                                                         | -0.44%
Namd       | Biology / Molecular Dynamics      | Large biomolecular systems; the test case has 92,224 atoms of apolipoprotein A-I. GPU candidate. | -24.23%
Benchmark  | Application area                  | Brief description                                                                   | % perf. drop
dealII     | Finite Element Analysis           | Adaptive finite elements and error estimation; Helmholtz-type equation.             | -9.09%
Soplex     | Linear Programming / Optimization | Simplex algorithm and sparse linear algebra; test cases include railroad planning and military airlift models. | 1.86%
Povray     | Image Ray-tracing                 | Image rendering; the test case is a 1280x1024 anti-aliased image of a landscape.    | -12.15%
Calculix   | Structural Mechanics              | Finite element code for linear and nonlinear 3D structural applications. GPU candidate. | -26.82%
GemsFDTD   | Computational Electromagnetics    | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method. | -0.67%
Tonto      | Quantum Chemistry                 | Molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data. | -14.43%
Lbm        | Fluid Dynamics                    | “Lattice-Boltzmann Method” to simulate incompressible fluids in 3D.                 | 0.58%
Wrf        | Weather                           | Weather modeling from scales of meters to thousands of kilometers.                  | -0.95%
Sphinx3    | Speech recognition                | Speech recognition system from Carnegie Mellon University.                          | -3.00%

AVERAGE REAL PERFORMANCE DROP WHEN THE THEORETICAL FLOPs ARE REDUCED BY 75%: -8.94%
Performance impact on CFD apps
• Most CFD apps with an Eulerian formulation use sparse linear algebra to represent the linearized Navier-Stokes equations on unstructured grids.
• The higher the order of the discretization scheme, the higher the arithmetic intensity.
• Data dependencies in both space and time prevent vectorization.
• Large datasets have low cache reuse.
• Cores spend most of the time waiting for new data to arrive in the caches.
• Once the data is in the caches, the floating point instructions are mostly scalar instead of packed.
• Compilers have a hard time finding opportunities to vectorize the loops (see the sparse matrix-vector sketch below).
• Loop unrolling and partial vectorization of independent data help very little, because the cores are waiting for that data anyway.
• Overall, low performance from a FLOP/s point of view.
• Therefore, capping the FPU in terms of FLOPs/clk barely impacts these applications’ performance.
• Theoretical FLOP/s is therefore not a good indicator of how applications such as CFD (and many more) will perform.
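
As an example, the sparse matrix-vector product at the heart of most implicit CFD solvers looks like the generic CSR sketch below (not taken from any of the codes named above). The indirect access x[col[k]] is a gather that defeats packing and keeps the loop memory bound: only ~2 FLOPs are performed per value, index and vector element fetched.

    /* Sketch of a CSR sparse matrix-vector product y = A*x. */
    #include <stddef.h>

    void spmv_csr(size_t nrows,
                  const size_t *row_ptr,   /* nrows+1 entries: start of each row */
                  const int *col,          /* column index per nonzero           */
                  const double *val,       /* nonzero values                     */
                  const double *x,
                  double *y) {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* indirect load: mostly scalar code */
            y[i] = sum;
        }
    }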
What should we do, moving forward?
• Multidisciplinary teams to work on:
   – Algorithm research and development, to make algorithms more hardware aware.
   – Software research and development, to implement the algorithms efficiently (e.g. communication avoidance, dynamic task scheduling, work stealing, locality, power awareness, resilience, ...).
   – Interaction between domain scientists and computer (HW+SW) scientists, to develop new formulations of the equations that yield algorithms better suited to new computer architectures.
   – Research and development on compiler and programming-language technology, to detect algorithm properties and exploit hardware features.
• Supercomputing datacenter institutions to work on:
   – Enabling science by proper exploitation of the computational resources.
   – Multidisciplinary teams educating scientists on how to use the resources.
   – Funding and measuring supercomputing investments in terms of the number and quality of scientific projects, not in terms of CPU utilization (e.g. CPU utilization isn’t CPU efficiency, just as theoretical FLOPs aren’t real application performance).
