Do theoretical FLOPs matter for real application’s performance?
By Joshua.Mora@amd.com
Saudi Arabia HPC, KAUST, Thuwal, 2012
Abstract: The most intelligent answer to this question is “it depends on the application”. To prove it, we show a few examples from both a theoretical and a practical point of view. To validate it experimentally, a modified AMD processor code-named “Fangio” (AMD Opteron 6275 Processor) is used, whose floating point capability is limited to 2 FLOPs/clk per Bulldozer (BD) unit; it delivers less (-8% on average) but close to the performance of the AMD Opteron 6276 Processor, which has 4 times the floating point capability, i.e. 8 FLOPs/clk/BD unit.
The intention of this work is threefold: i) to demonstrate that the FLOPs/clk/core of a microprocessor architecture is not necessarily a good performance indicator, despite being heavily used by the industry (e.g. HPL); ii) to show that compiler code-vectorization technology is fundamental to extracting as much real application performance as possible, but still has a long way to go; iii) to note that it would not be fair to blame compiler technology exclusively: algorithms are often not designed and written in a way that lets compilers exploit vector instructions (i.e. SSE, AVX and FMA).
Agenda
• Concepts
   – Kinds of FLOPs/clk
   – Single Instruction Single Data , Single Instruction Multiple Data
• AMD Interlagos processor FPU
   – FPU, see Understanding Interlagos arch through HPL (HPC
     Advisory Council workshop, ISC 2012)
   – Roofline model
• AMD Fangio processor
   – FPU capping, roofline model
• Results/Conclusions within roofline model for Interlagos and
  Fangio.
   – Benchmarks: HPL, stream, CFD apps, SPEC fp benchmarks.


Concepts: kinds of FLOPs/clk
   • A brief list of floating point instructions and examples
     supported by AMD Interlagos
   • Scalar, Packed
   • SP: Single precision, DP: Double precision
Ins. Type   Examples                                           FLOPs/clk (per BD unit)   Reg. size
X87         FADD, FMUL                                         1                         32, 64 bits
SSE (SP)    Scalar: ADDSS, MULSS    Packed: ADDPS, MULPS       8                         128 bits
SSE2 (DP)   Scalar: ADDSD, MULSD    Packed: ADDPD, MULPD       4                         128 bits
AVX (SP)    Scalar: VADDSS, VMULSS  Packed: VADDPS, VMULPS     4, 8                      128, 256 bits
AVX (DP)    Scalar: VADDSD, VMULSD  Packed: VADDPD, VMULPD     2, 4                      128, 256 bits
FMA4 (SP)   Scalar: VFMADDSS        Packed: VFMADDPS           8, 16                     128, 256 bits
FMA4 (DP)   Scalar: VFMADDSD        Packed: VFMADDPD           4, 8                      128, 256 bits
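
To make the scalar vs. packed distinction concrete, here is a minimal sketch (not from the original deck) using SSE2 intrinsics; the scalar call maps to ADDSD (one DP result per instruction) and the packed call to ADDPD (two DP results per instruction). File and compiler flags are only examples.

    /* Minimal sketch of scalar vs. packed DP adds (ADDSD vs. ADDPD).
       Compile with e.g.: gcc -O2 -msse2 scalar_vs_packed.c */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void) {
        double a[2] = {1.0, 2.0}, b[2] = {10.0, 20.0}, r[2];

        /* Scalar: one double-precision result per instruction (ADDSD). */
        __m128d s = _mm_add_sd(_mm_set_sd(a[0]), _mm_set_sd(b[0]));
        _mm_store_sd(&r[0], s);
        printf("scalar add: %f\n", r[0]);

        /* Packed: two double-precision results per instruction (ADDPD);
           the data must first be packed into the 128-bit register. */
        __m128d p = _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
        _mm_storeu_pd(r, p);
        printf("packed add: %f  %f\n", r[0], r[1]);
        return 0;
    }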

SISD: Single Instruction Single Data
SIMD: Single Instruction Multiple Data
Current CPU cores can crunch up to 8 DP numbers at a time; GPU streaming cores crunch 2-4 DP numbers each, but there are several thousand streaming cores per GPU.

SISD: a single data input and result is processed each clock, stored in scalar format.
SIMD: streams of input data and results are stored in vectors (packed format) and processed with SSE, AVX and FMA instructions, several elements per clock; vector slots without useful data are bubbles (no work).

SIMD allows processing of more data, but the data needs to be formatted/packed to fit the vectors. THAT IS THE CHALLENGE.
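
As an illustration of that challenge (a sketch, not taken from the deck): the first loop below has independent iterations and can be packed into SSE/AVX registers by an optimizing compiler, while the second carries a dependency from one iteration to the next and stays scalar. File name and flags are examples only.

    /* Vectorization sketch: independent iterations can be packed (SIMD),
       a loop-carried dependency forces scalar (SISD) code.
       Try e.g.: gcc -O3 -ftree-vectorize -fopt-info-vec -c vec_demo.c */
    #include <stddef.h>

    /* Independent elements: the compiler can emit packed adds/multiplies here. */
    void axpy(double *restrict y, const double *restrict x, double a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Each iteration needs the previous result: compiled as scalar MULSD/ADDSD. */
    double recurrence(const double *x, double a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s = a * s + x[i];
        return s;
    }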
A few slides follow from the presentation “Understanding Interlagos arch through HPL” (HPC Advisory Council workshop, ISC 2012); the slide images are not reproduced in this transcript.
Roofline for AMD Interlagos
System: 2P, 2.3 GHz, 16 compute units (32 cores), 1600 MHz DDR3.

Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3 GHz x 8 DP FLOP/clk/core-pair x 0.85 eff ≈ 250 DP GF/s
Very high arithmetic intensity, i.e. (FLOP/s) / (Byte/s):
          - use of the AMD Core Math Library (FMA4 instructions)
          - cache friendly
          - reuse of data
          - DGEMM is level-3 BLAS with arithmetic intensity of order N (the problem size)

Real GB/s: 72 GB/s (STREAM benchmark)
Low arithmetic intensity:
          - use of non-temporal stores (write-combining buffers instead of evicting data through L2 -> L3 -> RAM, to speed up the writes to RAM)
          - not cache friendly
          - no reuse of data
          - cores wait for data most of the time (low FLOP/clk despite using SSE2, FMA4)
          - STREAM is level-1 BLAS with arithmetic intensity of order 1 (independent of problem size)
AMD Fangio, FPU capping
• Fangio (model 6275) is the Interlagos 6276 processor with the FPU capped from 8 DP FLOP/clk to 2 DP FLOP/clk per compute unit, by slowing down the retirement of FPU instructions.
• It keeps the same instruction set architecture as Interlagos.
• System: 2P, 2.3 GHz, 16 compute units (32 cores), 1600 MHz DDR3.

Performance impact depends on the workload:
Real GFLOP/s in double precision (Linpack benchmark):
2 procs x 8 core-pairs x 2.3 GHz x 2 DP FLOP/clk/core-pair = 73.6 DP GF/s nominal peak;
HPL delivers roughly 75 DP GF/s (slightly above nominal peak thanks to boost; see the runs below).
Real GB/s: 72 GB/s (STREAM benchmark):
memory throughput performance is unchanged!
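
The roofline model below makes this expectation concrete: attainable GF/s = min(peak GF/s, arithmetic intensity x memory bandwidth). A small sketch using the deck’s numbers (nominal peaks, 72 GB/s STREAM bandwidth; the AI values for HPL and triad come from the summary table later in the deck). This is illustrative only, not code from the presentation.

    /* Roofline sketch: attainable GF/s = min(peak, AI * bandwidth). */
    #include <stdio.h>

    static double attainable(double peak_gfs, double bw_gbs, double ai) {
        double mem_limit = ai * bw_gbs;
        return mem_limit < peak_gfs ? mem_limit : peak_gfs;
    }

    int main(void) {
        double bw = 72.0;                   /* GB/s, same for both processors          */
        double peak6276 = 2 * 8 * 2.3 * 8;  /* 294.4 DP GF/s nominal (6276)            */
        double peak6275 = 2 * 8 * 2.3 * 2;  /*  73.6 DP GF/s nominal (6275, capped FPU)*/

        printf("HPL   (AI 10.6): 6276 %.0f GF/s vs Fangio %.0f GF/s -> large drop\n",
               attainable(peak6276, bw, 10.6), attainable(peak6275, bw, 10.6));
        printf("Triad (AI 0.08): 6276 %.1f GF/s vs Fangio %.1f GF/s -> no drop\n",
               attainable(peak6276, bw, 0.08), attainable(peak6275, bw, 0.08));
        return 0;
    }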
HPL runs to confirm FPU capping
2P 16cores @ 2.3GHz (6276 Interlagos)
==============================================================================
T/V                    N        NB P Q                 Time                Gflops
--------------------------------------------------------------------------------
WR01R2L4            86400 100 4 8                     1774.55             2.423e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=                                0.0022068 ...... PASSED
==============================================================================


2P 16cores @ 2.3GHz (6275 Interlagos) Fangio (about 3x longer run time)
==============================================================================
T/V                    N       NB P Q                 Time               Gflops
--------------------------------------------------------------------------------
WR01R2L4            86400     100 4 8               5494.39            7.826e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=                0.0022316 ...... PASSED
==============================================================================
Measured: 78.26 GF/s against a theoretical peak of 16 CU x 2 FLOP/clk/CU x 2.3 GHz = 73.6 GF/s,
i.e. 106% HPL efficiency (!), possible because the cores boost above the nominal 2.3 GHz.
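
As a cross-check (not in the original deck), the reported Gflops can be reproduced from N and the wall time, since HPL performs roughly 2/3*N^3 + 2*N^2 floating-point operations. A small sketch reproducing both runs above and the 106% efficiency figure:

    /* Cross-check of the HPL figures above. */
    #include <stdio.h>

    static double hpl_gflops(double n, double seconds) {
        return (2.0 / 3.0 * n * n * n + 2.0 * n * n) / seconds / 1e9;
    }

    int main(void) {
        double n = 86400.0;
        double peak_6276 = 2 * 8 * 2.3 * 8;  /* 294.4 GF/s nominal              */
        double peak_6275 = 2 * 8 * 2.3 * 2;  /*  73.6 GF/s nominal (capped FPU) */

        double g6276 = hpl_gflops(n, 1774.55);  /* ~242 GF/s, ~82% of nominal peak in this run */
        double g6275 = hpl_gflops(n, 5494.39);  /* ~78 GF/s, ~106% of nominal peak (boost)     */
        printf("6276: %.1f GF/s (%.0f%% of nominal peak)\n", g6276, 100 * g6276 / peak_6276);
        printf("6275: %.1f GF/s (%.0f%% of nominal peak)\n", g6275, 100 * g6275 / peak_6275);
        return 0;
    }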
STREAM runs on 6275 confirm no drop in memory throughput
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:          73089.4045     0.0443     0.0438     0.0449
Scale:         68952.3038     0.0469     0.0464     0.0472
Add:           66289.3072     0.0729     0.0724     0.0734
Triad:         66301.0957     0.0730     0.0724     0.0734
-------------------------------------------------------------
Scale, Add and Triad perform double-precision FLOPs. Triad is the one plotted in the
roofline model since it has the most FLOPs per element: an add and a multiply, fused with FMA4.
#pragma omp parallel for
      for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];
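
The triad kernel above performs 2 FLOPs per element (a multiply and an add, fused by FMA4) and, as counted by STREAM, moves 24 bytes per element (read b[j] and c[j], write a[j]), giving an arithmetic intensity of 2/24 ≈ 0.08 FLOP/byte, the value used in the summary table on the next slide. Below is a self-contained sketch of the kernel (illustrative only, not the official STREAM source) that reports bandwidth, GF/s and AI; compile with e.g. gcc -O2 -fopenmp triad.c.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double scalar = 3.0;

        /* Parallel initialization so first-touch spreads pages across NUMA nodes. */
        #pragma omp parallel for
        for (long j = 0; j < N; j++) { a[j] = 0.0; b[j] = 1.0; c[j] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];
        double t = omp_get_wtime() - t0;

        double bytes = 3.0 * N * sizeof(double);  /* read b, read c, write a (STREAM counting) */
        double flops = 2.0 * N;                   /* one add + one multiply per element        */
        printf("Triad: %.1f GB/s, %.2f GF/s, AI = %.3f FLOP/byte (a[N/2]=%.1f)\n",
               bytes / t / 1e9, flops / t / 1e9, flops / bytes, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }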
Summary of measurements per node, plotted in the roofline model

Note: 2.3 GHz is the nominal frequency, not the effective frequency under boost.
HPL efficiency per compute unit: Interlagos 6276 (2 x 7.8) / (2.3 GHz x 8) = 85%; Fangio 6275 (2 x 2.3) / (2.3 GHz x 2) = 100% (!)

Workload       | Interlagos 6276                   | Interlagos 6275 “Fangio”
               | GB/s      DP GF/s      AI = F/B   | GB/s       DP GF/s      AI = F/B
HPL            | 6*4=24    7.8*32=250   10.6       | 1.8*4=6.8  2.3*32=75    11.7
STREAM TRIAD   | 17*4=68   0.5*32=16    0.08       | 17*4=68    0.5*32=16    0.08
OPENFOAM       | 15*4=60   0.8*32=25    0.41       | 14*4=56    0.7*32=22    0.39

• 1 compute node has 2 processors, with a total of 4 NUMA nodes and 32 cores in 16 compute units.
• 1 NUMA node has a total of 4 compute units.
• Memory bandwidth in GB/s is measured per NUMA node (hence the x4 factor in the table).
• Double-precision floating point throughput is measured per core (hence the x32 factor in the table).

Roofline for Interlagos and Fangio (measured and plotted)
Axes: GF/s on a log2 scale versus arithmetic intensity, (GF/s)/(GB/s), from 0.125 to 32 (log2 scale).

• Both processors have the same memory bandwidth, i.e. the same bandwidth slope.
• Compute ceilings: AMD Interlagos at 250 GF/s, AMD Fangio at 75 GF/s.
• HPL sits in the compute-bound region; L3 BLAS (e.g. DGEMM) benefits from vectorization (FMA, AVX, SSE): ~75% perf drop on Fangio.
• SPECfp: 3-20% perf drop, 8% on average.
• Sparse algebra such as CFD apps (OpenFOAM, FLUENT, STAR-CCM+, ...): ~6-8% perf drop.
• STREAM TRIAD sits on the bandwidth slope: 0% perf drop.
• The low-intensity region is dominated by data dependencies and scalar code, with no benefit from vectorization.
Performance impact on SPEC fp 2006 rate peak
(The original slide also includes a resource-utilization chart, not reproduced here.)

• SPEC website: www.spec.org
• Runs were done with the peak-flags configuration in order to make the best use of compiler technology.
• In this case the Open64 compiler was used.
• Runs were done with only 1 copy per Bulldozer compute unit, to allow each process/copy to fully utilize the available computing resources without constraints originating from the resources shared within the Bulldozer compute unit (e.g. L2 cache, FPU, instruction scheduler).



Performance impact on SPEC fp 2006 rate peak (cont.)

Benchmark  | Application area                  | Brief description                                                                   | % perf. drop
Bwaves     | Fluid Dynamics                    | 3D transonic transient laminar viscous flow.                                        | 0.09%
Gamess     | Quantum Chemistry                 | Self-consistent field calculations using the Restricted Hartree-Fock method.        | -10.51%
Milc       | Quantum Chromodynamics            | Gauge field generating program for lattice gauge theory with dynamical quarks.      | 0.10%
Zeusmp     | Fluid Dynamics                    | NCSA code, CFD simulation of astrophysical phenomena.                               | -7.47%
Gromacs    | Biochemistry / Molecular Dynamics | Newtonian equations of motion for hundreds to millions of particles. GPU candidate. | -32.17%
cactusADM  | General Relativity                | Solves the Einstein evolution equations using a staggered-leapfrog method.          | -2.01%
Leslie3d   | Fluid Dynamics                    | CFD, Large Eddy Simulation.                                                         | -0.44%
Namd       | Biology / Molecular Dynamics      | Large biomolecular systems; the test case has 92,224 atoms of apolipoprotein A-I. GPU candidate. | -24.23%
Benchmark  | Application area                  | Brief description                                                                   | % perf. drop
dealII     | Finite Element Analysis           | Adaptive finite elements and error estimation; Helmholtz-type equation.             | -9.09%
Soplex     | Linear Programming / Optimization | Simplex algorithm and sparse linear algebra; test cases include railroad planning and military airlift models. | 1.86%
Povray     | Image Ray-tracing                 | Image rendering; the test case is a 1280x1024 anti-aliased image of a landscape.    | -12.15%
Calculix   | Structural Mechanics              | Finite element code for linear and nonlinear 3D structural applications. GPU candidate. | -26.82%
GemsFDTD   | Computational Electromagnetics    | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method. | -0.67%
Tonto      | Quantum Chemistry                 | Molecular Hartree-Fock wavefunction calculation to better match experimental X-ray diffraction data. | -14.43%
Lbm        | Fluid Dynamics                    | “Lattice-Boltzmann Method” to simulate incompressible fluids in 3D.                 | 0.58%
Wrf        | Weather                           | Weather modeling from scales of meters to thousands of kilometers.                  | -0.95%
Sphinx3    | Speech recognition                | Speech recognition system from Carnegie Mellon University.                          | -3.00%

AVERAGE REAL PERFORMANCE DROP WHEN THE THEORETICAL FLOPs ARE REDUCED BY 75%: -8.94%
Performance impact on CFD apps
• Most CFD apps with an Eulerian formulation use sparse linear algebra to represent the linearized Navier-Stokes equations on unstructured grids.
• The higher the order of the discretization scheme, the higher the arithmetic intensity.
• Data dependencies in both space and time prevent vectorization.
• Large datasets have low cache reuse.
• Cores spend most of the time waiting for new data to arrive in the caches.
• Once the data is in the caches, the floating point instructions are mostly scalar instead of packed.
• Compilers have a hard time finding opportunities to vectorize the loops (see the sparse matrix-vector sketch below).
• Loop unrolling and partial vectorization of independent data help very little, because the cores are waiting for that data anyway.
• Overall, low performance from a FLOP/s point of view.
• Therefore, capping the FPU in terms of FLOPs/clk barely impacts these applications’ performance.
• Theoretical FLOP/s is therefore not a good indicator of how applications such as CFD (and many more) will perform.
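
As an example, the sparse matrix-vector product at the heart of most implicit CFD solvers looks like the generic CSR sketch below (not taken from any of the codes named above). The indirect access x[col[k]] is a gather that defeats packing and keeps the loop memory bound: only ~2 FLOPs are performed per value, index and vector element fetched.

    /* Sketch of a CSR sparse matrix-vector product y = A*x. */
    #include <stddef.h>

    void spmv_csr(size_t nrows,
                  const size_t *row_ptr,   /* nrows+1 entries: start of each row */
                  const int *col,          /* column index per nonzero           */
                  const double *val,       /* nonzero values                     */
                  const double *x,
                  double *y) {
        for (size_t i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* indirect load: mostly scalar code */
            y[i] = sum;
        }
    }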
What should we do, moving forward?
• Multidisciplinary teams to work on:
   – Algorithm research and development, to make algorithms more hardware aware.
   – Software research and development, to implement the algorithms efficiently (e.g. communication avoidance, dynamic task scheduling, work stealing, locality, power awareness, resilience, ...).
   – Interaction between domain scientists and computer (HW+SW) scientists, to develop new formulations of the equations that yield algorithms better suited to new computer architectures.
   – Research and development on compiler and programming-language technology, to detect algorithm properties and exploit hardware features.
• Supercomputing datacenter institutions to work on:
   – Enabling science by proper exploitation of the computational resources.
   – Multidisciplinary teams educating scientists on how to use the resources.
   – Funding and measuring supercomputing investments in terms of the number and quality of scientific projects, not in terms of CPU utilization (e.g. CPU utilization isn’t CPU efficiency, just as theoretical FLOPs aren’t real application performance).
