The document discusses the performance analysis of 3D finite difference computational stencils on Seamicro fabric compute systems. It provides an overview of the hardware, including the chassis, compute cards, storage cards, and 3D torus fabric topology. It then describes the software stack and various micro-benchmarks performed, including CPU, memory, network and storage benchmarks. It also describes the modeling of the 3D Laplace equation using an 8th-order finite difference scheme and its discretization over a 25-point stencil for computation on the system.
CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora
1. PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS
ON SEAMICRO FABRIC COMPUTE SYSTEMS
JOSHUA MORA
2. ABSTRACT
Seamicro fabric compute systems offer an array of low-power compute nodes interconnected
by a 3D torus network fabric (branded Freedom Supercomputer Fabric).
This network topology allows very efficient point-to-point communications in which only
neighboring compute nodes are involved.
This type of communication pattern arises in a wide variety of distributed-memory applications,
such as 3D Finite Difference computational stencils, present in many computationally expensive
scientific applications (e.g. seismic imaging, computational fluid dynamics).
We present the performance analysis (computation, communication, scalability) of a generic 3D
Finite Difference computational stencil on such a system.
With this analysis we aim to demonstrate the suitability of Seamicro fabric compute systems for
HPC applications that exhibit this communication pattern.
5. HW OVERVIEW
CHASSIS: SIDE VIEW
[Figure: chassis side view]
Total of 8 storage cards x 8 drives each, plugged at the front.
Total of 4 quadrants x 16 compute cards, plugged at both sides.
6. HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
AMD Opteron™ 4365EE processor, up to 64GB RAM @ 1333MHz.
[Figure: compute card block diagram, CPU and RAM attached through the PCI chipset to fabric nodes FB 1 through FB 8]
7. HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
AMD Opteron™ 4365EE processor:
‒ 8 “Piledriver” cores (Core 0 through Core 7), AVX, FMA3/4
‒ 2.0GHz core frequency, max Turbo core frequency up to 3.1GHz
‒ 40W TDP
‒ L2 cache per compute unit (4 in total), shared L3 cache
‒ Northbridge with DRAM controller and 2 memory channels, HT PHY, nCHT
[Figure: processor block diagram]
8. HW OVERVIEW
STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS
Support for RAID and non-RAID.
8 HDD 2.5” 7.2k-15k rpm, 500GB-1TB, or 8 SSD drives, 80GB-2TB.
System can operate without disks.
[Figure: storage card block diagram, CPU and RAM attached through the PCI chipset and SATA controller to Disk1 through Disk8 and fabric nodes FB 1 through FB 8]
9. HW OVERVIEW
MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE
2 x 10Gb Ethernet Module
‒ External ports: 2 Mini SAS, 2 x 10GbE SFP+
‒ External port bandwidth: 20 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
8 x 1Gb Ethernet Module
‒ External ports: 2 Mini SAS, 8 x 1GbE 1000BaseT
‒ External port bandwidth: 8 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
10. HW OVERVIEW
FABRIC TOPOLOGY: 3D TORUS
3D torus network fabric
8 x 8 x 8 Fabric nodes
Diameter (max hops): 4 + 4 + 4 = 12
Theoretical cross-section bandwidth = 2 (periodic) x 8 x 8 (section) x 2 (bidir) x 2.0Gbps/link = 512Gb/s
Compute, storage and management cards are plugged into the network fabric.
Support for hot-plugged compute cards.
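As a worked example of the diameter figure above, here is a minimal C sketch (our own illustration, with names that are ours and not Seamicro's) of the hop count between two nodes of the 8x8x8 torus: each axis is a ring of 8, so the worst case per axis is 4 hops, giving the 4 + 4 + 4 = 12 diameter.

#include <stdlib.h>

/* Shortest distance between two positions on a ring of the given size:
   a torus axis can be traversed in either direction. */
static int ring_hops(int a, int b, int size) {
    int d = abs(a - b);
    return d < size - d ? d : size - d;   /* at most size/2 = 4 hops */
}

/* Hop count between two fabric nodes u and v of the 8x8x8 torus. */
int torus_hops(const int u[3], const int v[3]) {
    return ring_hops(u[0], v[0], 8) + ring_hops(u[1], v[1], 8)
         + ring_hops(u[2], v[2], 8);      /* diameter: 4 + 4 + 4 = 12 */
}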
14. MICRO-BENCHMARKS
CPU, POWER
Benchmark: HPL, leveraging FMA3/4
Single Compute card
‒ 2.0GHz * 4 CUs * 8 DP FLOPs/clk/CU * 0.83 efficiency = 53 DP GFLOPs/sec per compute card
‒ 40W TDP processor, 60W per compute card running HPL
=========================================================================
T/V                 N    NB   P   Q        Time          Gflops
-------------------------------------------------------------------------
WR01L2L4        40000   100   2   4      795.23       5.366e+01  (83% efficiency)
-------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0033267 ...... PASSED
=========================================================================
Chassis with 64 compute cards
‒ 2.95 DP Teraflops/sec per chassis
‒ 72% HPL efficiency per chassis (MPI over Ethernet)
‒ 5.6kW full chassis running HPL (including power for storage, network fabric and fans).
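The per-card arithmetic above can be checked with a few lines of C (all inputs taken from the slide; note that the 72% chassis figure is the measured HPL result over Ethernet, not derived from this formula):

#include <stdio.h>

int main(void) {
    const double ghz = 2.0, cus = 4, flops_per_clk_per_cu = 8, eff = 0.83;
    double peak = ghz * cus * flops_per_clk_per_cu;     /* 64 DP GFLOPs/sec  */
    printf("per card: peak %.0f, HPL %.1f DP GFLOPs/sec\n", peak, peak * eff);
    printf("chassis (64 cards): peak %.3f DP TFLOPs/sec\n", 64 * peak / 1000);
    return 0;
}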
15. MICRO-BENCHMARKS
MEMORY, POWER
Benchmark STREAM
Single Compute card @ 1333MHz memory frequency
‒ 15GB/s
Function   Best Rate MB/s   Avg time   Min time   Max time
Copy:           14647.1     0.181456   0.181333   0.181679
Scale:          15221.7     0.175883   0.175615   0.176168
Add:            14741.2     0.270838   0.270557   0.271005
Triad:          15105.2     0.269939   0.269585   0.270251
‒ Power: 15W idle per card.
‒ Power: 30W running STREAM per card.
STREAM chassis
‒ Chassis with 64 compute cards: 960 GB/s (~1TB/s) per chassis.
‒ 4.9kW full chassis running STREAM (including power for storage, fabric and fans).
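For reference, a minimal sketch of the STREAM triad kernel measured above; the array size is our assumption (STREAM requires arrays much larger than the caches), and the ~15GB/s corresponds to 24 bytes moved per iteration across the two DDR3-1333 memory channels:

#define N 20000000L   /* assumed array size, far larger than the L3 cache */

/* STREAM triad: 2 loads + 1 store of 8 bytes each per iteration. */
void triad(double *a, const double *b, const double *c, double scalar) {
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
}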
16. MICRO-BENCHMARKS
NIC TUNING
Ethernet related tuning:
‒ Ethernet driver, 8.0.35-NAPI
‒ InterruptThrottleRate 1,1,1,1,1,1,1,1 in e1000.conf (driver options)
‒ MTU 9000 (ifconfig)
‒ Balance interrupts of the fabric nodes across different cores.
MPI TCP tuning:
‒ -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
‒ -mca btl_tcp_eager_limit 1mb (default is 64kb)
UPC tuning:
‒ Using UDP instead of MPI+TCP
17. MICRO-BENCHMARKS
MPI TUNING
MPI related tuning:
8 Ethernet networks, one per fabric node, across all 64 compute cards.
OpenMPI with Ethernet (TCP) defaults to using all networks.
Can be restricted with arguments passed to the mpirun command or in the openmpi.conf file:
‒ -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
Point-to-point communications:
‒ Latency: 30-36 usec
‒ Bandwidth: linear scaling from 1 to 8 fabric nodes
‒ 1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional
‒ 8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
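A hedged sketch of how such point-to-point numbers are typically measured: an MPI ping-pong between two ranks placed on different compute cards. The message size and iteration count are our assumptions, not the benchmark actually used.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iters = 100, bytes = 1 << 20;   /* assumed: 1MB messages */
    int rank;
    char *buf = malloc(bytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {          /* ping */
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* pong */
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)   /* round trip moves 2*bytes, so this is per-direction */
        printf("unidirectional bandwidth: %.1f MB/s\n",
               2.0 * iters * bytes / dt / 1e6);
    MPI_Finalize();
    free(buf);
    return 0;
}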
18. MICRO-BENCHMARKS
NETWORK PERFORMANCE
[Figure: two compute cards, each with CPU, RAM and PCI chipset attached to fabric nodes FB 1 through FB 8, communicating through the fabric]
Point-to-point benchmark setup:
Measure the aggregated bandwidth of 1 CPU core over 1, 2, 4 and 8 fabric nodes between any 2 compute cards in the chassis.
20. MICRO-BENCHMARKS
NETWORK PERFORMANCE
[Figure: two compute cards; cores c0, c2, c4 and c6 on one card each send through the fabric nodes (FB 1 through FB 8) to the matching cores on the other card]
Message rate setup:
Every other core sends messages through each fabric node to another core on another compute card.
4 pairs of MPI processes send data striped across the 8 fabric nodes until maxing out the bandwidth of the fabric.
28. APPLICATION
COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION
Typically several coupled linear systems, depending on the complexity of the phenomenology (CFD usually no fewer than 5 to 7: U, V, W, P, T, k, e).
Compute card up to 64GB, 8 cores. Up to 8GB/core. Up to 1GB per linear system per core.
25-coefficient matrix (25 vectors) + unknown (x) + right-hand side (b) + residual vector (r) + auxiliary vectors (t) ≈ 30 vectors (in 3D).
1 single precision (SP) float is 4 bytes.
∛(1GB * (1 SPF/4B) / 30 vectors) ≈ 200 points in each direction per core.
Each core can crunch 8 linear systems, one after another, over a volume of 200x200x200 points.
Each core exchanges halos (data needed for computation but computed on neighbor cores) of width 4 points with its 6 neighbors (West, East, South, North, Bottom, Top) for a 3D partitioning.
200 x 200 x 4 halo (SPF) * (4B/1SPF) = 0.61MB exchanged with each of the 6 neighbors per linear system at every computation over 200x200x200 points.
200x200x200 * (4B/1SPF) = 30MB to checkpoint; remember that HDDs have a 64MB cache.
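The sizing above can be reproduced with a short C program (all inputs come from this slide; only the variable names are ours; compile with -lm for cbrt):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double bytes_per_core = 1e9;   /* 1 GB per linear system per core */
    const int vectors = 30;              /* 25 coefs + x, b, r, aux vectors */
    const int sp = 4;                    /* bytes per SP float              */

    double points = bytes_per_core / (vectors * sp);
    int n = (int)cbrt(points);           /* ~200 points per direction      */
    double halo_mb = (double)n * n * 4 * sp / 1e6;  /* 4-point-wide face   */
    double ckpt_mb = (double)n * n * n * sp / 1e6;  /* solution snapshot   */

    /* Approximately matches the slide's 200 / 0.61MB / 30MB figures,
       which round n down to exactly 200. */
    printf("n=%d, halo=%.2f MB, checkpoint=%.0f MB\n", n, halo_mb, ckpt_mb);
    return 0;
}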
29. APPLICATION
IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO
Advantages when using high order schemes:
‒ Reduction of grid size at higher order (2nd, 4th, 8th) for the same accuracy.
‒ Higher FLOP/byte and FLOP/W ratios at higher order, due to better utilization of
vector instructions. Implementation dependent; otherwise extremely memory bound.
‒ Better network bandwidth utilization due to the larger halo message size at higher
order.
Tradeoff: higher communication volume for higher order schemes:
‒ Can leverage multirail (MPI over multiple fabric nodes, as shown in the micro-benchmarks)
for neighbor communications.
‒ Larger messages provide more chances to overlap communication with computation.
‒ More tolerant of network latency.
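To make the FLOP/byte argument concrete, here is a minimal sketch of one sweep of an 8th-order (25-point) 3D stencil, consistent with the 4-point halos used earlier. The coefficient array c[0..4], the single-array layout and the inclusion of halos in n are our assumptions, not the author's implementation.

#include <stddef.h>

/* Linear index into an n x n x n array (halos included). */
#define IDX(i, j, k) ((((size_t)(i)) * n + (j)) * n + (k))

/* One 25-point sweep: center point plus 4 neighbors per side per axis
   (1 + 24 = 25), so the interior starts 4 points in from each face. */
void stencil_sweep(float *restrict v, const float *restrict u,
                   const float c[5], int n) {
    for (int i = 4; i < n - 4; i++)
        for (int j = 4; j < n - 4; j++)
            for (int k = 4; k < n - 4; k++) {
                float s = 3.0f * c[0] * u[IDX(i, j, k)];  /* center, 3 axes */
                for (int r = 1; r <= 4; r++)
                    s += c[r] * (u[IDX(i + r, j, k)] + u[IDX(i - r, j, k)]
                               + u[IDX(i, j + r, k)] + u[IDX(i, j - r, k)]
                               + u[IDX(i, j, k + r)] + u[IDX(i, j, k - r)]);
                v[IDX(i, j, k)] = s;
            }
}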
30. APPLICATION
P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE
8 cores per compute card.
Multithreaded computation with OpenMP threads; threading only in the k loop of (i,j,k).
In the general case, 6 halo exchanges with neighbor compute cards (West, East, South, North, Bottom, Top), with a halo (message size) of 200x200x4.
[Figure: subdomain with its 6 halo faces (gray areas) to exchange]
Best HW mapping: 1x8x8 partitions
‒ No partition across X fabric nodes
‒ 8 partitions across Y fabric nodes
‒ 8 partitions across Z fabric nodes
Best algorithm mapping: 4x4x4 partitions
‒ Less exchange than 1x8x8: 4 faces x (N x N/8) = (4/8) N^2 vs. 6 faces x (N/4 x N/4) = (3/8) N^2 per card; see the sketch below.
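The 1x8x8 vs. 4x4x4 comparison generalizes to any px x py x pz decomposition; a minimal sketch (our own formulation) of the halo surface exchanged per card:

/* Halo surface per card for a px x py x pz decomposition of an N^3
   grid: faces are exchanged only along axes that are actually
   partitioned, which is why 4x4x4 beats 1x8x8 on 64 cards. */
double halo_area(double N, int px, int py, int pz) {
    double a = 0.0;
    if (px > 1) a += 2.0 * (N / py) * (N / pz);
    if (py > 1) a += 2.0 * (N / px) * (N / pz);
    if (pz > 1) a += 2.0 * (N / px) * (N / py);
    return a;   /* halo_area(N,1,8,8) = N*N/2; halo_area(N,4,4,4) = 3*N*N/8 */
}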
31. APPLICATION
ITERATIVE ALGORITHM
[Flowchart: Core 1, Core 2, ..., Core 512 (full chassis) each solve linear systems 1 through 8 in sequence, with bidirectional exchange of halos with the neighbor domains hosted by other cores/processors/compute cards between solves, and checkpointing of the linear system values; if overall convergence is not yet reached, iterate again, otherwise end.]
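In code form, the outer loop of this flowchart might look as follows; this is a sketch only, and every function name is a hypothetical placeholder, not the author's API:

/* Hypothetical placeholders for the stages named in the flowchart. */
void solve_linear_system(int sys);   /* local 200^3 stencil sweeps   */
void exchange_halos(int sys);        /* bidirectional, 6 neighbors   */
void checkpoint_values(void);        /* ~30MB per core to storage    */
int  overall_convergence(void);      /* global convergence check     */

#define NUM_SYSTEMS 8                /* 8 coupled linear systems     */

void iterate(void) {
    do {
        for (int sys = 0; sys < NUM_SYSTEMS; sys++) {
            solve_linear_system(sys);
            exchange_halos(sys);
        }
        checkpoint_values();
    } while (!overall_convergence());
}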
32. APPLICATION
PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION
Compute card with 1 CPU = 1 NUMA node: no chance for NUMA misses.
Easy to leverage OpenMP within MPI code without having to worry about remote memory
accesses.
Hybrid MPI+OpenMP for reduction of the communication overhead of MPI over Ethernet.
3 compute units can max out the memory controller bandwidth (plenty of computing capability left).
1 compute unit/core could be dedicated to I/O (MPI + checkpointing) to fully overlap with
computation stages.
Single core per CPU for MPI communications:
‒ Aggregating halo data for all threads to send more data per message.
‒ Leveraging MPI non-blocking communications for halo exchange.
‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth
(0.6MB/message).
‒ Hybrid reduction for inner products using an Allreduce communication + OpenMP reduction; see the sketch below.
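A minimal sketch of the hybrid reduction named in the last bullet (our own illustration, assuming one MPI rank per compute card with OpenMP threads inside it):

#include <mpi.h>

/* Inner product of two distributed SP vectors: OpenMP reduces across
   the threads of the card, then one MPI_Allreduce combines the cards. */
double dot(const float *x, const float *y, long n, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (long i = 0; i < n; i++)
        local += (double)x[i] * (double)y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}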
33. APPLICATION
PERFORMANCE SUMMARY
Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision
(no cheating with weak scaling).
Starting with 8 compute cards (1 z plane), ~55GB per card (64GB available per card), and scaling
all the way to 64 cards (8 z planes), ~1GB per core, for 512 cores in the chassis.
# compute cards   Computation    Speedup        Efficiency     Halo exchange overhead   Reduction overhead
                  (Mcells/sec)   (wrt 8 cards)  (wrt 8 cards)  (wrt total time)         (wrt total time)
       8               273            8.0           100%              5.5%                     6%
      16               536           15.7           98.1%             5.5%                     7%
      32              1065           31.2           97.5%             5.7%                     8%
      64              2048           60.0           93.7%             5.8%                    11%

Halo exchange overhead: no change in the total volume exchanged when the card count increases, hence constant communication overhead.
Reduction overhead: expected to increase with the card count, as shown in the Allreduce micro-benchmark.
[Figure: 3DFD speed-up on Seamicro vs. ideal speed-up, from 8 to 64 compute cards]
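The derived columns of the table can be recomputed from the Mcells/sec measurements alone (a quick check, matching the table up to rounding):

#include <stdio.h>

int main(void) {
    const int    cards[]  = {8, 16, 32, 64};
    const double mcells[] = {273, 536, 1065, 2048};   /* from the table */
    for (int i = 0; i < 4; i++) {
        double speedup = 8.0 * mcells[i] / mcells[0]; /* wrt 8 cards */
        printf("%2d cards: speedup %4.1f, efficiency %5.1f%%\n",
               cards[i], speedup, 100.0 * speedup / cards[i]);
    }
    return 0;
}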
34. CONCLUSIONS
Proven suitability (i.e. scalability) for 3D finite difference stencil computations,
leveraging the latest software programming paradigms (MPI + OpenMP).
‒ This is a proxy for many other High Performance Computing applications with
similar computational requirements (e.g. manufacturing, oil and gas, weather...).
System advantages:
‒ High computing density (performance and performance per Watt) in a 10U form
factor, per compute/storage card.
‒ Scalability provided through the Seamicro fabric.
‒ High flexibility in compute, network and storage configurations, adjusted to your
workload requirements, as demonstrated in this application.