1. PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS ON SEAMICRO FABRIC COMPUTE SYSTEMS
JOSHUA MORA
2. 2
ABSTRACT
Seamicro fabric compute systems offer an array of low-power compute nodes interconnected
with a 3D torus network fabric (branded Freedom Supercomputer Fabric).
This network topology enables very efficient point-to-point communication in which only
neighboring compute nodes are involved.
This communication pattern arises in a wide variety of distributed-memory applications,
such as 3D Finite Difference computational stencils, which appear in many computationally
expensive scientific applications (e.g. seismic imaging, computational fluid dynamics).
We present a performance analysis (computation, communication, scalability) of a generic 3D
Finite Difference computational stencil on such a system.
With this analysis we aim to demonstrate the suitability of Seamicro fabric compute systems
for HPC applications that exhibit this communication pattern.
4. 4
HW OVERVIEW
CHASSIS: FRONT AND BACK VIEWS
5. 5
HW OVERVIEW
CHASSIS: SIDE VIEW
Total of 4 quadrants x 16 compute cards plugged in at both sides = 64 compute cards.
Total of 8 storage cards x 8 drives each, plugged in at the front.
6. 6
HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
AMD Opteron™ 4365EE processor, up to 64GB RAM @ 1333MHz
[Block diagram: CPU + RAM attached through a PCI chipset to fabric nodes FB 1 through FB 8]
7. 7
HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
AMD Opteron™ 4365EE processor
8 “Piledriver” cores, AVX, FMA3/4
2.0GHz core frequency
Max Turbo Core frequency up to 3.1GHz
40W TDP
[Die diagram: 8 cores in 4 compute units (one L2 cache per core pair), shared L3 cache,
northbridge with DRAM controller (2 memory channels) and HyperTransport PHY]
8. 8
HW OVERVIEW
STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS
Support for RAID and non-RAID.
8 x 2.5” HDDs, 7.2k-15k rpm, 500GB-1TB,
or 8 SSD drives, 80GB-2TB.
System can operate without disks.
[Block diagram: CPU + RAM attached through a PCI chipset to fabric nodes FB 1 through FB 8,
plus a SATA controller driving Disk1 through Disk8]
9. 9
HW OVERVIEW
MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE
2 x 10Gb Ethernet Module
‒ External ports: 2 Mini SAS, 2 x 10GbE SFP+
‒ External port bandwidth: 20 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
8 x 1Gb Ethernet Module
‒ External ports: 2 Mini SAS, 8 x 1GbE 1000BaseT
‒ External port bandwidth: 8 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
10. 10
HW OVERVIEW
FABRIC TOPOLOGY: 3D TORUS
3D torus network fabric, 8 x 8 x 8 fabric nodes.
Diameter (max hop count): 4 + 4 + 4 = 12.
Theoretical cross-section bandwidth = 2 (periodic) x 8 x 8 (section) x 2 (bidirectional)
x 2.0 Gbps/link = 512 Gb/s.
Compute, storage, and management cards are plugged into the network fabric.
Support for hot-plugged compute cards.
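The torus figures above can be checked with a short sketch (node count and link rate taken from this slide):

```python
# Sketch: sanity-check the 3D-torus figures quoted above (8x8x8 fabric
# nodes, 2.0 Gbps links; all parameters come from the slide).
n = 8                    # fabric nodes per dimension
link_gbps = 2.0          # per-link bandwidth in each direction

# On a periodic ring of 8 nodes the farthest node is 4 hops away,
# so the torus diameter is 4 + 4 + 4 = 12.
diameter = 3 * (n // 2)

# Halving the torus severs an 8x8 cross-section of links on two faces
# (periodic wrap-around), each link counted bidirectionally.
cross_section_gbps = 2 * (n * n) * 2 * link_gbps

print(diameter, cross_section_gbps)  # 12 512.0
```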
14. 14
MICRO-BENCHMARKS
CPU, POWER
Benchmark: HPL, leveraging FMA3/4
Single compute card
‒ 2.0GHz x 4 CUs x 8 DP FLOPs/clk/CU x 0.83 efficiency = 53 DP GFLOPs/sec per compute card
‒ 40W TDP processor, 60W per compute card running HPL
=========================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01L2L4 40000 100 2 4 795.23 5.366e+01 (83% efficiency)
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033267 ...... PASSED
=========================================================================
Chassis with 64 compute cards
‒ 2.95 DP Teraflops/sec per chassis
‒ 72% HPL efficiency per chassis (MPI over Ethernet)
‒ 5.6kW full chassis running HPL (including power for storage, network fabric and fans).
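The per-card peak arithmetic above can be reproduced directly (clock, compute-unit count, and FLOPs/clock as stated on the slide; 5.366e+01 Gflops is the measured HPL result):

```python
# Sketch: the per-card HPL arithmetic from this slide.
ghz = 2.0             # core frequency
cus = 4               # compute units (2 "Piledriver" cores each)
dp_flops_per_clk = 8  # DP FLOPs per clock per compute unit

peak_gflops = ghz * cus * dp_flops_per_clk   # theoretical peak per card
measured_gflops = 53.66                      # from the HPL output above
efficiency = measured_gflops / peak_gflops   # ~0.84, quoted as 83% above

print(peak_gflops, round(efficiency, 2))  # 64.0 0.84
```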
15. 15
MICRO-BENCHMARKS
MEMORY, POWER
Benchmark STREAM
Single Compute card @ 1333MHz memory frequency
‒ 15GB/s
Function   Best Rate MB/s   Avg time    Min time    Max time
Copy:           14647.1     0.181456    0.181333    0.181679
Scale:          15221.7     0.175883    0.175615    0.176168
Add:            14741.2     0.270838    0.270557    0.271005
Triad:          15105.2     0.269939    0.269585    0.270251
‒ Power 15 W idle per card.
‒ Power 30 W stream per card.
STREAM chassis
‒ Chassis with 64 compute cards. 960 GB/s (~ 1TB/s) per chassis.
‒ 4.9kW full chassis running Stream (including power for storage, fabric and fans).
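The chassis aggregate and the machine balance implied by the STREAM and HPL numbers can be sketched as follows (assumes linear scaling across the 64 cards, which is reasonable since cards do not share memory controllers):

```python
# Sketch: chassis-level STREAM aggregate and sustained bytes/flop balance,
# using the triad and HPL figures from the two previous slides.
triad_gbs = 15.1          # GB/s per card, STREAM triad (above)
cards = 64

chassis_gbs = triad_gbs * cards     # ~966 GB/s, quoted as "~1 TB/s"
hpl_gflops = 53.66                  # per card, from the HPL slide
balance = triad_gbs / hpl_gflops    # sustained bytes per flop

print(round(chassis_gbs), round(balance, 2))  # 966 0.28
```

A balance of ~0.28 bytes/flop is why the stencil discussion later stresses that low-order schemes are extremely memory bound.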
16. 16
MICRO-BENCHMARKS
NIC TUNING
Ethernet-related tuning:
- Ethernet driver 8.0.35-NAPI
- InterruptThrottleRate 1,1,1,1,1,1,1,1 in e1000.conf (driver options)
- MTU 9000 (ifconfig)
- Interrupt balancing: pin each fabric node's interrupts to a different core.
MPI TCP tuning:
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
- -mca btl_tcp_eager_limit 1mb (default is 64kb)
UPC tuning:
‒ use UDP instead of MPI+TCP
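Put together, the MPI tunings above might look like the following mpirun invocation. This is a sketch only: the hostfile name, rank count, binary name, and the full eth0-eth7 interface list are assumptions (the slide elides the exact list).

```shell
# Hypothetical launch combining the Open MPI TCP tunings listed above.
# 1mb eager limit expressed in bytes; interface list assumed to be eth0-eth7.
mpirun -np 512 --hostfile hosts.txt \
    --mca btl_tcp_if_include eth0,eth1,eth2,eth3,eth4,eth5,eth6,eth7 \
    --mca btl_tcp_eager_limit 1048576 \
    ./stencil3d
```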
17. 17
MICRO-BENCHMARKS
NIC TUNING
MPI-related tuning:
8 Ethernet networks, one per fabric node, across all 64 compute cards.
Open MPI over Ethernet (TCP) defaults to using all networks.
This can be restricted with arguments passed to the mpirun command or in the
openmpi.conf file:
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
Point-to-point communications:
Latency: 30-36 usec
Bandwidth: linear scaling from 1 to 8 fabric nodes
1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional
8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
18. 18
MICRO-BENCHMARKS
NETWORK PERFORMANCE
Point-to-point benchmark setup:
Measure the aggregated bandwidth of 1 CPU core striping over 1, 2, 4, and 8 fabric nodes
between any 2 compute cards in the chassis.
[Diagram: two compute cards (CPU + RAM + PCI chipset), each attached to fabric nodes
FB 1 through FB 8, communicating across the fabric]
20. 20
MICRO-BENCHMARKS
NETWORK PERFORMANCE
Message rate setup:
Every other core sends messages through each fabric node to another core on
another compute card.
4 pairs of MPI processes send data striped across the 8 fabric nodes until
the bandwidth of the fabric is maxed out.
[Diagram: cores c0, c2, c4, c6 on each of two compute cards exchanging messages
striped across fabric nodes FB 1 through FB 8]
28. 28
APPLICATION
COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION
Typically several coupled linear systems, depending on the complexity of the phenomenology
(CFD usually no fewer than 5 to 7: U, V, W, P, T, k, e).
Compute card: up to 64GB, 8 cores. Up to 8GB/core. Up to 1GB per linear system per core.
25-coefficient matrix (25 vectors) + unknown (x) + right-hand side (b) + residual vector (r)
+ auxiliary vectors (t) ~ 30 vectors (in 3D).
1 single precision (SP) float is 4 bytes.
(1GB / (30 vectors x 4B/float))^(1/3) ≈ 200 points in each direction per core.
Each core can crunch 8 linear systems one after another with a volume of 200x200x200 points.
Each core exchanges halos (data needed for computation but computed on neighbor cores) with a
width of 4 points with its 6 neighbors (West, East, South, North, Bottom, Top) for 3D partitioning.
200 x 200 x 4 halo points x 4B/float = 0.61MB exchanged with each of the 6 neighbors per linear
system at every computation of 200x200x200 points.
200x200x200 points x 4B/float = 30MB to checkpoint; remember that HDDs have 64MB cache.
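The sizing arithmetic on this slide can be written out explicitly (30 single-precision vectors per linear system, a ~1 GB budget per system per core, all as stated above):

```python
# Sketch: the per-core sizing arithmetic from this slide.
bytes_per_float = 4
vectors = 30                  # 25 coefficients + x + b + r + auxiliaries
budget_bytes = 1e9            # ~1 GB per linear system per core

# Points per direction: cube root of (budget / bytes per grid point).
points = (budget_bytes / (vectors * bytes_per_float)) ** (1.0 / 3.0)

# Halo face: 200 x 200 points, 4 wide, single precision.
halo_mb = 200 * 200 * 4 * bytes_per_float / 2**20

# Checkpoint: one 200^3 single-precision field.
ckpt_mb = 200**3 * bytes_per_float / 2**20

print(round(points), round(halo_mb, 2), round(ckpt_mb, 1))  # 203 0.61 30.5
```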
29. 29
APPLICATION
IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO
Advantages when using high-order schemes:
‒ Reduction of grid size at higher order (2nd, 4th, 8th) for the same accuracy.
‒ Higher Flop/byte and Flop/W ratios at higher order, due to better utilization of
vector instructions. Implementation dependent; otherwise extremely memory bound.
‒ Better network bandwidth utilization due to the larger halo message size at higher order.
‒ Tradeoff: higher communication volume for higher-order schemes.
‒ Can leverage multirail (MPI over multiple fabric nodes, as shown in the micro-benchmarks)
for neighbor communications.
‒ Larger messages provide more chances to overlap communication with computation.
‒ More tolerant of network latency.
30. 30
APPLICATION
P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE
[Diagram: a cubic subdomain with its six halo faces: West, East, North, South, Bottom, Top]
8 cores per compute card.
Multithreaded computation with OpenMP threads.
Threading only in k loops (i,j,k).
In the general case, 6 exchanges (the gray halo faces) with neighbor compute cards,
with a halo (message size) of 200x200x4.
Best HW mapping: 1x8x8 partitions
‒ No partitioning across X fabric nodes
‒ 8 partitions across Y fabric nodes
‒ 8 partitions across Z fabric nodes
Best algorithmic mapping: 4x4x4
‒ Less exchange than 1x8x8: 4 x (N x N/8) = 4/8 N^2 vs 6 x (N/4 x N/4) = 3/8 N^2
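The surface comparison between the two decompositions can be verified with exact fractions (N is the number of grid points per dimension):

```python
# Sketch: exchanged surface per subdomain for the two decompositions above,
# expressed as a fraction of N^2.
from fractions import Fraction

# 1x8x8: each subdomain is N x N/8 x N/8 and exchanges 4 faces of N*(N/8).
surface_1x8x8 = 4 * Fraction(1, 8)

# 4x4x4: each subdomain is (N/4)^3 and exchanges 6 faces of (N/4)^2.
surface_4x4x4 = 6 * Fraction(1, 16)

print(surface_1x8x8, surface_4x4x4)  # 1/2 3/8
```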
31. 31
APPLICATION
ITERATIVE ALGORITHM
[Flowchart: on each core (Core 1, Core 2, ..., Core 512 for a full chassis), solve linear
systems 1 through 8 in sequence, then test overall convergence: if not yet converged, repeat;
if converged, end. Legend: solving the linear system equations; checkpointing values;
bidirectional exchange of halo/communication with the neighbor domain hosted by another
core/processor/compute card.]
32. 32
APPLICATION
PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION
Compute card with 1 CPU = 1 NUMA node: no chance of NUMA misses.
Easy to leverage OpenMP within MPI code without having to worry about remote memory
accesses.
Hybrid MPI+OpenMP reduces the communication overhead of MPI over Ethernet.
3 compute units can max out the memory controller bandwidth (plenty of computing capability).
1 compute unit/core can be dedicated to I/O (MPI + checkpointing) to fully overlap with
computation stages.
Single core per CPU for MPI communications:
‒ Aggregating halo data for all threads to send more data per message.
‒ Leveraging MPI non-blocking communications for halo exchange.
‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth
(0.6MB/message).
‒ Hybrid reduction for inner products using Allreduce communication + OpenMP reduction.
33. 33
APPLICATION
PERFORMANCE SUMMARY
Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision
(no cheating with weak scaling).
Starting with 8 compute cards (1 z-plane), ~55GB per card (64GB available per card), scaling
all the way to 64 cards (8 z-planes), ~1GB per core, for 512 cores in the chassis.
# compute   Computation   Speedup     Efficiency    Comm overhead wrt total time
cards       Mcells/sec    wrt 8 cards wrt 8 cards   Halo exchange   Reduction
8              273            8         100%            5.5%           6%
16             536          15.7        98.1%           5.5%           7%
32            1065          31.2        97.5%           5.7%           8%
64            2048          60.0        93.7%           5.8%          11%
[Chart: "3DFD speed up on Seamicro" ‒ measured speed up vs ideal, over 8, 16, 32, and 64
compute cards]
Halo exchange: no change in the total volume exchanged as the card count increases,
hence constant communication overhead.
Reduction: expected increase as the card count increases,
as shown in the Allreduce micro-benchmark.
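The speedup and efficiency columns can be recomputed from the Mcells/sec column alone (normalized so that 8 cards gives a speedup of 8, as in the table; small rounding differences versus the table are expected):

```python
# Sketch: recompute speedup and efficiency relative to the 8-card baseline
# from the Mcells/sec column of the table above.
mcells = {8: 273, 16: 536, 32: 1065, 64: 2048}
per_card_base = mcells[8] / 8          # baseline throughput per card

for cards, rate in sorted(mcells.items()):
    speedup = rate / per_card_base     # normalized so 8 cards -> 8
    efficiency = speedup / cards
    print(cards, round(speedup, 1), f"{efficiency:.1%}")
```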
34. 34
CONCLUSIONS
Proven suitability (i.e. scalability) for 3D finite difference stencil computations,
leveraging the latest software programming paradigms (MPI + OpenMP).
‒ This is a proxy for many other High Performance Computing applications with
similar computational requirements (e.g. manufacturing, oil and gas, weather...).
System advantages:
‒ High computing density (performance and performance per Watt) in a 10U form
factor, per compute/storage card.
‒ Scalability provided through the Seamicro fabric.
‒ High flexibility in compute, network, and storage configurations, adjusted to your
workload requirements as demonstrated in this application.