PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS ON SEAMICRO FABRIC COMPUTE SYSTEMS 
JOSHUA MORA
2 
ABSTRACT 
 Seamicro fabric compute systems offer an array of low-power compute nodes interconnected 
with a 3D torus network fabric (branded Freedom Supercomputer Fabric). 
 This network topology enables very efficient point-to-point communication in which only 
neighboring compute nodes are involved. 
 This communication pattern arises in a wide variety of distributed-memory applications, 
such as 3D Finite Difference computational stencils, which appear in many computationally expensive 
scientific applications (e.g. seismic imaging, computational fluid dynamics). 
 We present the performance analysis (computation, communication, scalability) of a generic 3D 
Finite Difference computational stencil on such a system. 
 We aim to demonstrate with this analysis the suitability of Seamicro fabric compute systems for 
HPC applications that exhibit this communication pattern.
3 
AGENDA 
HW overview 
‒Chassis, compute/storage cards, fabric 
SW stack description 
‒OS, Virtualization, MPI, File system 
Micro-benchmarks 
‒CPU, memory, network, storage 
Application 
‒Equations, computation, communication, check-pointing, scalability.
4 
HW OVERVIEW 
CHASSIS: FRONT AND BACK VIEWS 
FRONT BACK
5 
HW OVERVIEW 
CHASSIS: SIDE VIEW 
Total of 
4 (quadrants) 
x 16 compute cards plugged at both sides 
Total of 
8 storage cards 
x 8 drives each plugged at the front
6 
HW OVERVIEW 
 AMD Opteron™ 4365EE processor, up to 64GB RAM @ 1333MHz 
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8 
[Block diagram: CPU + RAM behind a PCI chipset, connected to fabric nodes FB 1 ... FB 8]
7 
HW OVERVIEW 
 AMD Opteron™ 4365EE processor 
 8 “Piledriver” cores, AVX, FMA3/4 
 2.0GHz core frequency 
 Max Turbo core frequency up to 3.1GHz 
 40W TDP 
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8 
[Block diagram: 8 Piledriver cores (4 compute units, each core pair sharing an L2 cache), shared L3 cache, Northbridge with DRAM controller (2 memory channels), HT PHY and nCHT links]
8 
HW OVERVIEW 
 Support for RAID and non-RAID configurations 
 8 HDDs: 2.5”, 7.2k-15k rpm, 500GB-1TB 
 or 8 SSD drives: 80GB-2TB 
 System can operate without disks 
STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS 
[Block diagram: CPU + RAM behind a PCI chipset, connected to fabric nodes FB 1 ... FB 8, plus a SATA controller driving Disk 1 ... Disk 8]
9 
HW OVERVIEW 
 2 x 10Gb Ethernet Module 
‒ External ports 
‒ 2 Mini SAS 
‒ 2 x 10GbE SFP+ 
‒ External Port Bandwidth: 
‒ 20 Gbps Full Duplex 
‒ Internal Bandwidth to Fabric: 
‒ 32 Gbps Full Duplex. 
 8 x 1Gb Ethernet Module 
‒ External ports 
‒ 2 Mini SAS 
‒ 8x 1GbE 1000BaseT 
‒ External Port Bandwidth: 
‒ 8 Gbps Full Duplex 
‒ Internal Bandwidth to Fabric: 
‒ 32 Gbps Full Duplex. 
MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE
10 
HW OVERVIEW 
FABRIC TOPOLOGY: 3D TORUS 
 3D torus network fabric 
 8 x 8 x 8 Fabric nodes 
 Diameter (max hop) 4 + 4 + 4 = 12 
 Theor. cross section bandwidth = 
2 (periodic) x 8 x 8 (section) x 
2(bidir) x 2.0Gbps/link = 512Gb/s 
 Compute, storage, mgmt cards 
are plugged into the network fabric. 
 Support for hot plugged compute cards.
11 
AGENDA 
HW overview 
‒Chassis, compute/storage/management cards, fabric 
SW stack description 
‒OS, Virtualization, MPI, File system 
Micro-benchmarks 
‒CPU, memory, network, storage 
Application 
‒Equations, computation, communication, check-pointing, scalability.
12 
SW STACK DESCRIPTION 
 Overall System Management 
‒ Command Line Interface 
 NO CUSTOM INSTALLATION REQUIRED 
‒OS support 
‒Linux (RH, SLES, CentOS, Ubuntu), Windows® 
‒Virtualization 
‒VMware, Xen, KVM, HyperV 
‒Network SW stack 
‒Everything that runs on top of Ethernet HW. 
‒File systems 
‒Local, shared, parallel. 
‒Distributed memory programming 
‒MPI, UPC, ..
13 
AGENDA 
HW overview 
‒Chassis, compute/storage/management cards, fabric 
SW stack description 
‒OS, Virtualization, MPI, File system 
Micro-benchmarks 
‒CPU, memory, network, storage 
Application 
‒Equations, computation, communication, check-pointing, scalability.
14 
MICRO-BENCHMARKS 
CPU, POWER 
 Benchmark: HPL, leveraging FMA3/4 
 Single Compute card 
‒ 2.0 GHz * 4 CUs * 8 DP FLOPs/clk/CU * 0.83 efficiency = 53 DP GFLOPs/sec per compute card 
‒ 40W TDP processor, 60W per compute card running HPL 
========================================================================= 
T/V N NB P Q Time Gflops 
-------------------------------------------------------------------------------- 
WR01L2L4 40000 100 2 4 795.23 5.366e+01 (83% efficiency) 
-------------------------------------------------------------------------------- 
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033267 ...... PASSED 
========================================================================= 
 Chassis with 64 compute cards 
‒ 2.95 DP Teraflops/sec per chassis 
‒ 72% HPL efficiency per chassis (MPI over Ethernet) 
‒ 5.6kW full chassis running HPL (including power for storage, network fabric and fans).
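As a sanity check on the figures above, here is a minimal sketch (plain C; the constants are copied from this slide rather than probed from hardware) that recomputes the theoretical peak and the HPL efficiencies quoted per card and per chassis:

```c
/* Hedged sketch: recomputes the peak-FLOPS arithmetic quoted on this slide.
 * All constants are the slide's figures, not values probed from hardware. */
#include <stdio.h>

int main(void)
{
    const double ghz           = 2.0;    /* core clock (GHz)                */
    const double compute_units = 4.0;    /* 4 Piledriver compute units      */
    const double dp_flops_clk  = 8.0;    /* DP FLOPs per clock per CU (FMA) */
    const double hpl_card      = 53.66;  /* measured GFLOP/s (5.366e+01)    */
    const double hpl_chassis   = 2950.0; /* measured GFLOP/s, 64 cards      */

    double peak_card    = ghz * compute_units * dp_flops_clk;   /* 64 GFLOP/s */
    double peak_chassis = 64.0 * peak_card;                     /* 64 cards   */

    printf("peak per card      : %.0f DP GFLOP/s\n", peak_card);
    printf("HPL eff. per card  : %.1f %%\n", 100.0 * hpl_card / peak_card);
    printf("peak per chassis   : %.2f DP TFLOP/s\n", peak_chassis / 1000.0);
    printf("HPL eff. chassis   : %.0f %%\n", 100.0 * hpl_chassis / peak_chassis);
    return 0;
}
```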
15 
MICRO-BENCHMARKS 
MEMORY, POWER 
 Benchmark STREAM 
 Single Compute card @ 1333MHz memory frequency 
‒ 15GB/s 
Function Best Rate MB/s Avg time Min time Max time 
Copy: 14647.1 0.181456 0.181333 0.181679 
Scale: 15221.7 0.175883 0.175615 0.176168 
Add: 14741.2 0.270838 0.270557 0.271005 
Triad: 15105.2 0.269939 0.269585 0.270251 
‒ Power 15 W idle per card. 
‒ Power 30 W stream per card. 
 STREAM chassis 
‒ Chassis with 64 compute cards. 960 GB/s (~ 1TB/s) per chassis. 
‒ 4.9kW full chassis running Stream (including power for storage, fabric and fans).
16 
MICRO-BENCHMARKS 
NIC TUNING 
 Ethernet related tuning: 
- Ethernet Driver, 8.0.35-NAPI 
- InterruptThrottleRate 1,1,1,1,1,1,1,1 at e1000.conf (driver options) 
- MTU 9000 (ifconfig) 
 Balance interrupts from the fabric nodes across different cores. 
 MPI TCP tuning 
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7 
- -mca btl_tcp_eager_limit 1mb (default is 64kb) 
 UPC tuning 
‒ using UDP instead of MPI+TCP
17 
MICRO-BENCHMARKS 
NIC TUNING 
 MPI related tuning: 
 8 Ethernet networks, one per fabric node across all 64 compute cards. 
 Open MPI over Ethernet/TCP defaults to using all networks. 
This can be restricted with arguments passed to the mpirun command or in the 
openmpi.conf file 
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7 
 Point to Point communications 
Latency: 30-36 usec 
Bandwidth: linear scaling from 1 to 8 fabric nodes 
1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional 
8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
18 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
 Point-to-point benchmark setup 
 Measure aggregated bandwidth of 1 CPU core over 1, 2, 4, and 8 fabric nodes 
between any 2 compute cards in the chassis.
[Diagram: two compute cards (CPU + RAM + PCI chipset), each attached to fabric nodes FB 1 ... FB 8, communicating across the fabric]
19 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
[Charts: unidirectional and bidirectional MPI bandwidth (MB/s) vs. message size (1 B to ~1 MB) for eth0, eth0-1, eth0-eth3 and eth0-eth7; annotated peaks of 120 MB/s and 960 MB/s unidirectional and 195 MB/s and 1500 MB/s bidirectional for 1 and 8 fabric nodes respectively.]
20 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
 Message rate setup: 
 Every other core sending messages through each fabric node to another core on 
another compute card. 
 4 pairs of MPI processes sending data striped across the 8 fabric nodes until 
maxing out bandwidth of the fabric. 
[Diagram: cores c0, c2, c4, c6 on one compute card paired with cores c0, c2, c4, c6 on another card, with messages striped across fabric nodes FB 1 ... FB 8]
21 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
 4KB message rate scalability 1,2,4,8 fabric nodes 
 Maxing out network bandwidth 
[Chart: 4KB MPI message rate (messages/second) vs. number of fabric nodes per compute card (1-8) for 1, 2 and 4 MPI pairs; annotated peak rates of 120K, 160K and 240K 4KB msg/s.]
22 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
 Allreduce setup, for inner products: <x,y> = Σ_{i=1}^{64} x_i * y_i 
 Models well with a binary-tree algorithm (MPI reduce + MPI broadcast): 
~30 usec * log2(64 cards) = 180 usec 
# compute cards    Elapsed time (usec)
2                  25.97
4                  54.41
8                  82.59
16                 110.31
32                 138.66
64 (chassis)       170.02
Note: the application described later 
computes the x_i * y_i terms as a multithreaded 
(OpenMP) inner product + reduction, 
followed by MPI_Allreduce (a sketch follows below).
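The note above maps directly onto a short hybrid kernel. Below is a minimal sketch, assuming one MPI rank per compute card with OpenMP threads across its 8 cores; the function name and the double-precision accumulator are illustrative choices, not the author's code:

```c
/* Hedged sketch: local OpenMP reduction + MPI_Allreduce, as used for <x,y>.
 * x, y and n describe the locally owned part of the distributed vectors. */
#include <mpi.h>
#include <omp.h>

double dot_allreduce(const float *x, const float *y, long n, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;

    /* multithreaded partial inner product on the 8 cores of a compute card */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; i++)
        local += (double)x[i] * (double)y[i];

    /* one Allreduce across all MPI ranks (compute cards) */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```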
23 
 Cross section bandwidth measurement. 
 Sectioned in Z plane. 
 Aggregated bandwidth in X plane, 
 MPI multirail striping of messages 
across all the fabric nodes within a 
compute card. 
 Aggregated in Y plane, distributed. 
 2 pairs (green and purple) cross 
Z-section without congestion on the 
links (orange). 
 Links still not saturated. 
 8 X planes * 8 Y planes * 4 pairs * 1500 Mbit/s bidir per ASIC [measured] = 384,000 Mb/s = 48 GB/s. 
 Measured 43.5GB/s (90.6% network bandwidth utilization) using only 1 core per compute card. 
MICRO-BENCHMARKS 
NETWORK PERFORMANCE 
[Diagram: fabric nodes in X planes 0-7 and Y planes 0-7 with the Z-section highlighted; compute cards attach to the fabric nodes, and traffic pairs cross the Z-section]
24 
MICRO-BENCHMARKS 
STORAGE PERFORMANCE 
 Sustained writes (OS caching not leveraged) 
 1 Vdisk, SATA HDD 7.2k rpms, 64MB cache 
For checkpointing: 
‒ Iozone sustained writes 45MB/s, 2GB file, 1MB record length. 
 64 Vdisks concurrently, 1 Vdisk per compute card 
‒ Iozone sustained writes 2.88GB/s entire chassis, local file systems. 
64 x 2GB files , 1MB record length. 
 Depending on the configuration, the same disks can reach up to 95MB/s sustained 
writes per compute card: 95MB/s x 64 disks ≈ 6 GB/s for the entire chassis.
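For illustration, a hedged sketch of the checkpoint write pattern these numbers describe: each core dumps its 200x200x200 single-precision block (~30 MB, as sized later) sequentially to a file on the card-local file system. The path and file naming scheme are assumptions:

```c
/* Hedged sketch: per-core checkpoint of a 200^3 single-precision block
 * (~30 MB) to the card-local file system, matching the Iozone-style
 * sustained sequential writes measured above. Path/name are assumptions. */
#include <stdio.h>

int write_checkpoint(const float *x, int rank, int step)
{
    const size_t n = 200UL * 200UL * 200UL;           /* points per core    */
    char path[256];
    snprintf(path, sizeof(path), "/local/ckpt_r%05d_s%06d.bin", rank, step);

    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(x, sizeof(float), n, f);  /* ~30 MB sequential  */
    fclose(f);
    return written == n ? 0 : -1;
}
```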
25 
AGENDA 
HW overview 
‒Chassis, compute/storage/management cards, fabric 
SW stack description 
‒OS, Virtualization, MPI, File system 
Micro-benchmarks 
‒CPU, memory, network, storage 
Application 
‒Equations, computation, communication, check-pointing, scalability.
26 
 Equations 
‒Navier-Stokes, wave, heat/mass transfer, ... 
 Discretization of 3D Laplace’s equation 
∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z² = 0 
‒ 8th order central difference scheme 
Stencil point:                      W4       W3      W2     W1    P        E1    E2     E3      E4
Offset from P:                      −4       −3      −2     −1    0        +1    +2     +3      +4
Coefficient (2nd derivative,
8th-order accuracy):                −1/560   8/315   −1/5   8/5   −205/72  8/5   −1/5   8/315   −1/560
APPLICATION 
EQUATIONS AND HIGH ORDER SCHEMES
27 
 Derived Equation for unknown at position P x(i,j,k), 25 point stencil 
W4(i,j,k)·x(i−4,j,k) + W3(i,j,k)·x(i−3,j,k) + W2(i,j,k)·x(i−2,j,k) + W1(i,j,k)·x(i−1,j,k) +
+ E4(i,j,k)·x(i+4,j,k) + E3(i,j,k)·x(i+3,j,k) + E2(i,j,k)·x(i+2,j,k) + E1(i,j,k)·x(i+1,j,k) +
+ S4(i,j,k)·x(i,j−4,k) + S3(i,j,k)·x(i,j−3,k) + S2(i,j,k)·x(i,j−2,k) + S1(i,j,k)·x(i,j−1,k) +
+ N4(i,j,k)·x(i,j+4,k) + N3(i,j,k)·x(i,j+3,k) + N2(i,j,k)·x(i,j+2,k) + N1(i,j,k)·x(i,j+1,k) +
+ B4(i,j,k)·x(i,j,k−4) + B3(i,j,k)·x(i,j,k−3) + B2(i,j,k)·x(i,j,k−2) + B1(i,j,k)·x(i,j,k−1) +
+ T4(i,j,k)·x(i,j,k+4) + T3(i,j,k)·x(i,j,k+3) + T2(i,j,k)·x(i,j,k+2) + T1(i,j,k)·x(i,j,k+1) +
+ P(i,j,k)·x(i,j,k) = 0 
 The coefficients express how strong the coupling is in the vicinity of P: 
25 coefficients, 25 multiplies, 25 adds per equation ... lots of FMA opportunities (see the code sketch below). 
 Linear system of equations to map the domain. 
A*x=b, 
A, square sparse matrix of coefficients (25 diagonals), 
x, vector of unknowns, 
b, vector of boundary conditions 
[Diagram: 25-point stencil around point P, with x, y, z axes]
APPLICATION 
DISCRETIZATION
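To make the 25-point stencil concrete, here is a hedged sketch of one y = A·x sweep over a per-core block (the basic operation of an iterative solver for A·x = b). The array layout, halo padding and coefficient naming (cW, cE, ..., cP) are illustrative assumptions, not the author's implementation:

```c
/* Hedged sketch: apply the 25-point, 8th-order stencil y = A*x on one
 * per-core block. Halo cells (4 wide) are assumed to be already filled by
 * the exchange described later; layout and names are assumptions. */
#define H   4                 /* halo width = half-order of the scheme      */
#define NX  200
#define NY  200
#define NZ  200
#define IDX(i,j,k) (((long)((k)+H)*(NY+2*H) + ((j)+H))*(NX+2*H) + ((i)+H))

void stencil_apply(const float *x, float *y,
                   const float *cW[H], const float *cE[H],
                   const float *cS[H], const float *cN[H],
                   const float *cB[H], const float *cT[H],
                   const float *cP)
{
    for (int k = 0; k < NZ; k++)          /* OpenMP threading over k goes here */
    for (int j = 0; j < NY; j++)
    for (int i = 0; i < NX; i++) {
        long p = IDX(i, j, k);                     /* padded index into x      */
        long c = ((long)k*NY + j)*NX + i;          /* unpadded coefficient idx */
        float acc = cP[c] * x[p];
        for (int d = 1; d <= H; d++) {             /* 4 neighbors on each side */
            acc += cW[d-1][c]*x[IDX(i-d,j,k)] + cE[d-1][c]*x[IDX(i+d,j,k)]
                 + cS[d-1][c]*x[IDX(i,j-d,k)] + cN[d-1][c]*x[IDX(i,j+d,k)]
                 + cB[d-1][c]*x[IDX(i,j,k-d)] + cT[d-1][c]*x[IDX(i,j,k+d)];
        }
        y[c] = acc;     /* 25 coefficients, 25 multiplies, 25 adds: FMA-heavy */
    }
}
```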
28 
 Typically several linear systems are coupled, depending on the complexity of the phenomenology (CFD usually no fewer 
than 5 to 7: U, V, W, P, T, k, e). 
 Compute card: up to 64GB, 8 cores; up to 8GB/core; up to 1GB per linear system per core. 
 25 coef matrix (25 vectors) + unknown (x) + right hand side (b) + residual vector (r) + auxiliary vectors (t) ~ 
30 vectors (in 3D). 
 1 single-precision (SP) float is 4 bytes. 
(1GB * (1 SPF / 4B) / 30 vectors)^(1/3) ≈ 200 points in each direction per core. 
 Each core can crunch 8 linear systems one after another with a volume of 200x200x200 points 
 Each core exchanges halos (data needed for computation but computed on neighbor cores) with a width of 
4 points with its neighbors (6: West, East, South, North, Bottom, Top) for 3D partitioning. 
 200 x 200 x 4 halo points (SPF) * (4B/SPF) ≈ 0.61MB exchanged with each of the 6 neighbors per linear 
system for every computation of 200x200x200 points. 
 200x200x200 points * (4B/SPF) ≈ 30MB to checkpoint; recall that the HDDs have a 64MB cache (this sizing arithmetic is reproduced in the sketch below). 
APPLICATION 
COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION
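A minimal sketch (constants taken from this slide) that reproduces the per-core sizing arithmetic: points per direction, halo message size per neighbor, and checkpoint size. The binary-megabyte convention is an assumption made to match the 0.61MB and 30MB figures above:

```c
/* Hedged sketch: reproduces the per-core sizing arithmetic on this slide. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double mem_per_system = 1e9;  /* 1 GB per linear system per core  */
    const double vectors        = 30.0; /* ~30 SP vectors per linear system */
    const double bytes_per_sp   = 4.0;
    const int    halo           = 4;    /* 8th-order scheme: 4-point halo   */
    const double MiB            = 1048576.0;

    double points_per_dir = cbrt(mem_per_system / (vectors * bytes_per_sp));
    double n = 200.0;                                      /* rounded figure */

    printf("points per direction  : %.0f (~200)\n", points_per_dir);
    printf("halo msg per neighbor : %.2f MB\n", n * n * halo * bytes_per_sp / MiB);
    printf("checkpoint per core   : %.1f MB\n", n * n * n * bytes_per_sp / MiB);
    return 0;
}
```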
29 
 Advantages when using high order schemes: 
‒ Reduction of grid at higher order (2nd,4th,8th ) for same accuracy. 
‒ Higher Flop/byte and Flop/W ratios at higher order, due to better utilization of 
vector instructions (implementation dependent; otherwise the stencil is extremely memory bound). 
‒Better network bandwidth utilization due to larger message size of halo at higher order 
scheme. 
‒ Tradeoff: Higher communication volume for higher order scheme 
‒Can leverage multirail (MPI over multiple fabric nodes as shown on micro benchmarks) 
for neighboring communications. 
‒Larger messages provide more chances to overlap communication with computation. 
‒More network latency tolerant. 
APPLICATION 
IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO
30 
APPLICATION 
P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE 
[Diagram: subdomain with its West/East, North/South and Top/Bottom halo faces highlighted]
 8 cores per compute card. 
 Multithreaded computation with OpenMP threads. 
 Threading only over the k loop of the (i,j,k) loops 
 In the general case, 6 halo exchanges (gray areas) 
with neighbor compute cards 
 Halo (message size) of 200x200x4; a sketch of the exchange follows below 
 Best HW mapping: 1x8x8 partitions 
‒ No partition across X fabric nodes 
‒ 8 partitions across Y fabric nodes 
‒ 8 partitions across Z fabric nodes 
 Best algorithmic mapping: 4x4x4 
‒ Less exchange than 1x8x8: 
4 faces * N*(N/8) = (4/8)·N² per card vs. 6 faces * (N/4)*(N/4) = (3/8)·N² per card
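As referenced in the list above, a hedged sketch of the six-neighbor halo exchange using non-blocking MPI calls on a 3D Cartesian communicator. Buffer packing, the communicator setup and the use of MPI_Cart_shift are illustrative assumptions, not the author's implementation:

```c
/* Hedged sketch: non-blocking halo exchange with up to 6 neighbors on a
 * 3D Cartesian communicator. sendbuf[d]/recvbuf[d] are assumed to hold the
 * packed 200x200x4 faces (~0.61 MB each); MPI_PROC_NULL neighbors returned
 * by MPI_Cart_shift are skipped automatically by MPI. */
#include <mpi.h>

void halo_exchange(float *sendbuf[6], float *recvbuf[6], int count,
                   MPI_Comm cart)
{
    MPI_Request req[12];
    int nreq = 0;

    for (int dim = 0; dim < 3; dim++) {            /* X, Y, Z directions     */
        int lo, hi;                                /* neighbor ranks         */
        MPI_Cart_shift(cart, dim, 1, &lo, &hi);

        MPI_Irecv(recvbuf[2*dim  ], count, MPI_FLOAT, lo, 2*dim+1, cart, &req[nreq++]);
        MPI_Irecv(recvbuf[2*dim+1], count, MPI_FLOAT, hi, 2*dim,   cart, &req[nreq++]);
        MPI_Isend(sendbuf[2*dim  ], count, MPI_FLOAT, lo, 2*dim,   cart, &req[nreq++]);
        MPI_Isend(sendbuf[2*dim+1], count, MPI_FLOAT, hi, 2*dim+1, cart, &req[nreq++]);
    }
    /* interior computation could be overlapped here before waiting */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}
```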
31 
APPLICATION 
ITERATIVE ALGORITHM 
[Flow chart: each core (Core 1, Core 2, ... Core 512 for the full chassis) repeatedly solves linear systems 1 through 8, 
with bidirectional exchange of halos with the neighbor domains hosted by other cores/processors/compute cards 
and checkpointing of values; after each sweep the overall convergence is tested, and the cycle either repeats 
("not yet") or ends ("yes"). A code sketch of this loop follows.]
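A hedged sketch of the control flow in the chart above. The functions called are placeholders standing for the steps on this and the preceding slides (inner solves, halo exchange, checkpointing), not an actual API:

```c
/* Hedged sketch of the outer loop in the flow chart above. The functions
 * below are placeholders for steps shown on this and previous slides. */
#include <mpi.h>

void   solve_system(int s);           /* inner iterative solve of system s   */
void   exchange_halos(int s);         /* 6-neighbor bidirectional exchange   */
void   checkpoint_all(int iter);      /* ~30 MB per core to the local vdisk  */
double residual_norm(int s);          /* local residual of system s          */

void outer_iteration(int nsystems, double tol, MPI_Comm comm)
{
    int converged = 0;
    for (int iter = 0; !converged; iter++) {
        double local_res = 0.0, global_res = 0.0;

        for (int s = 0; s < nsystems; s++) {   /* typically 8 coupled systems */
            solve_system(s);
            exchange_halos(s);
            local_res += residual_norm(s);
        }
        checkpoint_all(iter);

        /* overall convergence is decided collectively, as in the flow chart */
        MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE, MPI_SUM, comm);
        converged = (global_res < tol);
    }
}
```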
32 
 Compute card with 1 CPU = 1 NUMA node. 
 No chance of NUMA misses. 
 Easy to leverage openMP within MPI code without having to worry about remote memory 
accesses. 
 Hybrid MPI+openMP for reduction of communication overhead of MPI over Ethernet. 
 3 Compute units can max-out memory controller bandwidth. (plenty of computing capability) 
 1 Compute unit/core could be dedicated to I/O (MPI + check-pointing) to fully overlap with 
computation stages. 
 Single core per CPU for MPI communications: 
‒ Aggregating halo data for all threads to send more data per message. 
‒ leveraging MPI non blocking communications for halo exchange. 
‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth 
(0.6MB/message) 
‒ Hybrid reduction for inner products using Allreduce communication + openMP reduction. 
APPLICATION 
PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION
33 
 Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision 
 (no cheating with weak scaling) 
 Starting with 8 compute cards (1 z plane), ~55GB per card (64GB available per card), scaling 
all the way to 64 cards (8 z planes), ~1GB per core, for the 512 cores in the chassis. 
APPLICATION 
PERFORMANCE SUMMARY 
# compute   Computation    Speed-up      Efficiency    Comm overhead wrt total time
cards       (Mcells/sec)   wrt 8 cards   wrt 8 cards   halo exchange    reduction
8           273            8             100%          5.5%             6%
16          536            15.7          98.1%         5.5%             7%
32          1065           31.2          97.5%         5.7%             8%
64          2048           60.0          93.7%         5.8%             11%
[Chart: 3DFD speed-up on SeaMicro vs. number of compute cards (8, 16, 32, 64); measured speed-up plotted against the ideal.]
No change in total halo volume exchanged as the card count increases: the halo-exchange 
overhead stays constant. The reduction overhead grows with card count, as expected 
from the Allreduce micro-benchmark.
34 
CONCLUSIONS 
 Proven suitability (i.e. scalability) for 3D finite difference stencil computations, 
leveraging latest software programming paradigms (MPI + openMP). 
‒This is a proxy for many other High Performance Computing applications with 
similar computational requirements (eg. Manufacturing, Oil and Gas, 
Weather...) 
 System Advantages: 
‒High computing density (performance and performance per Watt) in 10U form 
factor 
‒Per compute/storage card 
‒Scalability provided through Seamicro fabric 
‒High flexibility in compute, network, storage configurations adjusted to your 
workload requirements as demonstrated in this application.
35 
DISCLAIMER & ATTRIBUTION 
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. 
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap 
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software 
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD 
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of 
such revisions or changes. 
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY 
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. 
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE 
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION 
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 
ATTRIBUTION 
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, 
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names 
are for informational purposes only and may be trademarks of their respective owners.
