PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS
ON SEAMICRO FABRIC COMPUTE SYSTEMS
JOSHUA MORA
ABSTRACT
• Seamicro fabric compute systems offer an array of low-power compute nodes interconnected by a 3D torus network fabric (branded Freedom Supercomputer Fabric).
• This network topology allows very efficient point-to-point communications in which only neighboring compute nodes are involved.
• This communication pattern arises in a wide variety of distributed memory applications, such as 3D Finite Difference computational stencils, which are present in many computationally expensive scientific applications (e.g., seismic imaging, computational fluid dynamics).
• We present the performance analysis (computation, communication, scalability) of a generic 3D Finite Difference computational stencil on such a system.
• We aim to demonstrate with this analysis the suitability of Seamicro fabric compute systems for HPC applications that exhibit this communication pattern.

AGENDA
HW overview
‒Chassis, compute/storage cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.

HW OVERVIEW
CHASSIS: FRONT AND BACK VIEWS

[Photos: front and back views of the chassis]
HW OVERVIEW
CHASSIS: SIDE VIEW

[Photo: side view of the chassis]
• Total of 8 storage cards x 8 drives each, plugged at the front
• Total of 4 quadrants x 16 compute cards, plugged at both sides
HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8

• AMD Opteron™ 4365EE processor, up to 64GB RAM @ 1333MHz
[Block diagram: CPU and RAM attached through the PCI chipset to fabric nodes FB 1-FB 8]
HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8

• AMD Opteron™ 4365EE processor
• 8 "Piledriver" cores, AVX, FMA3/4
• 2.0GHz core frequency
• Max Turbo core frequency up to 3.1GHz
• 40W TDP
• 2 memory channels
[CPU block diagram: cores 0-7 with shared L2 caches, L3 cache, Northbridge with DRAM controller, HT PHY and nCHT links]
HW OVERVIEW
STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS

• Support for RAID and non-RAID
• 8 HDD 2.5” 7.2k-15k rpm, 500GB-1TB, or 8 SSD drives, 80GB-2TB
• System can operate without disks
[Block diagram: CPU, RAM and SATA controller attached through the PCI chipset to fabric nodes FB 1-FB 8 and disks 1-8]
HW DESCRIPTION/OVERVIEW
MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE

• 2 x 10Gb Ethernet Module
‒ External ports: 2 Mini SAS, 2 x 10GbE SFP+
‒ External port bandwidth: 20 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
• 8 x 1Gb Ethernet Module
‒ External ports: 2 Mini SAS, 8 x 1GbE 1000BaseT
‒ External port bandwidth: 8 Gbps full duplex
‒ Internal bandwidth to fabric: 32 Gbps full duplex
HW OVERVIEW
FABRIC TOPOLOGY: 3D TORUS

• 3D torus network fabric
• 8 x 8 x 8 fabric nodes
• Diameter (max hops): 4 + 4 + 4 = 12
• Theoretical cross-section bandwidth = 2 (periodic) x 8 x 8 (section) x 2 (bidir) x 2.0Gbps/link = 512Gb/s
• Compute, storage and management cards are plugged into the network fabric.
• Support for hot-plugged compute cards.

AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.

SW STACK DESCRIPTION
• Overall system management
‒ Command line interface
• Nothing at all custom is required for installation
‒ OS support: Linux (RH, SLES, CentOS, Ubuntu), Windows®
‒ Virtualization: VMware, Xen, KVM, HyperV
‒ Network SW stack: everything that runs on top of Ethernet HW
‒ File systems: local, shared, parallel
‒ Distributed memory programming: MPI, UPC, …
AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.

MICRO-BENCHMARKS
CPU, POWER

• Benchmark: HPL, leveraging FMA4/FMA3
• Single compute card
‒ 2.0GHz * 4 CUs * 8 DP FLOPs/clk/CU * 0.83 efficiency = 53 DP GFLOPs/sec per compute card
‒ 40W TDP processor, 60W per compute card running HPL

=========================================================================
T/V          N      NB   P   Q      Time      Gflops
-------------------------------------------------------------------------
WR01L2L4     40000  100  2   4      795.23    5.366e+01   (83% efficiency)
-------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0033267 ...... PASSED
=========================================================================

• Chassis with 64 compute cards
‒ 2.95 DP TFLOPs/sec per chassis
‒ 72% HPL efficiency per chassis (MPI over Ethernet)
‒ 5.6kW for the full chassis running HPL (including power for storage, network fabric and fans)
MICRO-BENCHMARKS
MEMORY, POWER

• Benchmark: STREAM
• Single compute card @ 1333MHz memory frequency: ~15GB/s

Function    Best Rate MB/s   Avg time   Min time   Max time
Copy:              14647.1   0.181456   0.181333   0.181679
Scale:             15221.7   0.175883   0.175615   0.176168
Add:               14741.2   0.270838   0.270557   0.271005
Triad:             15105.2   0.269939   0.269585   0.270251

‒ Power: 15 W idle per card
‒ Power: 30 W running STREAM per card
• STREAM for the chassis
‒ Chassis with 64 compute cards: 960 GB/s (~1TB/s) per chassis
‒ 4.9kW for the full chassis running STREAM (including power for storage, fabric and fans)
MICRO-BENCHMARKS
NIC TUNING

• Ethernet related tuning:
‒ Ethernet driver 8.0.35-NAPI
‒ InterruptThrottleRate 1,1,1,1,1,1,1,1 in e1000.conf (driver options)
‒ MTU 9000 (ifconfig)
• Balance interrupts from the fabric nodes across different cores.
• MPI TCP tuning
‒ -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
‒ -mca btl_tcp_eager_limit 1mb (default is 64kb)
• UPC tuning
‒ Uses UDP instead of MPI+TCP
MICRO-BENCHMARKS
NIC TUNING

• MPI related tuning:
• 8 Ethernet networks, one per fabric node, across all 64 compute cards.
• OpenMPI over Ethernet (TCP) defaults to using all networks. This can be restricted with arguments passed to the mpirun command or in the openmpi.conf file:
‒ -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
• Point-to-point communications (a minimal ping-pong sketch is shown below):
‒ Latency: 30-36 usec
‒ Bandwidth: linear scaling from 1 to 8 fabric nodes
‒ 1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional
‒ 8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
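These point-to-point figures come from standard MPI micro-benchmarks. A minimal ping-pong sketch of such a measurement in C is shown below; the 1MB message size and repetition count are illustrative, and which fabric nodes are actually used depends on the btl_tcp_if_include setting above.

```c
/* Minimal MPI ping-pong sketch: half round-trip latency and unidirectional
 * bandwidth between rank 0 and rank 1, assumed to be placed on two
 * different compute cards. Message size and repetitions are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps  = 1000;
    const int bytes = 1 << 20;                      /* 1MB messages */
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / reps;              /* seconds per round trip */
        printf("half round-trip: %.1f usec, unidirectional bw: %.1f MB/s\n",
               0.5e6 * rtt, (bytes / (0.5 * rtt)) / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```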
MICRO-BENCHMARKS
NETWORK PERFORMANCE

[Diagram: two compute cards, each with CPU, RAM and PCI chipset attached to fabric nodes FB 1-FB 8, connected through the fabric]
• Point-to-point benchmark setup.
• Measure the aggregated bandwidth of 1 CPU core over 1, 2, 4 and 8 fabric nodes between any 2 compute cards in the chassis.
MICRO-BENCHMARKS
NETWORK PERFORMANCE

[Charts: unidirectional and bidirectional MPI bandwidth vs. message size (1 byte to 1MB) for 1 (eth0), 2 (eth0-eth1), 4 (eth0-eth3) and 8 (eth0-eth7) fabric nodes. Peaks: 120 MB/s unidirectional and 195 MB/s bidirectional on 1 fabric node; 960 MB/s unidirectional and 1500 MB/s bidirectional on 8 fabric nodes.]
MICRO-BENCHMARKS
NETWORK PERFORMANCE

[Diagram: two compute cards; cores c0, c2, c4, c6 on one card paired through fabric nodes FB 1-FB 8 with cores on the other card]
• Message rate setup:
‒ Every other core (c0, c2, c4, c6) sends messages through each fabric node to another core on another compute card.
‒ 4 pairs of MPI processes send data striped across the 8 fabric nodes until the bandwidth of the fabric is maxed out.
MICRO-BENCHMARKS
NETWORK PERFORMANCE

• 4KB message rate scalability with 1, 2, 4 and 8 fabric nodes
• Maxing out network bandwidth
[Chart: 4KB MPI message rate vs. number of fabric nodes per compute card (1-8) for 1, 2 and 4 MPI pairs; peaks at 120K, 160K and 240K 4KB messages/second respectively.]
MICRO-BENCHMARKS
NETWORK PERFORMANCE

• Allreduce setup, for inner products: $\langle x, y \rangle = \sum_{i=1}^{64} x_i \, y_i$
• Models well with a binary tree algorithm: ~30usec * log2(64 cards) = 180usec
• Note: in the application described later, each $x_i \, y_i$ is a multithreaded (OpenMP) inner product + reduction, followed by MPI_Allreduce (a minimal sketch is shown below).

# compute cards    Elapsed time (usec), MPI reduce + MPI broadcast
2                   25.97
4                   54.41
8                   82.59
16                 110.31
32                 138.66
64 (chassis)       170.02
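A minimal sketch of this hybrid inner product in C (OpenMP reduction over the card-local portion of the vectors, then one MPI_Allreduce across the 64 compute cards; the function name and the double accumulator are illustrative choices, not taken from the original code):

```c
/* Hybrid inner product <x,y>: OpenMP reduction over the local partition,
 * followed by MPI_Allreduce to combine the per-card partial sums. */
#include <mpi.h>

double inner_product(const float *x, const float *y, long n)
{
    double local = 0.0;

    /* Thread-level reduction over this compute card's n local points. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; i++)
        local += (double)x[i] * (double)y[i];

    /* One small collective across all MPI ranks (compute cards). */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```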
MICRO-BENCHMARKS
NETWORK PERFORMANCE

• Cross-section bandwidth measurement.
• MPI multirail, striping messages across all the fabric nodes within a compute card.
• Bandwidth aggregated in the X planes and Y planes, sectioned in the Z plane.
• 2 pairs (green and purple) cross the Z-section without congestion on the links (orange).
[Diagram: 8x8x8 torus drawn as X planes 0-7 and Y planes 0-7 of fabric nodes crossing the Z-section]
• Links are still not saturated.
• 8 X planes * 8 Y planes * 4 pairs * 1500 Mbit/s bidir per ASIC [measured] = 384000 Mb/s = 48 GB/s.
• Measured 43.5 GB/s (90.6% network bandwidth utilization) using only 1 core per compute card.
MICRO-BENCHMARKS
STORAGE PERFORMANCE

• Sustained writes (OS caching not leveraged)
• 1 Vdisk, SATA HDD 7.2k rpm, 64MB cache. For checkpointing:
‒ Iozone sustained writes of 45MB/s, 2GB file, 1MB record length.
• 64 Vdisks concurrently, 1 Vdisk per compute card
‒ Iozone sustained writes of 2.88GB/s for the entire chassis, local file systems, 64 x 2GB files, 1MB record length.
• Depending on the configuration, the same disks can reach up to 95MB/s sustained writes per compute card: 95MB/s x 64 disks = 6 GB/s for the entire chassis.
(A minimal checkpoint-write sketch is shown below.)
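Check-pointing in the application amounts to each compute card streaming its local fields (roughly 30MB per linear system, see the application section) to its local Vdisk. A minimal sketch of such a sequential write in C, using a 1MB record length to match the Iozone runs above; the function name and file layout are illustrative:

```c
/* Minimal checkpoint write sketch: stream a local field to the card's
 * Vdisk in 1MB records (matching the Iozone record length above). */
#include <stdio.h>
#include <stdlib.h>

int write_checkpoint(const char *path, const float *field, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    const size_t record = (1 << 20) / sizeof(float);   /* 1MB records */
    for (size_t i = 0; i < count; i += record) {
        size_t chunk = (count - i < record) ? (count - i) : record;
        if (fwrite(field + i, sizeof(float), chunk, f) != chunk) {
            fclose(f);
            return -1;
        }
    }
    fclose(f);
    return 0;
}
```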
AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.

APPLICATION
EQUATIONS AND HIGH ORDER SCHEMES

• Equations
‒ Navier-Stokes, wave, heat-mass transfer, …
• Discretization of the 3D Laplace equation
$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2} = 0$$
‒ 8th order central difference scheme (2nd derivative, 8th order accuracy):

Point:        W4      W3     W2    W1    P        E1    E2    E3     E4
Offset:       -4      -3     -2    -1    0        +1    +2    +3     +4
Coefficient:  -1/560  8/315  -1/5  8/5   -205/72  8/5   -1/5  8/315  -1/560
APPLICATION
DISCRETIZATION

• Derived equation for the unknown at position P, x(i,j,k): a 25-point stencil
$$\sum_{d=1}^{4} \Big[ W_d\, x_{i-d,j,k} + E_d\, x_{i+d,j,k} + S_d\, x_{i,j-d,k} + N_d\, x_{i,j+d,k} + B_d\, x_{i,j,k-d} + T_d\, x_{i,j,k+d} \Big] + P\, x_{i,j,k} = 0$$
where all coefficients $W_d, E_d, S_d, N_d, B_d, T_d$ and $P$ are functions of $(i,j,k)$.
• The coefficients express how strong the coupling is in the vicinity of P: 25 coefficients, 25 multiplies and 25 adds per equation, i.e. lots of FMAs.
• Linear system of equations to map the domain: A*x = b, where A is a square sparse matrix of coefficients (25 diagonals), x is the vector of unknowns and b is the vector of boundary conditions.
(A minimal sketch of applying this stencil is shown below.)
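A minimal C sketch of evaluating this 25-point stencil at one interior point, i.e. one row of the A*x product; the coefficient struct layout, the IDX indexing macro and the array names are illustrative, and a production kernel would also block loops and vectorize to feed the FMA units:

```c
/* 25-point, 8th-order stencil: y(i,j,k) = sum over the 25 diagonals of
 * coefficient(i,j,k) * x(neighbor). One coefficient array per diagonal. */
#include <stddef.h>

#define IDX(i, j, k, nx, ny) ((size_t)(k) * (nx) * (ny) + (size_t)(j) * (nx) + (i))

typedef struct {
    const float *W[4], *E[4];   /* x-direction neighbors, offsets 1..4 */
    const float *S[4], *N[4];   /* y-direction neighbors, offsets 1..4 */
    const float *B[4], *T[4];   /* z-direction neighbors, offsets 1..4 */
    const float *P;             /* central point                       */
} Coefs;

static float apply_stencil(const Coefs *c, const float *x,
                           int i, int j, int k, int nx, int ny)
{
    size_t p = IDX(i, j, k, nx, ny);
    float y = c->P[p] * x[p];

    for (int d = 1; d <= 4; d++) {
        y += c->W[d-1][p] * x[IDX(i - d, j, k, nx, ny)]
           + c->E[d-1][p] * x[IDX(i + d, j, k, nx, ny)]
           + c->S[d-1][p] * x[IDX(i, j - d, k, nx, ny)]
           + c->N[d-1][p] * x[IDX(i, j + d, k, nx, ny)]
           + c->B[d-1][p] * x[IDX(i, j, k - d, nx, ny)]
           + c->T[d-1][p] * x[IDX(i, j, k + d, nx, ny)];
    }
    return y;   /* 25 multiplies and adds: good FMA and AVX material */
}
```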
APPLICATION
COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION

• Typically several coupled linear systems, depending on the complexity of the phenomenology (CFD usually no fewer than 5 to 7: U, V, W, P, T, k, e).
• Compute card with up to 64GB and 8 cores: up to 8GB/core, up to 1GB per linear system per core.
• 25-coefficient matrix (25 vectors) + unknown (x) + right hand side (b) + residual vector (r) + auxiliary vectors (t) ~ 30 vectors (in 3D).
• 1 single precision (SP) float is 4 bytes.
• $\sqrt[3]{1\,\mathrm{GB} \cdot (1\,\mathrm{SPF}/4\,\mathrm{B}) / 30\ \mathrm{vectors}} \approx 200$ points in each direction per core.
• Each core can crunch 8 linear systems one after another, each with a volume of 200x200x200 points.
• Each core exchanges halos (data needed for computation but computed on neighbor cores) with a width of 4 points with its 6 neighbors (West, East, South, North, Bottom, Top) for 3D partitioning.
• 200 x 200 x 4 halo points * (4B/SPF) = 0.61MB exchanged with each of the 6 neighbors per linear system at every computation of 200x200x200 points.
• 200x200x200 * (4B/SPF) = ~30MB to checkpoint; remember that the HDDs have a 64MB cache.
(The sizing arithmetic above is sketched in code below.)
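The sizing arithmetic above can be reproduced with a few lines of C; this is a sketch where the ~1GB budget per linear system per core and the 30-vector count come straight from the bullets above, and the printed values differ slightly from the rounded 200 / 0.61MB / 30MB figures on the slide.

```c
/* Worked sizing arithmetic: ~30 single-precision vectors per linear system,
 * ~1GB budget per linear system per core. Compile with -lm for cbrt(). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double budget   = 1.0e9;     /* ~1GB per linear system per core   */
    const double vectors  = 30.0;      /* coef + x + b + r + aux vectors    */
    const double bytes_sp = 4.0;       /* single-precision float            */

    double points = budget / (vectors * bytes_sp);   /* points per core     */
    double n      = cbrt(points);                    /* points per direction*/
    double halo   = n * n * 4 * bytes_sp;            /* one face, 4 deep    */
    double ckpt   = n * n * n * bytes_sp;            /* one field           */

    printf("points per direction ~ %.0f\n", n);            /* ~203, slide rounds to 200   */
    printf("halo per face        ~ %.2f MB\n", halo / 1e6); /* ~0.66, slide rounds to 0.61 */
    printf("checkpoint field     ~ %.0f MB\n", ckpt / 1e6); /* ~33, slide rounds to 30     */
    return 0;
}
```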
APPLICATION
IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO

• Advantages of using high order schemes:
‒ Reduction of grid size at higher order (2nd, 4th, 8th) for the same accuracy.
‒ Higher FLOP/byte and FLOP/W ratios at higher order, due to better utilization of vector instructions (implementation dependent; otherwise the stencil is extremely memory bound).
‒ Better network bandwidth utilization due to the larger halo message size at higher order.
‒ Tradeoff: higher communication volume for the higher order scheme.
‒ Can leverage multirail (MPI over multiple fabric nodes, as shown in the micro-benchmarks) for neighbor communications.
‒ Larger messages provide more chances to overlap communication with computation.
‒ More tolerant of network latency.
APPLICATION
P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE

• 8 cores per compute card.
• Multithreaded computation with OpenMP threads.
• Threading only in the k loop of the (i,j,k) loops.
• In the general case, 6 halo exchanges (gray areas) with neighbor compute cards.
[Diagram: a subdomain and its six halo faces: West, East, South, North, Bottom, Top]
• Halo (message size) of 200x200x4.
• Best HW mapping: 1x8x8 partitions
‒ No partitioning across X fabric nodes
‒ 8 partitions across Y fabric nodes
‒ 8 partitions across Z fabric nodes
• Best algorithmic mapping: 4x4x4
‒ Less exchange than 1x8x8: 4*(N*N/8) = (4/8)*N^2 for 1x8x8 vs 6*(N/4*N/4) = (3/8)*N^2 for 4x4x4
(A minimal MPI halo-exchange sketch is shown below.)
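A minimal sketch of the halo exchange in C, assuming one MPI rank per compute card, six neighbor ranks already determined from the domain decomposition, and halo faces already packed into contiguous buffers so that each message is the ~0.6MB face described earlier; all names and the tagging scheme are illustrative:

```c
/* Non-blocking halo exchange with the 6 neighbors (W, E, S, N, B, T).
 * Faces are assumed packed into contiguous send/recv buffers; each message
 * can then be striped across the fabric nodes by the MPI multirail layer.
 * Tag convention: send with the face index, receive with the opposite
 * face index (f ^ 1 pairs W<->E, S<->N, B<->T) so messages match up. */
#include <mpi.h>

void exchange_halos(float *send[6], float *recv[6], const int counts[6],
                    const int neighbor[6], MPI_Comm comm)
{
    MPI_Request req[12];
    int nreq = 0;

    for (int f = 0; f < 6; f++) {
        if (neighbor[f] == MPI_PROC_NULL)
            continue;                          /* physical domain boundary */
        MPI_Irecv(recv[f], counts[f], MPI_FLOAT, neighbor[f], f ^ 1, comm, &req[nreq++]);
        MPI_Isend(send[f], counts[f], MPI_FLOAT, neighbor[f], f,     comm, &req[nreq++]);
    }

    /* Interior points can be computed here to overlap with communication. */

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
}
```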
APPLICATION
ITERATIVE ALGORITHM

[Flow diagram: each core (core 1 through core 512 in a full chassis) solves linear systems 1 through 8 in sequence; between solves there is a bidirectional exchange of halos with the neighbor domains hosted by other cores/processors/compute cards; if overall convergence is not yet reached, the linear system values are check-pointed and the iteration repeats, otherwise the run ends.]
(A sketch of this outer loop is shown below.)
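The flow above can be written as the following outer loop; this is a sketch only, and the four stage functions are illustrative placeholders for the boxes in the diagram, not the original code.

```c
/* Outer iteration per MPI rank (one rank per compute card, 8 OpenMP threads
 * inside). The stage functions below are placeholders for the diagram's boxes. */
void solve_linear_system(int s);   /* OpenMP-threaded stencil solve          */
void exchange_halos_for(int s);    /* non-blocking MPI with the 6 neighbors  */
int  global_convergence(void);     /* MPI_Allreduce on the residual norms    */
void write_checkpoint(void);       /* ~30MB per system to the local Vdisk    */

void iterate(int nsystems)
{
    int converged = 0;
    while (!converged) {
        for (int s = 0; s < nsystems; s++) {
            solve_linear_system(s);        /* linear systems 1..8 in sequence */
            exchange_halos_for(s);         /* halo exchange with neighbors    */
        }
        converged = global_convergence();  /* overall convergence test        */
        if (!converged)
            write_checkpoint();            /* checkpoint, then iterate again  */
    }
}
```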
APPLICATION
PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION

• Compute card with 1 CPU = 1 NUMA node.
• No chance of NUMA misses.
• Easy to leverage OpenMP within MPI code without having to worry about remote memory accesses.
• Hybrid MPI+OpenMP reduces the communication overhead of MPI over Ethernet.
• 3 compute units can max out the memory controller bandwidth (plenty of computing capability).
• 1 compute unit/core could be dedicated to I/O (MPI + check-pointing) to fully overlap with the computation stages.
• Single core per CPU for MPI communications:
‒ Aggregating halo data for all threads to send more data per message.
‒ Leveraging MPI non-blocking communications for the halo exchange.
‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth (0.6MB/message).
‒ Hybrid reduction for inner products using Allreduce communication + OpenMP reduction.
APPLICATION
PERFORMANCE SUMMARY

• Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision (no cheating with weak scaling).
• Starting with 8 compute cards (1 z plane), ~55GB per card (64GB available per card), scaling all the way to 64 cards (8 z planes), ~1GB per core, for 512 cores in the chassis.

# compute   Computation    Speed up      Efficiency    Comm overhead wrt total time
cards       (Mcells/sec)   wrt 8 cards   wrt 8 cards   Halo exchange   Reduction
8             273           8.0          100%          5.5%            6%
16            536          15.7          98.1%         5.5%            7%
32           1065          31.2          97.5%         5.7%            8%
64           2048          60.0          93.7%         5.8%            11%

[Chart: 3DFD speed up on Seamicro vs. number of compute cards (8 to 64), measured speed up against ideal.]
• Halo exchange: no change in the total volume exchanged as the card count increases, so the communication overhead stays constant.
• Reduction: the expected increase with card count, as shown in the Allreduce micro-benchmark.
CONCLUSIONS
• Proven suitability (i.e., scalability) for 3D finite difference stencil computations, leveraging the latest software programming paradigms (MPI + OpenMP).
‒ This is a proxy for many other High Performance Computing applications with similar computational requirements (e.g., manufacturing, oil and gas, weather...).
• System advantages:
‒ High computing density (performance and performance per Watt) in a 10U form factor, per compute/storage card.
‒ Scalability provided through the Seamicro fabric.
‒ High flexibility in compute, network and storage configurations, adjusted to your workload requirements, as demonstrated in this application.
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names
are for informational purposes only and may be trademarks of their respective owners.

35

More Related Content

What's hot

Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosAMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacGanesan Narayanasamy
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard HoffnungPG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard HoffnungAMD Developer Central
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersKazuaki Ishizaki
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentationVishal Singh
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wallugur candan
 
GS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauGS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauAMD Developer Central
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 

What's hot (20)

Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Omp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdacOmp tutorial cpugpu_programming_cdac
Omp tutorial cpugpu_programming_cdac
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard HoffnungPG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
CPU vs. GPU presentation
CPU vs. GPU presentationCPU vs. GPU presentation
CPU vs. GPU presentation
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
GS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauGS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-Bilodeau
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 

Similar to CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora

Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Joshua Mora
 
COA Lecture 01(Introduction).pptx
COA Lecture 01(Introduction).pptxCOA Lecture 01(Introduction).pptx
COA Lecture 01(Introduction).pptxsyed rafi
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linuxbrouer
 
End nodes in the Multigigabit era
End nodes in the Multigigabit eraEnd nodes in the Multigigabit era
End nodes in the Multigigabit erarinnocente
 
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...Hideyuki Tanaka
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
VJITSk 6713 user manual
VJITSk 6713 user manualVJITSk 6713 user manual
VJITSk 6713 user manualkot seelam
 
Sandy bridge platform from ttec
Sandy bridge platform from ttecSandy bridge platform from ttec
Sandy bridge platform from ttecTTEC
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerFörderverein Technische Fakultät
 
SBC6020 SAM9G20 based Single Board Computer
SBC6020 SAM9G20 based Single Board ComputerSBC6020 SAM9G20 based Single Board Computer
SBC6020 SAM9G20 based Single Board Computeryclinda666
 
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard Peripherals
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard PeripheralsIntroducing OMAP-L138/AM1808 Processor Architecture and Hawkboard Peripherals
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard PeripheralsPremier Farnell
 
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...Rakuten Group, Inc.
 
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-Module
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-ModuleMYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-Module
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-ModuleLinda Zhang
 

Similar to CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora (20)

Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
 
COA Lecture 01(Introduction).pptx
COA Lecture 01(Introduction).pptxCOA Lecture 01(Introduction).pptx
COA Lecture 01(Introduction).pptx
 
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running LinuxLinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
LinuxCon2009: 10Gbit/s Bi-Directional Routing on standard hardware running Linux
 
End nodes in the Multigigabit era
End nodes in the Multigigabit eraEnd nodes in the Multigigabit era
End nodes in the Multigigabit era
 
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with T...
 
uCluster
uClusteruCluster
uCluster
 
Zynq ultrascale
Zynq ultrascaleZynq ultrascale
Zynq ultrascale
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
VJITSk 6713 user manual
VJITSk 6713 user manualVJITSk 6713 user manual
VJITSk 6713 user manual
 
Practica 2
Practica 2Practica 2
Practica 2
 
Computer components
Computer componentsComputer components
Computer components
 
Sandy bridge platform from ttec
Sandy bridge platform from ttecSandy bridge platform from ttec
Sandy bridge platform from ttec
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
 
SBC6020 SAM9G20 based Single Board Computer
SBC6020 SAM9G20 based Single Board ComputerSBC6020 SAM9G20 based Single Board Computer
SBC6020 SAM9G20 based Single Board Computer
 
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard Peripherals
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard PeripheralsIntroducing OMAP-L138/AM1808 Processor Architecture and Hawkboard Peripherals
Introducing OMAP-L138/AM1808 Processor Architecture and Hawkboard Peripherals
 
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Bi...
 
Computer Maintanance
Computer MaintananceComputer Maintanance
Computer Maintanance
 
Managing hardware assets
Managing hardware assetsManaging hardware assets
Managing hardware assets
 
A42060105
A42060105A42060105
A42060105
 
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-Module
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-ModuleMYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-Module
MYC-Y6ULX CPU Module - NXP i.MX 6UL/6ULL System-on-Module
 

More from AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14AMD Developer Central
 

More from AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

CC-4005, Performance analysis of 3D Finite Difference computational stencils on Seamicro fabric compute systems, by Joshua Mora

  • 1. PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS ON SEAMICRO FABRIC COMPUTE SYSTEMS JOSHUA MORA
  • 2. ABSTRACT  Seamicro fabric compute systems offers an array of low power compute nodes interconnected with a 3D torus network fabric (branded Freedom Supercomputer Fabric).  This specific network topology allows very efficient point to point communications where only your neighbor compute nodes are involved in the communications.  Such type of communication pattern arises in a wide variety of distributed memory applications like in 3D Finite Difference computational stencils, present on many computationally expensive scientific applications (eg. seismic, computational fluid dynamics).  We present the performance analysis (computation, communication, scalability) of a generic 3D Finite Difference computational stencil on such a system.  We aim to demonstrate with this analysis the suitability of Seamicro fabric compute systems for HPC applications that exhibit this communication pattern. 2
  • 3. AGENDA HW overview ‒Chassis, compute/storage cards, fabric SW stack description ‒OS, Virtualization, MPI, File system Micro-benchmarks ‒CPU, memory, network, storage Application ‒Equations, computation, communication, check-pointing, scalability. 3
  • 4. HW OVERVIEW CHASSIS: FRONT AND BACK VIEWS FRONT 4 BACK
  • 5. HW OVERVIEW CHASSIS: SIDE VIEW 5 Total of 8 storage cards x 8 drives each plugged at the front Total of 4 (quadrants) x 16 compute cards plugged at both sides
  • 6. HW OVERVIEW COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8 RAM  AMD OpteronTM 4365EE processor, Up to 64GB RAM @ 1333MHz CPU FB 1 6 FB 2 PCI chipset FB 8
  • 7. HW OVERVIEW Core 6 Core 7  2.0GHz core frequency Core 4 Core 5  8 “Piledriver” cores, AVX, FMA3/4 Core 2 Core 3  AMD OpteronTM 4365EE processor Core 0 Core 1 COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8 L2 L2 L2 L2  40W TDP Northbridge HT PHY  Max Turbo core frequency up to 3.1GHz DRAM CTL L3 cache 2 memory channels 7 nCHT
  • 8. HW OVERVIEW STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS  Support for RAID and non RAID  8 HDD 2.5” 7.2k-15k rpm, 500GB-1TB,  Or 8 SSD drives, 80GB-2TB  System can operate without disks CPU FB 1 Disk1 H 8 FB 2 Disk2 FB 2 RAM FB 1 FB 8 PCI chipset SATA FB 8 Disk8
  • 9. HW DESCRIPTION/OVERVIEW MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE  2 x 10Gb Ethernet Module ‒ External ports ‒ 2 Mini SAS ‒ 2 x 10GbE SFP+ ‒ External Port Bandwidth: ‒ 20 Gbps Full Duplex ‒ Internal Bandwidth to Fabric: ‒ 32 Gbps Full Duplex.  8 x 1Gb Ethernet Module ‒ External ports ‒ 2 Mini SAS ‒ 8x 1GbE 1000BaseT ‒ External Port Bandwidth: ‒ 8 Gbps Full Duplex ‒ Internal Bandwidth to Fabric: 9 ‒ 32 Gbps Full Duplex.
  • 10. HW OVERVIEW FABRIC TOPOLOGY: 3D TORUS  3D torus network fabric  8 x 8 x 8 Fabric nodes  Diameter (max hop) 4 + 4 + 4 = 12  Theor. cross section bandwidth = 2 (periodic) x 8 x 8 (section) x 2(bidir) x 2.0Gbps/link = 512Gb/s  Compute, storage, mgmt cards are plugged into the network fabric.  Support for hot plugged compute cards. 10
  • 11. AGENDA HW overview ‒Chassis, compute/storage/management cards, fabric SW stack description ‒OS, Virtualization, MPI, File system Micro-benchmarks ‒CPU, memory, network, storage Application ‒Equations, computation, communication, check-pointing, scalability. 11
  • 12. SW STACK DESCRIPTION  Overall System Management ‒ Command Line Interface  NOTHING AT ALL CUSTOM FOR INSTALLATION ‒OS support ‒Linux (RH, SLES, CentOS, Ubuntu), Windows® ‒Virtualization ‒VMware, Xen, KVM, HyperV ‒Network SW stack ‒Everything that runs on top of Ethernet HW. ‒File systems ‒Local, shared , parallel. ‒Distributed memory programming ‒MPI, UPC, .. 12
  • 13. AGENDA HW overview ‒Chassis, compute/storage/managment cards, fabric SW stack description ‒OS, Virtualization, MPI, File system Micro-benchmarks ‒CPU, memory, network, storage Application ‒Equations, computation, communication, check-pointing, scalability. 13
  • 14. MICRO-BENCHMARKS CPU, POWER  Benchmark HPL, leveraging FMA4,3  Single Compute card ‒ 2.0GHz*4CUs*8DP FLOPs/clk/CU*0.83efficiency = 53DP GFLOPs/sec per compute card ‒ 40W TDP processor, 60W per compute card running HPL ========================================================================= T/V N NB P Q Time Gflops -------------------------------------------------------------------------------WR01L2L4 40000 100 2 4 795.23 5.366e+01 (83% efficiency) -------------------------------------------------------------------------------||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033267 ...... PASSED =========================================================================  Chassis with 64 compute cards ‒ 2.95 DP Teraflops/sec/ chassis ‒ 72% HPL Efficiency/ chassis (MPI over Ethernet) ‒ 5.6kW full chassis running HPL (including power for storage, network fabric and fans). 14
  • 15. MICRO-BENCHMARKS MEMORY, POWER  Benchmark STREAM  Single Compute card @ 1333MHz memory frequency ‒ 15GB/s Function Best Rate MB/s Avg time Min time Max time Copy: 14647.1 0.181456 0.181333 0.181679 Scale: 15221.7 0.175883 0.175615 0.176168 Add: 14741.2 0.270838 0.270557 0.271005 Triad: 15105.2 0.269939 0.269585 0.270251 ‒ Power 15 W idle per card. ‒ Power 30 W stream per card.  STREAM chassis ‒ Chassis with 64 compute cards. 960 GB/s (~ 1TB/s) per chassis. ‒ 4.9kW full chassis running Stream (including power for storage, fabric and fans). 15
  • 16. MICRO-BENCHMARKS NIC TUNING  Ethernet related tuning: - Ethernet Driver, 8.0.35-NAPI - InterruptThrottleRate 1,1,1,1,1,1,1,1 at e1000.conf (driver options) - MTU 9000 (ifconfig)  Interrupt balance fabric nodes to different cores.  MPI TCP tuning - -mca btl_tcp_if_include eth0,eth2,…eth6,eth7 - -mca btl_tcp_eager_limit 1mb (default is 64kb)  UPC tuning ‒ using UDP instead of MPI+TCP 16
  • 17. MICRO-BENCHMARKS NIC TUNING
     MPI related tuning:
    ‒ 8 Ethernet networks, one per fabric node, across all 64 compute cards.
    ‒ OpenMPI over Ethernet/TCP defaults to using all networks. This can be restricted with arguments passed to the mpirun command or in the openmpi.conf file: -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
     Point-to-point communications:
    ‒ Latency: 30-36 usec
    ‒ Bandwidth: linear scaling from 1 to 8 fabric nodes
    ‒ 1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional
    ‒ 8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
  • 18. MICRO-BENCHMARKS NETWORK PERFORMANCE
     Point-to-point benchmark setup.
     Measure the aggregated bandwidth of 1 CPU core over 1, 2, 4 and 8 fabric nodes, between any 2 compute cards in the chassis.
    [Diagram: two compute cards (CPU, RAM, PCI chipset) connected through their fabric nodes FB 1-FB 8]
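    A minimal sketch of the kind of point-to-point test used here (rank placement on two different compute cards, message size and iteration count are assumptions; striping over the fabric nodes is left to the OpenMPI TCP settings from the previous slides):

    ```c
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Unidirectional bandwidth between rank 0 and rank 1. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int msg = 1 << 20;          /* 1 MB messages, the large-message end of the charts */
        const int iters = 100;
        char *buf = malloc(msg);
        memset(buf, 0, msg);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0)      MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1) MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0)
            printf("unidirectional bandwidth: %.1f MB/s\n", (double)msg * iters / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }
    ```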
  • 19. MICRO-BENCHMARKS NETWORK PERFORMANCE
    [Charts: unidirectional and bidirectional MPI bandwidth (MB/s) vs message size (1 B to 1 MB) for eth0, eth0-1, eth0-eth3 and eth0-eth7]
     Unidirectional bandwidth: 120 MB/s on a single fabric node (eth0), up to 960 MB/s striping across all 8 (eth0-eth7).
     Bidirectional bandwidth: 195 MB/s on a single fabric node, up to 1500 MB/s across all 8.
  • 20. MICRO-BENCHMARKS NETWORK PERFORMANCE
     Message rate setup:
     Every other core (c0, c2, c4, c6) sends messages through each fabric node to another core on another compute card.
     4 pairs of MPI processes send data striped across the 8 fabric nodes until the fabric bandwidth is maxed out.
    [Diagram: cores c0, c2, c4, c6 on one compute card paired with cores on a second card through fabric nodes FB 1-FB 8]
  • 21. MICRO-BENCHMARKS NETWORK PERFORMANCE
     4 KB message rate scalability over 1, 2, 4 and 8 fabric nodes.
     Maxing out network bandwidth.
    [Chart: 4 KB MPI messages/second vs number of fabric nodes per compute card (1-8), for 1, 2 and 4 MPI pairs; peak rates of roughly 120K, 160K and 240K messages/s]
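    A quick consistency check (my arithmetic, not from the slide): the peak 4 KB message rate saturates roughly the same unidirectional bandwidth ceiling measured on slide 19,

    $$ 240{,}000\ \tfrac{\text{msg}}{\text{s}} \times 4\ \tfrac{\text{KB}}{\text{msg}} \approx 960\ \tfrac{\text{MB}}{\text{s}}. $$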
  • 22. MICRO-BENCHMARKS NETWORK PERFORMANCE
     Allreduce setup, for inner products: ⟨x,y⟩ = Σ_{i=1}^{64} x_i ∗ y_i
     Models well with a binary tree algorithm: ~30 usec * log2(64 cards) = 180 usec.
     Note: in the application described later, the x_i ∗ y_i part is a multithreaded (OpenMP) inner product + reduction, followed by MPI_Allreduce.
     Measured elapsed time for MPI reduce + MPI broadcast:

      # compute cards:        2       4       8       16       32       64 (chassis)
      Elapsed time (usec):  25.97   54.41   82.59   110.31   138.66   170.02
  • 23. MICRO-BENCHMARKS NETWORK PERFORMANCE
     Cross-section bandwidth measurement.
     MPI multirail striping of messages across all the fabric nodes within a compute card.
     Bandwidth aggregated in the X plane, aggregated/distributed in the Y plane, sectioned in the Z plane.
     2 pairs (green and purple) cross the Z-section without congestion on the links (orange).
     Links still not saturated.
     8 X planes * 8 Y planes * 4 pairs * 1500 Mbit/s bidirectional per ASIC [measured] = 384,000 Mb/s = 48 GB/s.
     Measured 43.5 GB/s (90.6% network bandwidth utilization) using only 1 core per compute card.
    [Diagram: Z-section across the fabric nodes, X planes 0-7 and Y planes 0-7]
  • 24. MICRO-BENCHMARKS STORAGE PERFORMANCE
     Sustained writes (OS caching not leveraged), for checkpointing.
     1 Vdisk, SATA HDD 7,200 rpm, 64 MB cache
    ‒ Iozone sustained writes: 45 MB/s, 2 GB file, 1 MB record length.
     64 Vdisks concurrently, 1 Vdisk per compute card
    ‒ Iozone sustained writes: 2.88 GB/s for the entire chassis, local file systems, 64 x 2 GB files, 1 MB record length.
     Depending on configuration, the same disks can reach up to 95 MB/s sustained writes per compute card: 95 MB/s x 64 disks = 6 GB/s for the entire chassis.
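    As a rough sketch of the checkpointing pattern these write rates matter for, each rank can dump its ~30 MB field to the card-local Vdisk (the path and function name below are illustrative assumptions, not the application's actual I/O layer):

    ```c
    #include <stdio.h>

    /* Hypothetical checkpoint of one 200x200x200 single-precision field (~30 MB)
     * to the compute card's local Vdisk; the path is an assumption. */
    int checkpoint_field(const float *x, int n, int rank) {
        char path[256];
        snprintf(path, sizeof path, "/local/scratch/ckpt_rank%03d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t count = (size_t)n * n * n;
        size_t written = fwrite(x, sizeof(float), count, f);   /* sequential write, ~45-95 MB/s per card */
        fclose(f);
        return written == count ? 0 : -1;
    }
    ```

    At 45 MB/s per Vdisk a 30 MB checkpoint costs well under a second per card, and with one Vdisk per compute card the 64 writes proceed in parallel.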
  • 25. AGENDA HW overview ‒Chassis, compute/storage/management cards, fabric SW stack description ‒OS, Virtualization, MPI, File system Micro-benchmarks ‒CPU, memory, network, storage Application ‒Equations, computation, communication, check-pointing, scalability. 25
  • 26. APPLICATION EQUATIONS AND HIGH ORDER SCHEMES
     Equations
    ‒ Navier-Stokes, wave, heat-mass transfer, ...
     Discretization of the 3D Laplace equation: ∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z² = 0
    ‒ 8th order, central difference scheme (2nd derivative, accuracy 8):

      Position:    −4        −3       −2      −1      0         +1      +2      +3       +4
      Label:       W4        W3       W2      W1      P         E1      E2      E3       E4
      Weight:    −1/560    8/315    −1/5     8/5    −205/72     8/5    −1/5    8/315   −1/560
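    Written out along one axis (a minimal illustration, assuming a uniform grid spacing h), the tabulated weights give the 8th-order approximation

    $$
    \frac{\partial^2 f}{\partial x^2}\bigg|_{i} \approx \frac{1}{h^2}\left(-\tfrac{1}{560}f_{i-4} + \tfrac{8}{315}f_{i-3} - \tfrac{1}{5}f_{i-2} + \tfrac{8}{5}f_{i-1} - \tfrac{205}{72}f_{i} + \tfrac{8}{5}f_{i+1} - \tfrac{1}{5}f_{i+2} + \tfrac{8}{315}f_{i+3} - \tfrac{1}{560}f_{i+4}\right) + O(h^8),
    $$

    and applying the same weights along y and z yields the 25-point stencil derived on the next slide.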
  • 27. APPLICATION DISCRETIZATION
     Derived equation for the unknown at position P, x(i,j,k): a 25-point stencil

        W4(i,j,k)*x(i−4,j,k) + W3(i,j,k)*x(i−3,j,k) + W2(i,j,k)*x(i−2,j,k) + W1(i,j,k)*x(i−1,j,k)
      + E1(i,j,k)*x(i+1,j,k) + E2(i,j,k)*x(i+2,j,k) + E3(i,j,k)*x(i+3,j,k) + E4(i,j,k)*x(i+4,j,k)
      + S4(i,j,k)*x(i,j−4,k) + S3(i,j,k)*x(i,j−3,k) + S2(i,j,k)*x(i,j−2,k) + S1(i,j,k)*x(i,j−1,k)
      + N1(i,j,k)*x(i,j+1,k) + N2(i,j,k)*x(i,j+2,k) + N3(i,j,k)*x(i,j+3,k) + N4(i,j,k)*x(i,j+4,k)
      + B4(i,j,k)*x(i,j,k−4) + B3(i,j,k)*x(i,j,k−3) + B2(i,j,k)*x(i,j,k−2) + B1(i,j,k)*x(i,j,k−1)
      + T1(i,j,k)*x(i,j,k+1) + T2(i,j,k)*x(i,j,k+2) + T3(i,j,k)*x(i,j,k+3) + T4(i,j,k)*x(i,j,k+4)
      + P(i,j,k)*x(i,j,k) = 0

     The coefficients express how strong the relationship is in the vicinity of P: 25 coefficients, 25 multiplies and 25 adds per equation, i.e. plenty of FMAs.
     A linear system of equations maps the domain: A*x = b, with A a square sparse matrix of coefficients (25 diagonals), x the vector of unknowns, and b the vector of boundary conditions.
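    To make the arithmetic intensity concrete, a minimal sketch of the inner loop such a 25-point variable-coefficient stencil implies is shown below (array layout, naming and threading are illustrative assumptions, not the code used for these measurements):

    ```c
    /* y = A*x for one sub-domain of nx*ny*nz points plus a 4-point halo on each side.
     * W,E,S,N,B,T hold 4 coefficient arrays per direction, P the diagonal; IDX linearizes (i,j,k). */
    #define IDX(i, j, k) (((size_t)(k) * (ny + 8) + (j)) * (nx + 8) + (i))

    void stencil25(int nx, int ny, int nz, const float *x, float *y,
                   const float *W[4], const float *E[4], const float *S[4],
                   const float *N[4], const float *B[4], const float *T[4], const float *P)
    {
        #pragma omp parallel for              /* threading over the k loop, as on slide 30 */
        for (int k = 4; k < nz + 4; k++)
            for (int j = 4; j < ny + 4; j++)
                for (int i = 4; i < nx + 4; i++) {
                    size_t c = IDX(i, j, k);
                    float acc = P[c] * x[c];
                    for (int d = 1; d <= 4; d++) {   /* 4 neighbours per direction, 8th order */
                        acc += W[d-1][c] * x[IDX(i-d, j, k)] + E[d-1][c] * x[IDX(i+d, j, k)];
                        acc += S[d-1][c] * x[IDX(i, j-d, k)] + N[d-1][c] * x[IDX(i, j+d, k)];
                        acc += B[d-1][c] * x[IDX(i, j, k-d)] + T[d-1][c] * x[IDX(i, j, k+d)];
                    }
                    y[c] = acc;
                }
    }
    ```

    Each output point reads 25 coefficients and 25 neighbour values for 25 multiply-add pairs, which is why the slide notes that FMA units and memory bandwidth, rather than raw clock speed, dominate the per-card throughput.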
  • 28. APPLICATION COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION
     Typically several linear systems are coupled, depending on the complexity of the phenomenology (CFD usually no fewer than 5 to 7: U, V, W, P, T, k, e).
     Compute card with up to 64 GB and 8 cores: up to 8 GB/core, and up to 1 GB per linear system per core.
     25 coefficient vectors + unknown (x) + right-hand side (b) + residual vector (r) + auxiliary vectors (t) ~ 30 vectors (in 3D).
     1 single precision (SP) float is 4 bytes. Cube root of (1 GB * (1 SP float / 4 B) / 30 vectors) ≈ 200 points in each direction per core.
     Each core can crunch 8 linear systems, one after another, each over a volume of 200x200x200 points.
     Each core exchanges halos (data needed for computation but computed on neighbor cores) with a width of 4 points with its 6 neighbors (West, East, South, North, Bottom, Top) for a 3D partitioning.
     200 x 200 x 4 halo points * (4 B / SP float) = 0.61 MB exchanged with each of the 6 neighbors per linear system at every computation of 200x200x200 points.
     200x200x200 * (4 B / SP float) = 30 MB to checkpoint; remember that the HDDs have a 64 MB cache.
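    Spelling out the per-core sizing arithmetic above (my worked version of the slide's figures):

    $$
    \sqrt[3]{\frac{1\,\mathrm{GB} \times \frac{1\ \text{float}}{4\,\mathrm{B}}}{30\ \text{vectors}}} \approx \sqrt[3]{8.9\times 10^{6}\ \text{points}} \approx 207 \approx 200\ \text{points per direction per core},
    $$
    $$
    200 \times 200 \times 4\ \text{points} \times 4\,\mathrm{B} \approx 0.61\,\mathrm{MB\ per\ halo\ face}, \qquad 200^3 \times 4\,\mathrm{B} \approx 30\,\mathrm{MB\ per\ checkpoint}.
    $$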
  • 29. APPLICATION IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO
     Advantages of high order schemes:
    ‒ A coarser grid at higher order (2nd, 4th, 8th) for the same accuracy.
    ‒ Higher FLOP/byte and FLOP/W ratios at higher order, due to better utilization of vector instructions (implementation dependent; otherwise the stencil is extremely memory bound).
    ‒ Better network bandwidth utilization due to the larger halo message size at higher order.
     Tradeoff: higher communication volume at higher order.
    ‒ Can leverage multirail (MPI over multiple fabric nodes, as shown in the micro-benchmarks) for neighbor communications.
    ‒ Larger messages provide more chances to overlap communication with computation.
    ‒ More tolerant of network latency.
  • 30. APPLICATION P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE
     8 cores per compute card. Multithreaded computation with OpenMP threads; threading only in the k loop of (i,j,k).
     In the general case, 6 halo exchanges (West, East, South, North, Bottom, Top) with neighbor compute cards; halo (message size) of 200x200x4 points. A sketch of the exchange is shown after this list.
     Best HW mapping: 1x8x8 partitions
    ‒ No partitioning across X fabric nodes
    ‒ 8 partitions across Y fabric nodes
    ‒ 8 partitions across Z fabric nodes
     Best algorithmic mapping: 4x4x4
    ‒ Less exchange volume than 1x8x8: 4 faces x (N*N/8) = (4/8)*N² for 1x8x8 vs 6 faces x (N/4 * N/4) = (3/8)*N² for 4x4x4.
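    A minimal sketch of a non-blocking halo exchange for one face pair (buffer packing, Cartesian communicator layout and names are assumptions; the other two directions follow the same pattern):

    ```c
    #include <mpi.h>

    /* Exchange 4-point-wide halos with the west/east neighbours for one field.
     * send_w/send_e and recv_w/recv_e are pre-packed contiguous buffers of
     * halo_count floats (200 x 200 x 4 points here). */
    void exchange_halo_x(MPI_Comm cart, float *send_w, float *send_e,
                         float *recv_w, float *recv_e, int halo_count)
    {
        int west, east;
        MPI_Cart_shift(cart, 0, 1, &west, &east);   /* MPI_PROC_NULL at open boundaries */

        MPI_Request req[4];
        MPI_Irecv(recv_w, halo_count, MPI_FLOAT, west, 0, cart, &req[0]);
        MPI_Irecv(recv_e, halo_count, MPI_FLOAT, east, 1, cart, &req[1]);
        MPI_Isend(send_e, halo_count, MPI_FLOAT, east, 0, cart, &req[2]);
        MPI_Isend(send_w, halo_count, MPI_FLOAT, west, 1, cart, &req[3]);

        /* ... interior stencil work can proceed here to overlap with the exchange ... */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }
    ```

    Aggregating all 8 threads' halo data into one ~0.6 MB message per face, as described on slide 32, keeps the messages large enough for the fabric's multirail striping to pay off.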
  • 31. APPLICATION ITERATIVE ALGORITHM
    [Flow diagram] Each core (core 1, core 2, ... core 512 for a full chassis) solves linear systems 1 through 8 in sequence. After each linear system there is a bidirectional exchange of halos with the neighbor domains hosted by other cores/processors/compute cards. After all systems, overall convergence is checked: if not yet converged, the equation values are checkpointed and the linear systems are solved again; if converged, the run ends.
  • 32. APPLICATION PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION
     Compute card with 1 CPU = 1 NUMA node: no chance for NUMA misses.
    ‒ Easy to leverage OpenMP within MPI code without having to worry about remote memory accesses.
    ‒ Hybrid MPI+OpenMP reduces the communication overhead of MPI over Ethernet.
     3 compute units can max out the memory controller bandwidth (plenty of computing capability).
    ‒ 1 compute unit/core could be dedicated to I/O (MPI + check-pointing) to fully overlap with the computation stages.
     Single core per CPU for MPI communications:
    ‒ Aggregating halo data for all threads to send more data per message.
    ‒ Leveraging MPI non-blocking communications for halo exchange.
    ‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth (0.6 MB/message).
    ‒ Hybrid reduction for inner products using Allreduce communication + OpenMP reduction.
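    A minimal sketch of the hybrid inner product mentioned in the last bullet (function and variable names are illustrative): each rank reduces its local part with OpenMP threads, then one MPI_Allreduce combines the 64 per-card partial sums, matching the Allreduce timings on slide 22.

    ```c
    #include <mpi.h>

    /* Hybrid OpenMP + MPI inner product over a distributed vector pair. */
    double dot(const float *x, const float *y, long n_local, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;

        #pragma omp parallel for reduction(+ : local)   /* per-card threaded partial sum */
        for (long i = 0; i < n_local; i++)
            local += (double)x[i] * (double)y[i];

        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);  /* one value per card on the wire */
        return global;
    }
    ```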
  • 33. APPLICATION PERFORMANCE SUMMARY
     Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision (strong scaling, not weak scaling).
     Starting with 8 compute cards (1 z plane), ~55 GB per card (64 GB available per card), and scaling all the way to 64 cards (8 z planes), ~1 GB per core, for 512 cores in the chassis.

      # compute cards   Computation (Mcells/s)   Speed-up wrt 8 cards   Efficiency wrt 8 cards   Comm overhead: halo exchange   Comm overhead: reduction
             8                   273                     8.0                   100%                        5.5%                          6%
            16                   536                    15.7                    98.1%                      5.5%                          7%
            32                  1065                    31.2                    97.5%                      5.7%                          8%
            64                  2048                    60.0                    93.7%                      5.8%                         11%

     Halo exchange overhead stays constant as the card count grows: the total volume exchanged does not change.
     Reduction overhead grows with the card count, as expected from the Allreduce micro-benchmark.
    [Chart: 3DFD speed-up on SeaMicro vs number of compute cards (8-64), measured against ideal]
  • 34. CONCLUSIONS
     Proven suitability (i.e. scalability) for 3D finite difference stencil computations, leveraging the latest software programming paradigms (MPI + OpenMP).
    ‒ This is a proxy for many other High Performance Computing applications with similar computational requirements (e.g. manufacturing, oil and gas, weather...).
     System advantages:
    ‒ High computing density (performance and performance per Watt) in a 10U form factor, per compute/storage card.
    ‒ Scalability provided through the Seamicro fabric.
    ‒ High flexibility in compute, network and storage configurations, adjusted to your workload requirements, as demonstrated in this application.
  • 35. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.