NumaConnect
Einar Rustad, Co-Founder & CTO
September 2013

Copyright 2013

All rights reserved.

NumaConnect
• Cache Coherent Global Shared Memory and Shared IO
• Single Image Standard Operating System
• All kinds of APIs

Commodity Servers Tightly Coupled into One Monolithic System with NumaConnect

At Cluster Price
NumaConnect - Share Everything

Shared Everything - One Single Operating System Image
[Diagram: four server nodes, each with CPUs, memory, caches, I/O and a NumaChip with its remote cache, tied together by the NumaConnect fabric (no switch required).]
Technology Background
Convex Exemplar (Acquired by HP)
– First implementation of the ccNUMA architecture
from Dolphin in 1994

Data General Aviion (Acquired by EMC)
– Designed in 1996 with deliveries from 1997 - 2002
– Used Dolphin’s chips with 3 generations of processor/memory buses

I/O Attached Products for Clustering OEMs
– Sun Microsystems (SunCluster)
– Siemens RM600 Server (IO Expansion)
– Siemens Medical (3D CT)
– Philips Medical (3D Ultra Sound)
– Dassault/Thales Rafale

[Images: Dolphin’s cache chip and Dolphin’s low-latency clustering HW]

HPC Clusters (WulfKit w. Scali)
– First Low Latency Cluster Interconnect
NumaChip

IBM Microelectronics ASIC
FCPBGA1023, 33mm x 33mm, 1mm ball pitch, 4-2-4 package
IBM Cu-11 technology, ~2 million gates
Chip size 9x11mm

NumaConnect Card

NumaChip Top Block Diagram
[Block diagram: coherent HyperTransport interface (ccHT Cave), H2S, SCC with SDRAM cache and SDRAM tags, CSR, microcode, SPI init module with LC config data, SM/SPI interfaces, and a crossbar switch with link controllers and SERDES for the six links XA, XB, YA, YB, ZA, ZB.]
NumaConnect™ Node Configuration

[Diagram: two multi-core CPUs, each with four local memory channels, plus an I/O bridge, connected over coherent HyperTransport to the NumaChip and its cache+tags memory; 6 x4 SERDES links go to the fabric.]
NumaConnect™ System Architecture

[Diagram: a multi-CPU node (multi-core CPUs, local memory, I/O bridge) with a NumaChip and its NumaCache, connected into 2-D or 3-D torus fabrics.]

6 external links give flexible system configurations in multi-dimensional topologies.
Cabling Example

2-D Dataflow
[Diagram: request and response flow between three nodes (each with CPUs, caches and local memory) through their NumaChips over the 2-D fabric.]
Size does Matter

640K ought to be enough for anybody
– Bill Gates, 1981

We are looking for systems that can hold
10 – 20 TeraBytes in main Memory
– Trond J. Suul, Statoil, 2010
Scale-up Capacity

Single System Image or Multiple Partitions
Limits
– 256 TeraBytes Physical Address Space
– 4096 Nodes
– 196 608 cores

Largest and Most Cost Effective Coherent Shared Memory
NO Virtualization SW Layer!
Performance

LMbench - Chart
Latency - LMbench

[Chart: memory access latency (nanoseconds, log scale) vs. array size (MBytes); latency steps from about 1.25 ns in L1 cache through about 6.4 ns (L2) and 16.7 ns (L3), then rises through roughly 150-630 ns to about 1 088 ns for the largest arrays in remote memory.]
The Memory Hierarchy
Access Times in the Memory Hierarchy

L1 cache:       1.3 ns        (typical size ~100 KB)
L2 cache:       6.4 ns        (typical size ~1 MB)
L3 cache:       16.7 ns       (typical size ~8 MB)
Local memory:   90 ns
NumaCache:      308 ns
Remote memory:  1 087 ns      (up to < 256 TB of shared memory)
SSD:            100 000 ns
Rotating disk:  5 000 000 ns
Memory Bandwidth
STREAM version $Revision: 5.9 $
This system uses 8 bytes per DOUBLE PRECISION word.
Array size = 180000000000, Offset = 0
Total memory required = 4119873.0 MB
Each test is run 10 times, but only the *best* time for each is used.
Number of Threads requested = 828

Function   Rate (MB/s)     Avg time   Min time   Max time
Copy:      1599317.5028    1.9224     1.8008     2.1393
Scale:     1468219.1643    2.0954     1.9616     2.2290
Add:       1664455.1221    2.8375     2.5954     3.0947
Triad:     1492414.0721    3.0478     2.8946     3.3267

University of Oslo system:
• 72 nodes - IBM x3755
• 1 728 cores
• 4.6 TBytes Shared Memory
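For reference, the Triad figure above corresponds to the standard STREAM kernel; below is a minimal OpenMP sketch of that kernel. The array size N and the scalar are illustrative only and not the settings used in the run above.

#include <stdlib.h>
#include <stdio.h>

/* Minimal sketch of the STREAM Triad kernel with OpenMP.
   N and scalar are illustrative; the run above used a far larger
   array spread across all NumaConnect nodes. */
#define N (1UL << 27)               /* 128 Mi elements, ~1 GB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    /* First-touch initialization so each thread's pages land in its
       own node's local memory. */
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    const double scalar = 3.0;
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)  /* Triad: a = b + scalar*c */
        a[i] = b[i] + scalar * c[i];

    printf("a[0] = %f\n", a[0]);    /* keep the compiler honest */
    free(a); free(b); free(c);
    return 0;
}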
Memory Allocation and Initialization
Time to Allocate and Initialize a 15GB Array in Parallel

[Chart: time in seconds (left axis) to allocate and initialize a 15 GB array in parallel with 8, 16 and 104 processes, comparing ScaleMP and Numascale, with the improvement ratio on the right axis; the Numascale system finishes in seconds where ScaleMP needs hundreds of seconds, with improvement ratios above 100x at the larger process counts.]
MPI Latency (Pallas)

[Chart: NumaConnect MPI latency (microseconds) vs. message size from 0 bytes to 1 KB; latency stays below about 7 microseconds over the whole range.]
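Latency curves like the one above are typically measured with a ping-pong pattern, as in the Pallas/IMB PingPong test. Below is a minimal sketch of such a measurement; the message size and iteration count are illustrative, not the benchmark's settings.

#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong latency sketch: ranks 0 and 1 bounce a small message
   back and forth, and half the average round-trip time is reported as the
   one-way latency. */
#define SIZE  8
#define ITERS 10000

int main(int argc, char **argv)
{
    char buf[SIZE] = {0};
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / ITERS / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}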
MPI Barrier
MPI_Barrier on 32 Nodes
[Chart data, MPI_Barrier time in microseconds vs. number of processes:]

Processes   NumaConnect-BTL   Standard-SM-BTL   Improvement ratio
 2            1.78              8.09              4.5
 4            4.06             52.67             13.0
 8            7.43            102.66             13.8
16            8.52            155.84             18.3
32           10.72            213.38             19.9
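Barrier times like these are usually obtained by repeating the barrier many times and averaging the elapsed wall-clock time. A minimal sketch follows; the iteration count is illustrative.

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of a barrier-latency measurement: repeat MPI_Barrier
   many times and report the average time per barrier. */
#define ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* synchronize the start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Barrier time: %.2f us\n",
               (t1 - t0) / ITERS * 1e6);
    MPI_Finalize();
    return 0;
}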
MPI Barrier (Pallas)
MPI_Barrier, 2 - 256 processes
[Chart: MPI_Barrier time (microseconds) for 2-256 MPI processes, NumaConnect-BTL vs. Standard-SM-BTL, with the improvement ratio on a second axis; at small process counts the two are comparable (ratios around 0.7-1.3), while at the larger counts the standard shared-memory BTL climbs toward roughly 970 µs against roughly 50 µs for NumaConnect-BTL, improvement ratios of about 13x to 23x.]
Scaling Applications with
NumaConnect
Atle Vesterkjær, Numascale
av@numascale.com
September 2013

RWTH TrajSearch code
OpenMP programs can be made NUMA-aware by decomposing memory and making sure that all memory is “local” to the CPU running the process that uses it. Affinity settings can be used to distribute jobs in a NUMA-aware way (a code sketch of this first-touch decomposition follows after the chart). The graph shows that the application gets the most speedup on NumaConnect, even though it was originally adapted to ScaleMP.
[Chart: RWTH TrajSearch runtime in seconds (left axis) and speedup (right axis) for 8-512 threads, comparing SGI Altix UV (Nehalem EX), ScaleMP (Nehalem EX and SandyBridge EP), Bull BCS (Nehalem EX) and Numascale; the Numascale system reaches the highest speedup.]
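A minimal sketch of the decomposition idea described above, assuming GCC/libgomp on Linux: each thread first-touches its own chunk of the array so the pages are allocated on that thread's node, and the same static schedule is used in the compute phase so each thread works on the data it owns. The function name and sizes are illustrative only.

#include <stddef.h>

/* Minimal sketch of NUMA-aware decomposition with OpenMP first-touch:
   pages are physically allocated on the node of the thread that touches
   them first, so initializing and computing with the same static schedule
   keeps memory accesses local. */
void process(double *data, size_t n)
{
    /* First touch: each thread initializes its own static chunk. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        data[i] = 0.0;

    /* Compute phase: the same schedule means each thread works on the
       chunk that already lives in its node's local memory. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 2.0 + 1.0;
}

Binding threads to cores (for example with OMP_PROC_BIND or a GOMP_CPU_AFFINITY list like the affinity strings on the later NPB slides) keeps this thread-to-node mapping stable across the two loops.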
Stream on 16 Nodes
Applications that are scalable by design achieve great results on
a NumaConnect Shared Memory System.
[Chart: STREAM Triad bandwidth (MByte/s) vs. number of CPU sockets (1-32) on a 16-node, 32-socket, 192-core system, comparing NumaConnect with a 1041A-T2F reference server and with linear per-socket and per-node scaling.]
Porting MPI applications to NC-OpenMP

On a shared-memory system, OpenMP programs can be faster and carry less overhead than MPI programs. Both NUMA-aware OpenMP programs and MPI programs care about memory locality, so it is often possible to take an MPI program and convert it to a NUMA-aware OpenMP program with a shorter runtime; a sketch of the conversion pattern follows below. To demonstrate this, the NAS Parallel Benchmark SP has been used.

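A minimal sketch of the conversion pattern, with illustrative names only (this is not the actual SP code): the MPI version already partitions the data per rank, and the OpenMP version keeps the same partitioning per thread via first-touch and a static schedule, replacing halo exchanges with direct loads from shared memory.

#include <stddef.h>

/* MPI flavour (illustrative): each rank owns a contiguous slab and
   exchanges halo rows with its neighbours before computing, e.g.
 *
 *   MPI_Sendrecv(top_row, ..., rank-1, ..., halo_from_below, ..., rank+1, ...);
 *   for (i = 1; i < local_n - 1; i++) update(slab, i);
 */

/* NC-OpenMP flavour (illustrative): the whole array lives in the shared
   address space; first-touch with a static schedule gives each thread the
   same slab it will later update, and the "halo" is simply a neighbouring
   thread's memory read directly, with no message passing. */
void step(double *u, double *u_new, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 1; i < n - 1; i++)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* neighbour data read in place */
}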
NPB SP D-class Mop/s total
The SP benchmark exists in both an OpenMP version written for standalone servers and an MPI version written for distributed servers.

The graph shows the scalability of the SP benchmark code when converted to a NumaConnect-optimized version.

As the computational job is the same, we find the code written for distributed servers (MPI) easier to convert to a NumaConnect-optimized version, since memory locality also has to be considered.

[Chart: NPB-SP NC-OpenMP, CLASS=D, 8 NumaConnect nodes (256 cores) - total Mop/s: 20 700.87 at 16 threads, 35 600.82 at 36, 54 029.36 at 64, 62 988.6 at 121.]
NPB SP D-class runtime
The overhead introduced by MPI is not needed when we are running on a shared memory system.

[Chart: NPB-SP NC-OpenMP, Class D, 8 NumaConnect nodes - runtime: 1 426.8 s at 16 threads, 829.64 s at 36, 546.67 s at 64, 468.91 s at 121.]
NPB SP D-class runtime affinity effect
One chart shows the great scaling we get from NPB-SP on the NumaConnect system when the threads are evenly distributed on all NumaConnect nodes; the other shows the runtime when the threads are bound to only the first <n> cores. Using affinity to distribute the job evenly between the NumaConnect nodes (and cores in the system) leads to much shorter runtimes. These effects are analyzed further on the next slides.

[Charts: NPB-SP NC-OpenMP, Class D. Threads evenly distributed on all NumaConnect nodes: 1 426.8 / 829.64 / 546.67 / 468.91 s at 16 / 36 / 64 / 121 processes. Threads bound to only the first <n> cores: 5 701.24 / 2 496.88 / 1 512.98 / 967.72 s at 16 / 36 / 64 / 121 processes.]
NPB SP D-class runtime
The runtime using 121 cores is cut in half when running on 8 NumaConnect nodes compared to 4 NumaConnect nodes.

The AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module (two CPU dies in the package). Each die has 8 cores but only 4 FPUs. The number of cores per FPU in use is one when running on 8 NumaConnect nodes and two when running on 4 NumaConnect nodes.

[Chart: NPB-SP NC-OpenMP, Class D, 8 NumaConnect nodes, runtime with 121 cores vs. affinity: 998.86 s with affinity 0-127, 475.28 s with affinity 0-127:2.]
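The affinity strings above look like libgomp-style CPU lists (start-stop:stride); that interpretation, and the launcher used, are assumptions rather than something stated on the slides. A minimal sketch of achieving the same stride-2 placement programmatically on Linux:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Minimal sketch (assumption: Linux + GNU libgomp) of a stride-2 thread
   placement like the "0-...:2" affinity strings above: OpenMP thread t is
   pinned to core 2*t, so each thread gets a full Opteron 6300 module
   (and thus its own FPU) to itself. */
void bind_every_other_core(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2 * omp_get_thread_num(), &set);   /* cores 0, 2, 4, ... */
        sched_setaffinity(0, sizeof set, &set);    /* 0 = calling thread */
    }
}

The same effect can usually be had without code changes by exporting an affinity list such as GOMP_CPU_AFFINITY="0-255:2" before launching the program (again an assumption about the tooling used here).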
NPB SP D-class runtime
Runtime decreases when moving from 4 to 8 NumaConnect nodes using affinity settings and SP Class D: the runtime using 64 cores drops clearly when running on 8 NumaConnect nodes compared to 4. In this test the number of FPUs in use is constant, and the scaling is good when using more nodes.

The AMD Opteron™ 6300 series memory controller interface allows external connection of up to two memory channels per die. This memory interface is shared among all the CPU core pairs on each die.

[Chart: NPB-SP NC-OpenMP, Class D, runtime with 64 cores: 800.69 s with affinity 0-127:2 (4 nodes), 548.13 s with affinity 0-255:4 (8 nodes).]
NPB SP D-class runtime MPI vs OpenMP
The runtime for NPB SP Class D is the same for the MPI version as for the NC-OpenMP version when the number of cores is 121 or less. The MPI library overhead in the communication part of the runtime is not significant at these small sizes.

[Chart: NPB-SP Class D runtime (sec), OpenMP vs. MPI, threads evenly distributed on all NumaConnect nodes; at 16 / 36 / 64 / 121 processes the OpenMP runtimes are 1 426.8 / 829.64 / 546.67 / 468.91 s, with the MPI runtimes essentially the same.]
NPB SP D-class runtime MPI vs OpenMP
The runtime for NPB SP Class D is higher for MPI than for the NC-OpenMP version when the number of cores is 144 or higher. The MPI library overhead in the communication part of the runtime now plays a larger role.

[Chart: NPB-SP Class D runtime (sec), OpenMP vs. MPI, bound to only the first <n> cores, for 144 to 256 processes; the MPI runtimes (up to about 953 s) stay consistently above the NC-OpenMP runtimes (down to about 416 s).]
NPB SP E-class runtime
The SP benchmark E-class scales perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2).

E-class problems are more representative of large shared memory systems and clusters. As a general statement, larger problems scale better.

[Chart: NPB-SP NC-OpenMP, Class E, 8 NumaConnect nodes - runtime: 15 242.11 s at 64 processes, 7 246.13 s at 121 processes.]
NPB SP E-class runtime
The SP benchmark E-class scales well from 16 to 121 processes on a 32-node system with a total of 384 cores. NB: the 384-core test was run on an older system and is presented to show scaling, not absolute values.

[Chart: NPB-SP NC-OpenMP, Class E, 32 NumaConnect nodes - "Time in seconds": 53 922.05 s at affinity 0-383:24, 32 481.25 s at 0-383:10, 19 740.37 s at 0-383:6, 11 247.03 s at 0-363:3.]
Big Data with Shared Memory

“Any application requiring a large memory
footprint can benefit from a shared memory
computing environment.”
William W. Thigpen, Chief, Engineering
Branch, NASA Advanced Supercomputing (NAS)
Division

