NumaConnect
Einar Rustad, Co-Founder & CTO
September 2013

Copyright 2013

All rights reserved.

NumaConnect
• Cache Coherent Global Shared Memory and Shared IO
• Single Image Standard Operating System
• All kinds of APIs

Commodity Servers Tightly Coupled into One Monolithic System with NumaConnect

At Cluster Price
NumaConnect - Share Everything

Shared Everything - One Single Operating System Image
[Diagram: four server nodes, each with CPUs, memory, caches, I/O and a NumaChip with its remote cache, tied together by the NumaConnect fabric (no switch required).]
Technology Background
Convex Exemplar (Acquired by HP)
– First implementation of the ccNUMA architecture
from Dolphin in 1994

Data General Aviion (Acquired by EMC)
– Designed in 1996 with deliveries from 1997 - 2002
– Used Dolphin’s chips with 3 generations of processor/memory buses

I/O Attached Products for Clustering OEMs
– Sun Microsystems (SunCluster)
– Siemens RM600 Server (IO Expansion)
– Siemens Medical (3D CT)
– Philips Medical (3D Ultra Sound)
– Dassault/Thales Rafale

[Images: Dolphin’s cache chip and Dolphin’s low-latency clustering HW]

HPC Clusters (WulfKit w. Scali)
– First Low Latency Cluster Interconnect
NumaChip

IBM Microelectronics ASIC
FCPBGA1023, 33mm x 33mm, 1mm ball pitch, 4-2-4 package
IBM Cu-11 technology, ~2 million gates
Chip size 9x11mm

NumaConnect Card

NumaChip Top Block Diagram
[Block diagram: coherent HyperTransport interface (ccHT Cave), H2S, SCC with SDRAM cache and SDRAM tags, CSR, microcode, SPI init module with LC config data, SM/SPI interfaces, and a crossbar switch with link controllers and SERDES for the six links XA, XB, YA, YB, ZA, ZB.]
NumaConnect™ Node Configuration

[Diagram: two multi-core CPUs, each with four local memory channels, plus an I/O bridge, connected over coherent HyperTransport to the NumaChip and its cache+tags memory; 6 x4 SERDES links go to the fabric.]
NumaConnect™ System Architecture

[Diagram: a multi-CPU node (multi-core CPUs, local memory, I/O bridge) with a NumaChip and its NumaCache, connected into 2-D or 3-D torus fabrics.]

6 external links give flexible system configurations in multi-dimensional topologies.
Cabling Example

2-D Dataflow
[Diagram: request and response flow between three nodes (each with CPUs, caches and local memory) through their NumaChips over the 2-D fabric.]
Size does Matter

640K ought to be enough for anybody
– Bill Gates, 1981

We are looking for systems that can hold
10 – 20 TeraBytes in main Memory
– Trond J. Suul, Statoil, 2010
Scale-up Capacity

Single System Image or Multiple Partitions
Limits
– 256 TeraBytes Physical Address Space
– 4096 Nodes
– 196 608 cores

Largest and Most Cost Effective Coherent Shared Memory
NO Virtualization SW Layer!
Performance

LMbench - Chart
Latency - LMbench

[Chart: memory access latency (nanoseconds, log scale) vs. array size (MBytes); latency steps from about 1.25 ns in L1 cache through about 6.4 ns (L2) and 16.7 ns (L3), then rises through roughly 150-630 ns to about 1 088 ns for the largest arrays in remote memory.]
The Memory Hierarchy
Access Times in the Memory Hierarchy

L1 cache:       1.3 ns        (typical size ~100 KB)
L2 cache:       6.4 ns        (typical size ~1 MB)
L3 cache:       16.7 ns       (typical size ~8 MB)
Local memory:   90 ns
NumaCache:      308 ns
Remote memory:  1 087 ns      (up to < 256 TB of shared memory)
SSD:            100 000 ns
Rotating disk:  5 000 000 ns
Memory Bandwidth
STREAM version $Revision: 5.9 $
This system uses 8 bytes per DOUBLE PRECISION word.
Array size = 180000000000, Offset = 0
Total memory required = 4119873.0 MB
Each test is run 10 times, but only the *best* time for each is used.
Number of Threads requested = 828

Function   Rate (MB/s)     Avg time   Min time   Max time
Copy:      1599317.5028    1.9224     1.8008     2.1393
Scale:     1468219.1643    2.0954     1.9616     2.2290
Add:       1664455.1221    2.8375     2.5954     3.0947
Triad:     1492414.0721    3.0478     2.8946     3.3267

University of Oslo system:
• 72 nodes - IBM x3755
• 1 728 cores
• 4.6 TBytes Shared Memory
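For reference, the Triad figure above corresponds to the standard STREAM kernel; below is a minimal OpenMP sketch of that kernel. The array size N and the scalar are illustrative only and not the settings used in the run above.

#include <stdlib.h>
#include <stdio.h>

/* Minimal sketch of the STREAM Triad kernel with OpenMP.
   N and scalar are illustrative; the run above used a far larger
   array spread across all NumaConnect nodes. */
#define N (1UL << 27)               /* 128 Mi elements, ~1 GB per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    /* First-touch initialization so each thread's pages land in its
       own node's local memory. */
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    const double scalar = 3.0;
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)  /* Triad: a = b + scalar*c */
        a[i] = b[i] + scalar * c[i];

    printf("a[0] = %f\n", a[0]);    /* keep the compiler honest */
    free(a); free(b); free(c);
    return 0;
}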
Memory Allocation and Initialization
Time to Allocate and Initialize a 15GB Array in Parallel

[Chart: time in seconds (left axis) to allocate and initialize a 15 GB array in parallel with 8, 16 and 104 processes, comparing ScaleMP and Numascale, with the improvement ratio on the right axis; the Numascale system finishes in seconds where ScaleMP needs hundreds of seconds, with improvement ratios above 100x at the larger process counts.]
MPI Latency (Pallas)

[Chart: NumaConnect MPI latency (microseconds) vs. message size from 0 bytes to 1 KB; latency stays below about 7 microseconds over the whole range.]
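Latency curves like the one above are typically measured with a ping-pong pattern, as in the Pallas/IMB PingPong test. Below is a minimal sketch of such a measurement; the message size and iteration count are illustrative, not the benchmark's settings.

#include <mpi.h>
#include <stdio.h>

/* Minimal ping-pong latency sketch: ranks 0 and 1 bounce a small message
   back and forth, and half the average round-trip time is reported as the
   one-way latency. */
#define SIZE  8
#define ITERS 10000

int main(int argc, char **argv)
{
    char buf[SIZE] = {0};
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) / ITERS / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}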
MPI Barrier
MPI_Barrier on 32 Nodes
[Chart data, MPI_Barrier time in microseconds vs. number of processes:]

Processes   NumaConnect-BTL   Standard-SM-BTL   Improvement ratio
 2            1.78              8.09              4.5
 4            4.06             52.67             13.0
 8            7.43            102.66             13.8
16            8.52            155.84             18.3
32           10.72            213.38             19.9
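Barrier times like these are usually obtained by repeating the barrier many times and averaging the elapsed wall-clock time. A minimal sketch follows; the iteration count is illustrative.

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of a barrier-latency measurement: repeat MPI_Barrier
   many times and report the average time per barrier. */
#define ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* synchronize the start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Barrier time: %.2f us\n",
               (t1 - t0) / ITERS * 1e6);
    MPI_Finalize();
    return 0;
}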
MPI Barrier (Pallas)
MPI_Barrier, 2 - 256 processes
[Chart: MPI_Barrier time (microseconds) for 2-256 MPI processes, NumaConnect-BTL vs. Standard-SM-BTL, with the improvement ratio on a second axis; at small process counts the two are comparable (ratios around 0.7-1.3), while at the larger counts the standard shared-memory BTL climbs toward roughly 970 µs against roughly 50 µs for NumaConnect-BTL, improvement ratios of about 13x to 23x.]
Scaling Applications with
NumaConnect
Atle Vesterkjær, Numascale
av@numascale.com
September 2013

RWTH TrajSearch code
OpenMP programs can be made NUMA-aware by decomposing memory and making sure that all memory is “local” to the CPU running the process that uses it. Affinity settings can be used to distribute jobs in a NUMA-aware way (a code sketch of this first-touch decomposition follows after the chart). The graph shows that the application gets the most speedup on NumaConnect, even though it was originally adapted to ScaleMP.
[Chart: RWTH TrajSearch runtime in seconds (left axis) and speedup (right axis) for 8-512 threads, comparing SGI Altix UV (Nehalem EX), ScaleMP (Nehalem EX and SandyBridge EP), Bull BCS (Nehalem EX) and Numascale; the Numascale system reaches the highest speedup.]
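A minimal sketch of the decomposition idea described above, assuming GCC/libgomp on Linux: each thread first-touches its own chunk of the array so the pages are allocated on that thread's node, and the same static schedule is used in the compute phase so each thread works on the data it owns. The function name and sizes are illustrative only.

#include <stddef.h>

/* Minimal sketch of NUMA-aware decomposition with OpenMP first-touch:
   pages are physically allocated on the node of the thread that touches
   them first, so initializing and computing with the same static schedule
   keeps memory accesses local. */
void process(double *data, size_t n)
{
    /* First touch: each thread initializes its own static chunk. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        data[i] = 0.0;

    /* Compute phase: the same schedule means each thread works on the
       chunk that already lives in its node's local memory. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 2.0 + 1.0;
}

Binding threads to cores (for example with OMP_PROC_BIND or a GOMP_CPU_AFFINITY list like the affinity strings on the later NPB slides) keeps this thread-to-node mapping stable across the two loops.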
Stream on 16 Nodes
Applications that are scalable by design achieve great results on
a NumaConnect Shared Memory System.
[Chart: STREAM Triad bandwidth (MByte/s) vs. number of CPU sockets (1-32) on a 16-node, 32-socket, 192-core system, comparing NumaConnect with a 1041A-T2F reference server and with linear per-socket and per-node scaling.]
Porting MPI applications to NC-OpenMP

On a shared-memory system, OpenMP programs can be faster and carry less overhead than MPI programs. Both NUMA-aware OpenMP programs and MPI programs care about memory locality, so it is often possible to take an MPI program and convert it to a NUMA-aware OpenMP program with a shorter runtime; a sketch of the conversion pattern follows below. To demonstrate this, the NAS Parallel Benchmark SP has been used.

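A minimal sketch of the conversion pattern, with illustrative names only (this is not the actual SP code): the MPI version already partitions the data per rank, and the OpenMP version keeps the same partitioning per thread via first-touch and a static schedule, replacing halo exchanges with direct loads from shared memory.

#include <stddef.h>

/* MPI flavour (illustrative): each rank owns a contiguous slab and
   exchanges halo rows with its neighbours before computing, e.g.
 *
 *   MPI_Sendrecv(top_row, ..., rank-1, ..., halo_from_below, ..., rank+1, ...);
 *   for (i = 1; i < local_n - 1; i++) update(slab, i);
 */

/* NC-OpenMP flavour (illustrative): the whole array lives in the shared
   address space; first-touch with a static schedule gives each thread the
   same slab it will later update, and the "halo" is simply a neighbouring
   thread's memory read directly, with no message passing. */
void step(double *u, double *u_new, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 1; i < n - 1; i++)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* neighbour data read in place */
}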
NPB SP D-class Mop/s total
The SP benchmark exists in both an OpenMP version written for standalone servers and an MPI version written for distributed servers.

The graph shows the scalability of the SP benchmark code when converted to a NumaConnect-optimized version.

As the computational job is the same, we find the code written for distributed servers (MPI) easier to convert to a NumaConnect-optimized version, since memory locality also has to be considered.

[Chart: NPB-SP NC-OpenMP, CLASS=D, 8 NumaConnect nodes (256 cores) - total Mop/s: 20 700.87 at 16 threads, 35 600.82 at 36, 54 029.36 at 64, 62 988.6 at 121.]
NPB SP D-class runtime
The overhead introduced by MPI is not needed when we are running on a shared memory system.

[Chart: NPB-SP NC-OpenMP, Class D, 8 NumaConnect nodes - runtime: 1 426.8 s at 16 threads, 829.64 s at 36, 546.67 s at 64, 468.91 s at 121.]
NPB SP D-class runtime affinity effect
One chart shows the great scaling we get from NPB-SP on the NumaConnect system when the threads are evenly distributed on all NumaConnect nodes; the other shows the runtime when the threads are bound to only the first <n> cores. Using affinity to distribute the job evenly between the NumaConnect nodes (and cores in the system) leads to much shorter runtimes. These effects are analyzed further on the next slides.

[Charts: NPB-SP NC-OpenMP, Class D. Threads evenly distributed on all NumaConnect nodes: 1 426.8 / 829.64 / 546.67 / 468.91 s at 16 / 36 / 64 / 121 processes. Threads bound to only the first <n> cores: 5 701.24 / 2 496.88 / 1 512.98 / 967.72 s at 16 / 36 / 64 / 121 processes.]
NPB SP D-class runtime
The runtime using 121 cores is cut in half when running on 8 NumaConnect nodes compared to 4 NumaConnect nodes.

The AMD Opteron™ 6300 series processor is organized as a Multi-Chip Module (two CPU dies in the package). Each die has 8 cores but only 4 FPUs. The number of cores per FPU in use is one when running on 8 NumaConnect nodes and two when running on 4 NumaConnect nodes.

[Chart: NPB-SP NC-OpenMP, Class D, 8 NumaConnect nodes, runtime with 121 cores vs. affinity: 998.86 s with affinity 0-127, 475.28 s with affinity 0-127:2.]
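The affinity strings above look like libgomp-style CPU lists (start-stop:stride); that interpretation, and the launcher used, are assumptions rather than something stated on the slides. A minimal sketch of achieving the same stride-2 placement programmatically on Linux:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Minimal sketch (assumption: Linux + GNU libgomp) of a stride-2 thread
   placement like the "0-...:2" affinity strings above: OpenMP thread t is
   pinned to core 2*t, so each thread gets a full Opteron 6300 module
   (and thus its own FPU) to itself. */
void bind_every_other_core(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2 * omp_get_thread_num(), &set);   /* cores 0, 2, 4, ... */
        sched_setaffinity(0, sizeof set, &set);    /* 0 = calling thread */
    }
}

The same effect can usually be had without code changes by exporting an affinity list such as GOMP_CPU_AFFINITY="0-255:2" before launching the program (again an assumption about the tooling used here).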
NPB SP D-class runtime
Runtime decreases when moving from 4 to 8 NumaConnect nodes using affinity settings and SP Class D: the runtime using 64 cores drops clearly when running on 8 NumaConnect nodes compared to 4. In this test the number of FPUs in use is constant, and the scaling is good when using more nodes.

The AMD Opteron™ 6300 series memory controller interface allows external connection of up to two memory channels per die. This memory interface is shared among all the CPU core pairs on each die.

[Chart: NPB-SP NC-OpenMP, Class D, runtime with 64 cores: 800.69 s with affinity 0-127:2 (4 nodes), 548.13 s with affinity 0-255:4 (8 nodes).]
NPB SP D-class runtime MPI vs OpenMP
The runtime for NPB SP Class D is the same for the MPI version as for the NC-OpenMP version when the number of cores is 121 or less. The MPI library overhead in the communication part of the runtime is not significant at these small sizes.

[Chart: NPB-SP Class D runtime (sec), OpenMP vs. MPI, threads evenly distributed on all NumaConnect nodes; at 16 / 36 / 64 / 121 processes the OpenMP runtimes are 1 426.8 / 829.64 / 546.67 / 468.91 s, with the MPI runtimes essentially the same.]
NPB SP D-class runtime MPI vs OpenMP
The runtime for NPB SP Class D is higher for MPI than for the NC-OpenMP version when the number of cores is 144 or higher. The MPI library overhead in the communication part of the runtime now plays a larger role.

[Chart: NPB-SP Class D runtime (sec), OpenMP vs. MPI, bound to only the first <n> cores, for 144 to 256 processes; the MPI runtimes (up to about 953 s) stay consistently above the NC-OpenMP runtimes (down to about 416 s).]
NPB SP E-class runtime
The SP benchmark E-class scales perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2).

E-class problems are more representative of large shared memory systems and clusters. As a general statement, larger problems scale better.

[Chart: NPB-SP NC-OpenMP, Class E, 8 NumaConnect nodes - runtime: 15 242.11 s at 64 processes, 7 246.13 s at 121 processes.]
NPB SP E-class runtime
The SP benchmark E-class scales well from 16 to 121 processes on a 32-node system with a total of 384 cores. NB: the 384-core test was run on an older system and is presented to show scaling, not absolute values.

[Chart: NPB-SP NC-OpenMP, Class E, 32 NumaConnect nodes - "Time in seconds": 53 922.05 s at affinity 0-383:24, 32 481.25 s at 0-383:10, 19 740.37 s at 0-383:6, 11 247.03 s at 0-363:3.]
Big Data with Shared Memory

“Any application requiring a large memory
footprint can benefit from a shared memory
computing environment.”
William W. Thigpen, Chief, Engineering
Branch, NASA Advanced Supercomputing (NAS)
Division

