NumaConnect
Einar Rustad, Co-Founder & CTO
September 2013

Copyright 2013

All rights reserved.

Numascale Product IBM

Presentation from the HPC event at IBM Denmark - September 2013, Copenhagen


  1. NumaConnect. Einar Rustad, Co-Founder & CTO. September 2013.
  2. NumaConnect
     • Cache coherent global shared memory and shared I/O
     • Single-image standard operating system
     • All kinds of APIs
     Commodity servers tightly coupled into one monolithic system with NumaConnect, at cluster price.
  3. NumaConnect: Share Everything
     Shared everything, one single operating system image.
     [Diagram: four nodes, each with CPUs, memory, caches, I/O, and a NumaChip with a remote cache, joined by the NumaConnect fabric (no switch required).]
  4. Technology Background
     Convex Exemplar (acquired by HP)
     – First implementation of the ccNUMA architecture from Dolphin, in 1994
     Data General Aviion (acquired by EMC)
     – Designed in 1996, with deliveries from 1997 to 2002
     – Used Dolphin's chips with 3 generations of processor/memory buses
     I/O-attached products for clustering OEMs
     – Sun Microsystems (SunCluster)
     – Siemens RM600 Server (I/O expansion)
     – Siemens Medical (3D CT)
     – Philips Medical (3D ultrasound)
     – Dassault/Thales Rafale
     HPC clusters (WulfKit with Scali)
     – First low-latency cluster interconnect
     [Photos: Dolphin's cache chip; Dolphin's low-latency clustering hardware]
  5. NumaChip
     IBM Microelectronics ASIC
     FCPBGA1023, 33 mm x 33 mm, 1 mm ball pitch, 4-2-4 package
     IBM Cu-11 technology, ~2 million gates
     Chip size 9 x 11 mm
  6. NumaConnect Card [photo]
  7. NumaChip Top Block Diagram
     [Block diagram: ccHT cave on the HyperTransport interface; H2S with SDRAM cache and SDRAM tags; CSR, microcode, and LC config data; SPI init module; SCC; crossbar switch, link controllers, and SERDES driving the six links XA, XB, YA, YB, ZA, ZB.]
  8. NumaConnect Node Configuration
     [Diagram: two multi-core CPUs, each with four memory channels, connected over coherent HyperTransport to an I/O bridge and to the NumaChip with NumaCache + tags and 6 x4 SERDES links.]
  9. NumaConnect System Architecture
     Six external links allow flexible system configurations in multi-dimensional topologies, e.g. 2-D or 3-D torus.
     [Diagram: a multi-CPU node (multi-core CPUs, memory, I/O bridge) attached to a NumaChip with NumaCache.]
 10. Cabling Example [photo]
 11. 2-D Dataflow
     [Diagram: three nodes, each with CPUs, four memory banks, and a NumaChip with caches; request and response paths traverse the 2-D fabric.]
 12. Size does Matter
     "640K ought to be enough for anybody" – Bill Gates, 1981
     "We are looking for systems that can hold 10-20 TeraBytes in main memory" – Trond J. Suul, Statoil, 2010
 13. Scale-up Capacity
     Single system image or multiple partitions. Limits:
     – 256 TeraBytes physical address space
     – 4096 nodes
     – 196 608 cores
     Largest and most cost-effective coherent shared memory. No virtualization SW layer!
 14. Performance
 15. LMbench
     [Chart: LMbench latency in nanoseconds (log scale, 1 to 10 000) vs. array size from below 1 MByte to 4096 MBytes; measured values step from 1.251 ns through 6.421 and 16.732 ns up to 147.125, 287.01, 304.582, 308.61, 628.583, and 1087.576 ns.]
 16. The Memory Hierarchy
     Access times in the memory hierarchy (nanoseconds, with typical sizes):
     – L1 cache: 1.3 (100 KB)
     – L2 cache: 6.4 (1 MB)
     – L3 cache: 16.7 (8 MB)
     – Local memory: 90 (4 GB)
     – NumaCache: 308 (240 GB)
     – Remote memory: 1 087 (< 256 TB)
     – SSD: 100 000 (n x 1 TB)
     – Rotating disk: 5 000 000 (n x 5 TB)
 17. Memory Bandwidth
     STREAM version 5.9 at the University of Oslo: 72 IBM x3755 nodes, 1 728 cores, 4.6 TBytes shared memory.
     Array size = 180 000 000 000 (8 bytes per DOUBLE PRECISION word), total memory required = 4 119 873.0 MB, 828 threads requested; each test is run 10 times and only the best time for each is used.
     Function   Rate (MB/s)     Avg time   Min time   Max time
     Copy:      1599317.5028    1.9224     1.8008     2.1393
     Scale:     1468219.1643    2.0954     1.9616     2.2290
     Add:       1664455.1221    2.8375     2.5954     3.0947
     Triad:     1492414.0721    3.0478     2.8946     3.3267
 18. Memory Allocation and Initialization
     [Chart: time in seconds to allocate and initialize a 15 GB array in parallel with 8, 16, and 104 processes, ScaleMP vs. Numascale, with the improvement ratio on a second axis; Numascale finishes in a few seconds (down to about 1 s) while ScaleMP needs several hundred seconds (roughly 211 to 450 s).]
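     The reason allocation-and-initialization time matters on a large NUMA system is first-touch page placement: malloc only reserves address space, and each physical page lands on the node of the thread that first writes it. A hedged sketch of the parallel-initialization pattern (the function name is illustrative, not from the deck):

```c
#include <stdlib.h>

/* malloc reserves address space; physical pages are allocated on
   first touch. Initializing in parallel with a static schedule puts
   each thread's chunk of pages on that thread's local node, instead
   of every page landing on the node of one initializing thread. */
double *alloc_init_parallel(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (!a)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;              /* first touch decides page placement */
    return a;
}
```

     Later compute loops that use the same static schedule then read mostly node-local memory.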
 19. MPI Latency (Pallas)
     [Chart: NumaConnect MPI latency in microseconds vs. message size from 0 bytes to 1 KB; latency stays below about 7 microseconds over the whole range.]
 20. MPI Barrier
     [Chart: MPI_Barrier on 32 nodes for 2-32 processes, NumaConnect-BTL vs. Standard-SM-BTL in microseconds, with the improvement ratio on a second axis; NumaConnect-BTL stays around 2-11 microseconds while the standard shared-memory BTL grows from 52.67 to 213.38 microseconds, an improvement of up to roughly 20x.]
 21. MPI Barrier (Pallas)
     [Chart: MPI_Barrier for 2-256 MPI processes, NumaConnect-BTL vs. Standard-SM-BTL in microseconds, with the improvement ratio on a second axis; the advantage grows with process count, reaching roughly 20x at 256 processes (about 48.5 vs. 969.3 microseconds).]
 22. Scaling Applications with NumaConnect. Atle Vesterkjær, Numascale (av@numascale.com). September 2013.
 23. RWTH TrajSearch Code
     OpenMP programs can be made NUMA-aware by decomposing memory and making sure that all memory is "local" to the CPU running the process that uses it. Affinity settings can be used to distribute jobs in a NUMA-aware way. The graph shows that the application gets the most speedup on NumaConnect, even though it was originally adapted to ScaleMP.
     [Chart: runtime in seconds and speedup vs. number of threads (8-512) for SGI Altix UV (Nehalem EX), ScaleMP (Nehalem EX and SandyBridge EP), Bull BCS (Nehalem EX), and Numascale.]
 24. Stream on 16 Nodes
     Applications that are scalable by design achieve great results on a NumaConnect shared memory system.
     [Chart: STREAM Triad scaling on 16 nodes (32 sockets, 192 cores) in MByte/s for 1-32 CPU sockets: measured NumaConnect and 1041A-T2F results against linear per-socket and per-node scaling.]
 25. Porting MPI Applications to NC-OpenMP
     OpenMP programs are faster and have less overhead than MPI programs. Both NUMA-aware programs and MPI programs care about memory locality. It is therefore often possible to take an MPI program and convert it to a NUMA-aware OpenMP program with shorter runtime. To demonstrate this, the NAS Parallel Benchmark SP has been used.
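     As a hedged sketch of the kind of conversion meant here (not the actual SP code): an MPI global sum, where each rank reduces its slice of the data and the ranks combine results with MPI_Allreduce, collapses into a single OpenMP reduction loop over the shared array.

```c
/* MPI version (schematic):
 *     double local = 0.0;
 *     for (long i = lo; i < hi; i++) local += a[i];   // this rank's slice
 *     MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
 *
 * Shared-memory NC-OpenMP port: the explicit decomposition and the
 * Allreduce collapse into one reduction loop; schedule(static) keeps
 * each thread on the slice of the array it first-touched. */
double global_sum(const double *a, long n) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

     The communication step disappears entirely, which is the MPI overhead the following slides measure.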
 26. NPB SP D-class Mop/s Total
     The SP benchmark exists in both an OpenMP version written for standalone servers and an MPI version written for distributed servers. Since the computational job is the same, we find the code written for distributed servers (MPI) easier to convert to a NumaConnect-optimized version, since we also need to consider memory locality. The graph shows the scalability of the SP benchmark code when converted to a NumaConnect-optimized version.
     [Chart: NPB-SP NC-OpenMP, class D, 256 cores, 8 NumaConnect nodes. Mop/s total: 20 700.87 at 16 threads, 35 600.82 at 36, 54 029.36 at 64, and 62 988.6 at 121 threads.]
 27. NPB SP D-class Runtime
     The overhead introduced by MPI is not needed when we are running on a shared memory system.
     [Chart: NPB-SP NC-OpenMP class D, 8 NumaConnect nodes. Runtime: 1426.8 s at 16 threads, 829.64 s at 36, 546.67 s at 64, and 468.91 s at 121 threads.]
 28. NPB SP D-class Runtime: Affinity Effect
     The figure on the left illustrates the great scaling we get from NPB-SP on the NumaConnect system. The figure on the right illustrates that using affinity to distribute the job evenly between the NumaConnect nodes (and cores in the system) leads to shorter runtime. These effects are analyzed further on the next slides.
     [Charts: NPB-SP NC-OpenMP class D runtime for 16, 36, 64, and 121 processes. Threads evenly distributed on all NumaConnect nodes: 1426.8, 829.64, 546.67, and 468.91 s. Threads bound to only the first <n> cores: 5701.24, 2496.88, 1512.98, and 967.72 s.]
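     A quick way to verify that affinity settings such as 0-255:4 (e.g. libgomp's GOMP_CPU_AFFINITY range:stride syntax) actually spread the threads across nodes is to have each thread report the CPU it landed on. A Linux-specific sketch using glibc's sched_getcpu():

```c
#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu(), glibc, Linux-specific */
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }   /* serial fallback */
#endif

/* Print each thread's CPU; returns how many threads saw a valid CPU.
   Run under different GOMP_CPU_AFFINITY settings to watch placement change. */
int report_placement(void) {
    int ok = 0;
    #pragma omp parallel reduction(+:ok)
    {
        int cpu = sched_getcpu();
        printf("thread %d on cpu %d\n", omp_get_thread_num(), cpu);
        if (cpu >= 0)
            ok += 1;
    }
    return ok;
}
```

     Without -fopenmp the region runs once on the calling thread, so the check still works on a serial build.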
 29. NPB SP D-class Runtime
     The runtime using 121 cores is cut in half when running on 8 NumaConnect nodes compared to 4 NumaConnect nodes. As the picture on the right shows, the AMD Opteron™ 6300 series processor is organized as a multi-chip module (two CPU dies in one package). Each die has 8 cores but 4 FPUs, and the number of cores per FPU in use is one when running on 8 NumaConnect nodes and two when running on 4 NumaConnect nodes.
     [Chart: NPB-SP NC-OpenMP class D, 121 cores, 8 NumaConnect nodes, runtime vs. affinity: 998.86 s with affinity 0-127 and 475.28 s with affinity 0-127:2.]
 30. NPB SP D-class Runtime
     Runtime decreases when moving from 4 to 8 NumaConnect nodes using affinity settings and SP class D: the runtime using 64 cores is cut in half when running on 8 NumaConnect nodes compared to 4. In this test the number of FPUs is constant, and the scaling is good when using more nodes. The AMD Opteron™ 6300 series memory controller interface allows external connection of up to two memory channels per die; this memory interface is shared among all the CPU core pairs on each die.
     [Chart: runtime 800.69 s with affinity 0-127:2 vs. 548.13 s with affinity 0-255:4.]
 31. NPB SP D-class Runtime: MPI vs. OpenMP
     The runtime for NPB SP class D is the same for the MPI version as for NC-OpenMP when the number of cores is 121 or less; the MPI library overhead in the communication part of the runtime is not significant at these sizes.
     [Chart: OpenMP vs. MPI runtime with threads evenly distributed on all NumaConnect nodes, 16-121 processes: 1426.8, 829.64, 546.67, and 468.91 s.]
 32. NPB SP D-class Runtime: MPI vs. OpenMP
     The runtime for NPB SP class D is higher for MPI than for the NC-OpenMP version when the number of cores is 144 or higher; the MPI library overhead in the communication part of the runtime now plays a larger role.
     [Chart: runtime bound to only the first <n> cores for 144, 169, 196, 225, and 256 processes; the MPI runtimes (up to 953.39 s) exceed the OpenMP runtimes (down to 415.93 s) at every point.]
 33. NPB SP E-class Runtime
     The SP benchmark class E scales perfectly from 64 processes (using affinity 0-255:4) to 121 processes (using affinity 0-241:2). E-class problems are more representative of large shared memory systems and clusters. General statement: larger problems scale better.
     [Chart: NPB-SP NC-OpenMP class E, 8 NumaConnect nodes. Runtime: 15 242.11 s at 64 processes and 7246.13 s at 121 processes.]
 34. NPB SP E-class Runtime
     The SP benchmark class E scales well from 16 to 121 processes on a 32-node system with a total of 384 cores. NB: this test was run on an old system and is presented to show scaling, not absolute values.
     [Chart: NPB-SP NC-OpenMP class E, 32 NumaConnect nodes, time in seconds: 53 922.05 with affinity 0-383:24, 32 481.25 with 0-383:10, 19 740.37 with 0-383:6, and 11 247.03 with 0-363:3.]
 35. Big Data with Shared Memory
     "Any application requiring a large memory footprint can benefit from a shared memory computing environment."
     William W. Thigpen, Chief, Engineering Branch, NASA Advanced Supercomputing (NAS) Division
