Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos - Prof. Ramón Beivide

Today's computers, from the on-chip systems in mobile phones to the most powerful supercomputers, are parallel. The scale ranges from the 8 cores of a phone to the millions deployed by the largest supercomputers. The need to make the system's memory visible to every core is solved, regardless of scale, by interconnecting all the cores with adequate performance at a bounded cost. In the smallest systems (MPSoCs), memory is shared using on-chip networks that transport cache lines and coherence commands. A few MPSoCs are interconnected to form servers that use mechanisms to extend coherence and share memory under a CC-NUMA architecture. Tens of these servers are stacked in a rack, and a number of racks (up to hundreds) make up a datacenter or a supercomputer. The resulting global memory cannot be shared, but its contents can be transferred by sending messages over the system network. Networks are therefore critical, fundamental components of the memory architecture of computers of every class. This talk offers a reasoned view of the choices different manufacturers make when deploying the on-chip and system networks that interconnect today's computers.


  1. Ramón Beivide, Universidad de Cantabria
     Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos
     14 April 2016, Universidad Complutense de Madrid
     Outline
     1. Introduction
     2. Network Basis
     3. System networks
     4. On-chip networks (NoCs)
     5. Some current research
  2. 1. Intro: MareNostrum
     1. Intro: MareNostrum BSC, Infiniband FDR10 non-blocking Folded Clos (up to 40 racks)
     [Diagram: 36-port FDR10 leaf switches connected to Mellanox 648-port Infiniband FDR core switches; 40 iDataPlex racks / 3360 dx360 M4 nodes; FDR10 links; latency: 0.7 µs, bandwidth: 40 Gb/s]
  3. 1. Intro: Infiniband core switches
     1. Intro: Cost dominated by (optical) wires
  4. 1. Intro: Blades
  5. 1. Intro: Multicore E5-2670 Xeon Processor
     1. Intro: A row of servers in a Google DataCenter, 2012.
  6. 3. WSCs Array: Enrackable boards or blades + rack router
     Figure 1.1: Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7' rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).
     3. WSC Hierarchy
  7. 1. Intro: Cray Cascade (XC30, XC40)
  8. 1. Intro: An Architectural Model
     [Diagram: CPU1…CPUn and memories M1…Mn attached to the Interconnection Network through L/S, S/R and ATU units]
     1. Intro: What we need for one ExaFlop/s
     Networks are pervasive and critical components in Supercomputers, Datacenters, Servers and Mobile Computers. Complexity is moving from system networks towards on-chip networks: fewer nodes but more complex.
  9. Outline
     1. Introduction
     2. Network Basis
        Crossbars & Routers
        Direct vs Indirect Networks
     3. System networks
     4. On-chip networks (NoCs)
     5. Some current research
     2. Network Basis
     All networks are based on crossbar switches:
     • Switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N^2)
     • Has the property of being non-blocking (N! I/O permutations)
     • Bidirectional for exploiting communication locality
     • Minimize latency & maximize throughput
     [Diagram: two 8x8 crossbars, ports numbered 0-7]
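To make the quadratic-cost point above concrete, here is a minimal Python sketch (not from the slides) that counts crosspoints for a few switch radices; the 36 and 648 values simply echo the port counts mentioned for the MareNostrum switches.

```python
# Illustrative sketch: crosspoint count of an N x N crossbar grows as O(N^2).
def crossbar_crosspoints(n_ports: int) -> int:
    """Each input can be connected to each output, so an N x N crossbar
    needs N * N crosspoints."""
    return n_ports * n_ports

if __name__ == "__main__":
    for n in (8, 16, 36, 64, 648):  # 36 and 648 match switch radices mentioned earlier
        print(f"{n:4d}-port crossbar: {crossbar_crosspoints(n):7d} crosspoints")
```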
  10. 2. Blocking vs. Non-blocking
      • Cost reduction comes at the price of performance
        – Some networks have the property of being blocking (not N! permutations)
        – Contention is more likely to occur on network links
          › Paths from different sources to different destinations share one or more links
      [Diagram: a blocking topology vs. a non-blocking topology, 8 ports each]
      2. Switch or Router Microarchitecture
      [Diagram: five-stage switch datapath - physical channels and link control, input buffers (DEMUX), routing control unit with forwarding table, crossbar with crossbar control and arbitration unit, output buffers (MUX) and link control]
      Pipelined switch microarchitecture: IB (Input Buffering), RC (Route Computation), SA (Switch Arbitration), ST (Switch Traversal), OB (Output Buffering); the packet header traverses all five stages while payload fragments follow through IB, ST and OB.
      Matching the throughput of the internal switch datapath to the external link BW is the goal.
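As a rough illustration of the pipelined switch described above, the following sketch estimates zero-load packet latency assuming one cycle per pipeline stage and per link traversal; the function name and the one-cycle assumptions are mine, not from the slides.

```python
# Minimal sketch (a simplified model, not the router on the slide): zero-load
# latency of a packet through a pipelined switch, one cycle per stage and link.
PIPELINE_STAGES = ["IB", "RC", "SA", "ST", "OB"]  # as named on the slide

def zero_load_latency_cycles(hops: int, payload_flits: int, link_cycles: int = 1) -> int:
    """The header pays the full pipeline at every hop; the payload flits stream
    behind it, adding one cycle each at the end (ideal, contention-free)."""
    per_hop = len(PIPELINE_STAGES) + link_cycles
    return hops * per_hop + payload_flits

print(zero_load_latency_cycles(hops=3, payload_flits=4))
```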
  11. 2. Network Organization
      [Diagram: switches and end nodes arranged as Indirect (Centralized) and Direct (Distributed) Networks]
      2. Previous Myrinet core switches (Indirect, Centralized)
  12. 2. IBM BG/Q (Direct, Distributed)
      2. Network Organization
      [Diagrams: a 64-node system with 8-port switches and c = 4, and a 32-node system with 8-port switches]
      • As crossbars do not scale, they need to be interconnected to service an increasing number of endpoints.
      • Direct (Distributed) vs Indirect (Centralized) Networks
      • Concentration can be used to reduce network costs
        – "c" end nodes connect to each switch
        – Allows larger systems to be built from fewer switches and links
        – Requires larger switch degree
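A small sketch of the concentration trade-off just listed: with c end nodes per switch, fewer switches are needed but each must provide a larger degree. The 2D-mesh port count used below is an illustrative assumption.

```python
# Illustrative sketch: with concentration c, each switch attaches c end nodes,
# so an N-node system needs N / c switches, each needing c node ports plus
# whatever ports the chosen topology uses for switch-to-switch links.
import math

def switches_needed(nodes: int, c: int) -> int:
    return math.ceil(nodes / c)

def required_degree(c: int, topology_ports: int) -> int:
    # e.g. topology_ports = 4 for a 2D mesh/torus switch
    return c + topology_ports

for c in (1, 2, 4):
    print(f"c={c}: {switches_needed(64, c)} switches, "
          f"degree {required_degree(c, 4)} for a 2D mesh")
```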
  13. Outline
      1. Introduction
      2. Network Basis
      3. System networks
         Folded Clos
         Tori
         Dragonflies
      4. On-chip networks (NoCs)
      5. Some current research
      3. MareNostrum BSC, Infiniband FDR10 non-blocking Folded Clos (up to 40 racks)
      [Diagram: 36-port FDR10 leaf switches connected to Mellanox 648-port Infiniband FDR core switches; 40 iDataPlex racks / 3360 dx360 M4 nodes; FDR10 links; latency: 0.7 µs, bandwidth: 40 Gb/s]
  14. 3. Network Topology
      Centralized Switched (Indirect) Networks
      [Diagram: 16-port crossbar network, ports 0-15]
      3. Network Topology
      Centralized Switched (Indirect) Networks
      [Diagram: 16-port, 3-stage Clos network, ports 0-15]
  15. 3. Network Topology
      Centralized Switched (Indirect) Networks
      [Diagram: 16-port, 5-stage Clos network, ports 0-15]
      3. Network Topology
      Centralized Switched (Indirect) Networks
      [Diagram: 16-port, 7-stage Clos network = Benes topology, ports 0-15]
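Assuming the 16-port, 7-stage Benes example above is built recursively from 2x2 switches, here is a short sketch (my own illustration) of how its stage and switch counts grow with port count.

```python
# Hedged sketch: stage and switch counts for an N-port Benes network built
# recursively from 2x2 switches (consistent with the 16-port, 7-stage example).
import math

def benes_stages(n_ports: int) -> int:
    return 2 * int(math.log2(n_ports)) - 1

def benes_switches(n_ports: int) -> int:
    # Each stage holds N/2 two-by-two switches.
    return benes_stages(n_ports) * (n_ports // 2)

for n in (8, 16, 64):
    print(f"N={n}: {benes_stages(n)} stages, {benes_switches(n)} 2x2 switches")
```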
  16. 3. Network Topology
      Centralized Switched (Indirect) Networks
      • Bidirectional MINs
      • Increase modularity
      • Reduce hop count, d
      • Folded Clos network
        – Nodes at tree leaves
        – Switches at tree vertices
        – Total link bandwidth is constant across all tree levels, with full bisection bandwidth
      Folded Clos = Folded Benes ≠ Fat tree network !!!
      [Diagram: 16-port folded Clos with the network bisection marked]
      3. Other DIRECT System Network Topologies
      Distributed Switched (Direct) Networks
      [Diagrams: 2D mesh or grid of 16 nodes, 2D torus of 16 nodes, hypercube of 16 nodes (16 = 2^4, so n = 4), with the network bisection marked]
      Network bisection ≤ full bisection bandwidth!
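A hedged sketch of the bisection comparison implied above, counting the links cut by a balanced bisection of a square 2D mesh, a square 2D torus and a hypercube; the formulas are the standard ones, not taken from the slides.

```python
# Hedged sketch: link counts across the bisection of the direct topologies shown,
# assuming N nodes arranged as a square 2D mesh/torus or an n-dimensional hypercube.
import math

def bisection_links(topology: str, n_nodes: int) -> int:
    side = int(math.isqrt(n_nodes))
    if topology == "2d-mesh":
        return side          # one row of links crosses the cut
    if topology == "2d-torus":
        return 2 * side      # wrap-around links double the cut
    if topology == "hypercube":
        return n_nodes // 2  # one dimension's links cross the cut
    raise ValueError(topology)

for topo in ("2d-mesh", "2d-torus", "hypercube"):
    print(f"{topo:10s} with 16 nodes: {bisection_links(topo, 16)} links across the bisection")
```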
  17. 3. IBM BlueGene/L/P Network
      Prismatic 32x32x64 Torus (mixed-radix networks; see the sketch below)
      BlueGene/P: 32x32x72 in maximum configuration
      Mixed-radix prismatic Tori also used by Cray
      3. IBM BG/Q
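The sketch referred to above: diameter and approximate average distance of the mixed-radix tori, computed with the standard per-dimension formulas (my own illustration, not from the slides).

```python
# Hedged sketch: diameter and (approximate) average distance of a mixed-radix
# torus, e.g. the 32x32x64 BlueGene/L/P configuration mentioned above.
def torus_diameter(dims):
    return sum(k // 2 for k in dims)

def torus_avg_distance(dims):
    # For a ring of k nodes the average distance is roughly k/4; sum over dimensions.
    return sum(k / 4 for k in dims)

for dims in ((32, 32, 64), (32, 32, 72)):
    print(dims, "diameter:", torus_diameter(dims),
          "avg distance:", torus_avg_distance(dims))
```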
  18. 3. IBM BG/Q
      3. BG Network Routing
      [Diagram: X, Y and Z wires]
      Adaptive Bubble Routing
      ATC-UC Research Group
  19. 3. Fujitsu Tofu Network
      3. More Recent Network Topologies
      Distributed Switched (Direct) Networks
      • Fully-connected network: all nodes are directly connected to all other nodes using bidirectional dedicated links
      [Diagram: fully-connected network of 8 nodes]
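To show why fully-connected networks do not scale, a tiny sketch of their degree and link count; the node counts below are arbitrary examples, not from the slides.

```python
# Hedged sketch: cost of a fully-connected (complete-graph) network as sketched
# above - every node needs degree N-1 and the system needs N*(N-1)/2 links.
def complete_graph_links(n_nodes: int) -> int:
    return n_nodes * (n_nodes - 1) // 2

for n in (8, 64, 1024):
    print(f"N={n:5d}: degree {n - 1:4d}, {complete_graph_links(n):7d} bidirectional links")
```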
  20. 3. IBM PERCS
  21. 3. IBM PERCS
      3. Dragonfly Interconnection Network
      Organized as groups of routers. Parameters:
      • a: routers per group
      • p: nodes per router
      • h: global links per router
      • Well-balanced dragonfly [1]: a = 2p = 2h
      Intra-group: local links, complete graph. Inter-group: global links, complete graph.
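Using the parameter names on the slide (and the construction of [1]), a short sketch of how group size and maximum system size follow from p, a and h; the concrete values below are illustrative, not PERCS or Cascade configurations.

```python
# Hedged sketch (parameter names as on the slide and in [1]): group size and
# maximum system size of a dragonfly with p nodes per router, a routers per
# group and h global links per router, one global link between each group pair.
def dragonfly_size(p: int, a: int, h: int):
    nodes_per_group = a * p
    groups = a * h + 1  # complete graph among groups
    return nodes_per_group, groups, nodes_per_group * groups

# Well-balanced dragonfly: a = 2p = 2h, e.g. p = h = 4, a = 8.
print(dragonfly_size(p=4, a=8, h=4))  # -> (32, 33, 1056)
```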
  22. 3. Dragonfly Interconnection Network
      Minimal routing
      • Longest path 3 hops: local-global-local
      • Good performance under UN traffic
      Adversarial traffic [1]
      • ADV+N: nodes in group i send traffic to group i+N
      • Saturation of the global link
      3. Dragonfly Interconnection Network
      Valiant routing [2]
      • Randomly selects an intermediate group to misroute packets
      • Avoids the saturated channel
      • Longest path 5 hops: local-global-local-global-local
      [1] J. Kim, W. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," ISCA '08.
      [2] L. Valiant, "A scheme for fast parallel communication," SIAM Journal on Computing, vol. 11, p. 350, 1982.
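A schematic sketch of the two routing options above, expressed simply as hop sequences; the helper names and the random selection are illustrative, not the actual router implementation.

```python
# Hedged sketch of the two routing choices described above, as hop sequences.
import random

def minimal_path(src_group: int, dst_group: int):
    # Local hop to the router holding the right global link, one global hop, local hop.
    return ["local", "global", "local"] if src_group != dst_group else ["local"]

def valiant_path(src_group: int, dst_group: int, n_groups: int):
    mid = random.randrange(n_groups)  # random intermediate group
    if mid in (src_group, dst_group):
        return minimal_path(src_group, dst_group)
    # Hop to the intermediate group, then route minimally from there:
    # at most local-global-local-global-local = 5 hops.
    return ["local", "global"] + minimal_path(mid, dst_group)

print(minimal_path(0, 5))      # up to 3 hops
print(valiant_path(0, 5, 33))  # up to 5 hops
```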
  23. 3. Cray Cascade, electrical supernode
      3. Cray Cascade, system and routing
  24. Outline
      1. Introduction
      2. Network Basis
      3. System networks
      4. On-chip networks (NoCs)
         Rings
         Meshes
      5. Some current research
      4. On-Chip local interconnects
      [SEM photo of local levels interconnect]
  25. 4. On-Chip global interconnects
      Global levels interconnect
      4. Metal Layers
  26. 4. Bumps & Balls
      4. 3D (& 2.5D) Stacking & Silicon Photonics
      Multiple integration with 3D stacking…
      "3M, IBM team to develop 3D IC adhesive", EETimes India
      STMicroelectronics & CEA
  27. 4. Rings from ARM
      4. Rings from Intel
  28. 4. Rings (Direct or Indirect?)
      Folded ring: lower maximum physical link length
      • Bidirectional ring networks (folded) - see the sketch below
        – N switches (3 × 3) and N bidirectional network links
        – Simultaneous packet transport over disjoint paths
        – Packets must hop across intermediate nodes
        – Shortest direction usually selected (N/4 hops, on average)
        – Bisection Bandwidth???
      4. Meshes and Tori
      Distributed Switched (Direct) Networks
      [Diagrams: 2D torus of 16 nodes and 2D mesh or grid of 16 nodes, with the network bisection marked]
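The sketch referred to above: average hop count and bisection links of a bidirectional ring under the usual assumptions (shortest direction chosen, uniform traffic); it is an illustration, not a result from the slides.

```python
# Hedged sketch answering the "Bisection Bandwidth???" question for a
# bidirectional ring of N nodes (illustrative model, uniform traffic).
def ring_avg_hops(n_nodes: int) -> float:
    # Shortest direction chosen: roughly N/4 hops on average.
    return n_nodes / 4

def ring_bisection_links(n_nodes: int) -> int:
    # Cutting the ring in half always severs exactly two bidirectional links,
    # independent of N - which is why rings scale poorly in bisection bandwidth.
    return 2

for n in (8, 16, 64):
    print(f"N={n:3d}: ~{ring_avg_hops(n):5.1f} average hops, "
          f"{ring_bisection_links(n)} links across the bisection")
```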
  29. 4. Meshes from Tilera
      4. Mesh from Phytium Mars Architecture
      These images were taken from the slides presented at Hot Chips 2015
      • L1:
        – Separate L1 Icache and L1 Dcache
        – 32 KB Icache
        – 32 KB Dcache
      • 6 outstanding loads
      • 4 cycles latency from load to use
      • L2:
        – 16 L2 banks of 4 MB
        – 32 MB of shared L2
      • L3:
        – 8 L3 arrays of 16 MB
        – 128 MB of L3
      • Memory Controllers:
        – 16 DDR3-1600 channels
      • 2x16-lane PCIe-3.0
      • Directory-based cache coherency
        – 16 Directory Control Units (DCU)
      • MOESI-like cache coherence protocol
  30. 4. Phytium Mars NoC
      This image was taken from the slides presented at Hot Chips 2015
      • Switches with 6 bidirectional ports
      • 4 physical channels for cache coherence
      • 3 cycles for each hop (see the sketch below)
      • 384 GB/s each cell
      4. Meshes from Intel Knights Landing
      Intel Knights Landing – 3 options
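A back-of-the-envelope estimate (mine, not Phytium's) of zero-load latency on a 2D mesh using the 3-cycles-per-hop figure quoted above; the mesh size and endpoints are illustrative assumptions.

```python
# Hedged sketch: zero-load latency estimate on a 2D mesh using the
# 3-cycles-per-hop figure quoted above (dimensions are illustrative).
def mesh_hops(src, dst):
    # Dimension-order (XY) distance between two (x, y) tile coordinates.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def zero_load_cycles(src, dst, cycles_per_hop: int = 3) -> int:
    return mesh_hops(src, dst) * cycles_per_hop

# Corner-to-corner on an 8x8 mesh: 14 hops -> 42 cycles before any contention.
print(zero_load_cycles((0, 0), (7, 7)))
```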
  31. 4. Intel Knights Landing
  32. 4. Intel Knights Landing
  33. Outline
      1. Introduction
      2. Network Basis
      3. System networks
      4. On-chip networks (NoCs)
      5. Some current research
      5. Some research on NUCA-based CMP Models
  34. 5. Full-system simulation including concentration
      GEM5 + BookSim full-system simulation platform parameters:
      • ISA: X86
      • Number of Cores: 64
      • CPU Model: Out of Order
      • CPU Frequency: 2 GHz
      • Cache Coherence Protocol: MESI
      • L1 Instruction Size: 32 KB
      • L1 Data Size: 64 KB
      • Shared distributed L2: 256 KB per Core
      • # Memory Controllers: 4
      • Network Frequency: 1 GHz
      • Router Pipeline Stages: 4
      • Physical Networks: 3
      • Buffer Size: 10 flits
      • Link Width: 64 bits
      • Topologies: 8x8 mesh, torus and FBFLY; 4x4 FBFLY with C=4
      • Applications used: PARSEC benchmarks
      5. Topology comparison
      Three different topologies are considered (N = 16), compared on degree (ports, lower is better), diameter (maximum distance, lower is better), average distance (lower is better) and bisection bandwidth (links, higher is better):
      • 2D Mesh – Advantages: low degree, shortest links. Disadvantages: largest distances, lowest bisection bandwidth.
      • 2D Torus – Advantages: low degree, symmetry, better properties. Disadvantages: folding, deadlock.
      • 2D FBFLY – Advantages: symmetry, best properties, larger concentration. Disadvantages: highest costs, non-uniform link lengths.
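To reproduce the qualitative comparison above, the following brute-force sketch computes degree, diameter, average distance and bisection links for 16-node instances of the three topologies. It models only the switch-to-switch links (no concentration) and is my own illustration, not the evaluation setup of the slides.

```python
# Hedged sketch: brute-force comparison of 16-node mesh, torus and flattened
# butterfly (FBFLY) topologies using plain BFS over the switch graph.
from collections import deque
from itertools import product

def build_adjacency(topology, k):
    adj = {node: set() for node in product(range(k), repeat=2)}
    for (x, y) in adj:
        if topology in ("mesh", "torus"):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if topology == "torus":
                    adj[(x, y)].add((nx % k, ny % k))
                elif 0 <= nx < k and 0 <= ny < k:
                    adj[(x, y)].add((nx, ny))
        else:  # fbfly: complete graph along each row and each column
            adj[(x, y)] |= {(i, y) for i in range(k) if i != x}
            adj[(x, y)] |= {(x, j) for j in range(k) if j != y}
    return adj

def properties(topology, k=4):
    adj = build_adjacency(topology, k)
    distances = []
    for src in adj:                      # BFS from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        distances += [d for node, d in dist.items() if node != src]
    degree = max(len(neigh) for neigh in adj.values())  # topology ports only
    # Count bidirectional links crossing a cut between the two column halves.
    bisection = sum(1 for u in adj for v in adj[u] if u[0] < k // 2 <= v[0])
    return degree, max(distances), sum(distances) / len(distances), bisection

for topo in ("mesh", "torus", "fbfly"):
    print(topo, properties(topo))  # (degree, diameter, avg distance, bisection links)
```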
  35. 5. Full-system simulation
      Normalized execution time and network latencies:
      • Average latency has an impact on AMAT (see the AMAT sketch below).
      • High latencies can degrade execution times if the affected data are critical.
      5. Router Power and Area
      Router leakage power and area evaluation:
      • Buffers are the most power-consuming part of the router.
      • Crossbars and allocators grow quadratically with the number of ports.
      • The load in these simulations is low; hence, leakage power is the dominant one.
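The AMAT sketch referred to above: the usual average-memory-access-time decomposition, showing where the network round trip enters the miss penalty. All numbers are illustrative, not results from these simulations.

```python
# Hedged sketch: how average network latency enters AMAT (average memory access
# time). All numbers below are illustrative, not taken from the simulations.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

l1_hit = 4                   # cycles
l1_miss_rate = 0.05
l2_access = 20               # cycles at the remote L2 bank
network_round_trip = 2 * 30  # request + reply traversals of the NoC

print(amat(l1_hit, l1_miss_rate, l2_access + network_round_trip))
```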
  36. 5. Router Power and Area
      Network leakage power evaluation:
      • FBFLY can manage higher concentrations because of its higher bisection bandwidth (BB).
      5. OmpSs vs. pThreads
