
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)

Presentation given at EuroMPI'13 by Jeff Squyres describing the flexible process affinity system in Open MPI 1.7.2 (and later).


  1. Advancing Application Process Affinity Experimentation: Open MPI's LAMA-Based Affinity Interface
     Jeff Squyres and Joshua Hursey, September 18, 2013
  2. Locality Matters
     • Multiple talks here at EuroMPI’13 about network locality
     • Goals:
       - Minimize data transfer distance
       - Reduce network congestion and contention
     • …this also matters inside the server, too!
  3. [Figure: hwloc topology diagram of the example server]
     Intel Xeon E5-2690 (“Sandy Bridge”): 2 sockets, 8 cores per socket, 64GB per socket (128GB total), hyperthreading enabled
     • Each socket is one NUMA node (P#0, P#1) with a shared L3 (20MB)
     • Each core has its own L2 (256KB), L1d (32KB), and L1i (32KB), plus two PUs (e.g., Core P#0 on Socket P#0 holds PU P#0 and PU P#16)
     • Socket P#0: PUs P#0-7 and P#16-23; 1G NICs eth0-eth3 (PCI 8086:1521); 10G NICs eth4-eth5 (PCI 1137:0043); PCI 102b:0522
     • Socket P#1: PUs P#8-15 and P#24-31; disks sda and sdb (PCI 1000:005b); 10G NICs eth6-eth7 (PCI 1137:0043)
  4. A User’s Playground
     The intent of this work is to provide a mechanism that allows users to explore the process-placement space within the scope of their own applications.
  5. LAMA
     • Locality-Aware Mapping Algorithm (LAMA)
       - Supports a wide range of regular mapping patterns.
     • Adapts at runtime to available hardware
       - Supports homogeneous and heterogeneous systems.
     • Extensible to any depth of server topology
       - Naturally supports potentially deeper topologies of future server architectures.
  6. LAMA Inspiration
     • Drawn from much prior work
     • Most notably, heavily inspired by BlueGene/P and /Q mapping systems
       - LAMA’s mapping specification is similar
  7. Launching MPI Applications
     • Three steps in MPI process placement
       1. Mapping
       2. Ordering
       3. Binding
     • Let's discuss how these work in Open MPI
  8. 1. Mapping
     • Create a layout of processes-to-resources
     [Figure: 16 servers, with MPI processes laid out across them]
  9. Mapping
     • MPI's runtime must create a map, pairing processes-to-processors (and memory).
     • Basic technique:
       - Gather hwloc topologies from allocated nodes.
       - Mapping agent then makes a plan for which resources are assigned to processes
  10. Mapping Agent
      • Act of planning mappings:
        - Specify which process will be launched on each server
        - Identify if any hardware resource will be oversubscribed
      • Processes are mapped to the resolution of a single processing unit (PU)
        - Smallest unit of allocation: hardware thread
        - In HPC, usually the same as a processor core
  11. Oversubscription
      • Common / usual definition:
        - When a single PU is assigned more than one process
      • Complicating the definition:
        - Some applications may need more than one PU per process (multithreaded applications)
      • How can the user express what their application means by “oversubscription”?
  12. 2. Ordering: By “Slot”
      Assigning MCW ranks to mapped processes
      [Figure: grid of servers; consecutive ranks fill one server before moving to the next (0-15 on the first server, 16-31 on the second, and so on)]
  13. 2. Ordering: By Node
      Assigning MCW ranks to mapped processes
      [Figure: grid of servers; consecutive ranks go round-robin across servers (0, 16, 32, …, 240 on the first server; 1, 17, 33, …, 241 on the second; and so on)]
  14. Ordering
      • Each process must be assigned a unique rank in MPI_COMM_WORLD
      • Two common types of ordering:
        - natural: The order in which processes are mapped determines their rank in MCW
        - sequential: The processes are sequentially numbered starting at the first processing unit, and continuing until the last processing unit
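The two rank layouts on the "By Slot" and "By Node" slides can be sketched in a few lines of Python. This is only an illustration, not Open MPI code: the function names are hypothetical, and the 16 servers with 16 processes each are read off the rank numbers visible in the slide figures.

```python
def ranks_by_slot(num_servers, procs_per_server):
    """ranks[server][slot]: consecutive ranks fill one server before the next."""
    return [
        [server * procs_per_server + slot for slot in range(procs_per_server)]
        for server in range(num_servers)
    ]


def ranks_by_node(num_servers, procs_per_server):
    """ranks[server][slot]: consecutive ranks go round-robin across servers."""
    return [
        [slot * num_servers + server for slot in range(procs_per_server)]
        for server in range(num_servers)
    ]


if __name__ == "__main__":
    # Assumed sizes: 16 servers x 16 processes, inferred from the figures.
    by_slot = ranks_by_slot(16, 16)
    by_node = ranks_by_node(16, 16)
    print("by slot, second server:", by_slot[1][:4])
    print("by node, first server: ", by_node[0][:4])
```

Both layouts hand out every rank exactly once; they differ only in which neighbors end up sharing a server.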
  15. 3. Binding
      • Launch processes and enforce the layout
      [Figure: hwloc topology diagrams (physical indexes) of several servers, with ranks 0-15 shown on the first server, 16-31 on the second, and further servers continuing the sequence]
  16. Binding
      • Process-launching agent working with the OS to limit where each process can run:
        1. No restrictions
        2. Limited set of restrictions
        3. Specific resource restrictions
      • “Binding width”
        - The number of PUs to which a process is bound
  17. Command Line Interface (CLI)
      • 4 levels of abstraction for the user
        - Level 1: None
        - Level 2: Simple, common patterns
        - Level 3: LAMA process layout regular patterns
        - Level 4: Irregular patterns
  18. CLI: Level 1 (none)
      • No mapping or binding options specified
        - May or may not specify the number of processes to launch (-np)
        - If not specified, default to the number of cores available in the allocation
        - One process is mapped to each core in the system in a "by-core" style
        - Processes are not bound
      • …for backwards compatibility reasons
  19. CLI: Level 2 (common)
      • Simple, common patterns for mapping and binding
        - Specify mapping pattern with: --map-by X (e.g., --map-by socket)
        - Specify binding option with: --bind-to Y (e.g., --bind-to core)
        - All of these options are translated to Level 3 options for processing by LAMA (full list of X / Y values shown later)
  20. CLI: Level 3 (regular patterns)
      • LAMA process layout regular patterns
        - Power users wanting something unique for their application
        - Four MCA run-time parameters:
          • rmaps_lama_map: Mapping process layout
          • rmaps_lama_bind: Binding width
          • rmaps_lama_order: Ordering of MCW ranks
          • rmaps_lama_mppr: Maximum allowable number of processes per resource (oversubscription)
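As a rough sketch of how these parameters reach the command line, the helper below assembles an mpirun argument list using only the `--mca` flags that appear in the "Report Bindings" slide of this deck. The helper name and its signature are hypothetical, and it deliberately leaves out rmaps_lama_order because the deck does not show its accepted values.

```python
import shlex


def lama_mpirun_argv(program, np, lama_map, lama_bind, lama_mppr,
                     report_bindings=True):
    """Build an mpirun argv that selects LAMA and sets three of its parameters.

    Hypothetical helper: it only strings together flags seen in the slides.
    """
    argv = [
        "mpirun", "-np", str(np),
        "--mca", "rmaps", "lama",            # select the LAMA mapper
        "--mca", "rmaps_lama_bind", lama_bind,
        "--mca", "rmaps_lama_map", lama_map,
        "--mca", "rmaps_lama_mppr", lama_mppr,
    ]
    if report_bindings:
        argv.append("--report-bindings")
    argv.append(program)
    return argv


if __name__ == "__main__":
    # Prints a shell-quoted command line; nothing is executed.
    print(shlex.join(lama_mpirun_argv("hello_world", 4, "nbsch", "1c", "1:c")))
```

Building the list first, rather than a single string, keeps each parameter value a separate argument if the command is later handed to a process launcher.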
  21. rmaps_lama_map (map)
      • Takes as an argument the "process layout"
        - A series of nine tokens, allowing 9! (362,880) mapping permutation options.
        - Preferred iteration order for LAMA:
          • innermost iteration specified first
          • outermost iteration specified last
  22. Example system
      2 servers (nodes), 4 sockets, 2 cores, 2 PUs
  23-27. rmaps_lama_map (map)
      • map=scbnh (a.k.a., by socket, then by core)
      [Figure sequence: the same slide repeated across five animation frames]
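The "innermost iteration specified first" rule can be sketched as a nested loop over the tokens of a layout string. This is an illustrative reading of the slides, not the LAMA implementation. The slides confirm only that `s` is socket and `c` is core; treating `n` as node, `b` as board, and `h` as hardware thread is an assumption here, as are the per-node interpretation of "4 sockets" and the single board per node. The sketch handles single-letter tokens only and ignores mppr limits and resource availability.

```python
from itertools import product
from math import factorial

# Assumed sizes for the example system: 2 nodes, 1 board per node,
# 4 sockets per node, 2 cores per socket, 2 hardware threads per core.
EXAMPLE_SIZES = {"n": 2, "b": 1, "s": 4, "c": 2, "h": 2}


def iteration_order(layout, sizes):
    """Yield one {token: index} dict per slot, in mapping order.

    The first token in `layout` varies fastest (innermost loop) and the
    last token varies slowest (outermost loop).
    """
    outer_to_inner = list(layout)[::-1]
    # itertools.product varies its LAST argument fastest, so the ranges are
    # passed outermost-first to make the first layout token the innermost.
    for combo in product(*(range(sizes[t]) for t in outer_to_inner)):
        yield dict(zip(outer_to_inner, combo))


if __name__ == "__main__":
    print("permutations of nine tokens:", factorial(9))
    for slot in list(iteration_order("scbnh", EXAMPLE_SIZES))[:5]:
        print(slot)
```

With `scbnh`, the socket index advances on every step and the core index advances only after all sockets have been visited, which is the "by socket, then by core" pattern named on the slide. Running the same function on `nbsch` with a single two-socket node alternates sockets before advancing the core, consistent with the rank placement in the `--report-bindings` output shown later in the deck.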
  28. rmaps_lama_bind (bind)
      • “Binding width” and layer
      • Example: bind=3c (3 cores)
      [Figure: hwloc topology diagram of Socket P#0 (8 cores, 2 PUs each), annotated “bind = 3c”]
  29. rmaps_lama_bind (bind)
      • “Binding width” and layer
      • Example: bind=2s (2 sockets)
      [Figure: hwloc topology diagrams of the example server, annotated “bind = 2s”]
  30. rmaps_lama_bind (bind)
      • “Binding width” and layer
      • Example: bind=12 (all PUs in an L2)
      [Figure annotated “bind = 12”]
  31. rmaps_lama_bind (bind)
      • “Binding width” and layer
      • Example: bind=1N (all PUs in NUMA locality)
      [Figure annotated “bind = 1N”]
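To connect these bind values to the earlier definition of binding width (the number of PUs to which a process is bound), here is a small sketch that splits a value such as `3c` into a count and a layer and multiplies by the PUs in that layer. The function names are hypothetical. The PU counts come from the example server in this deck (2 PUs per core, 16 per socket, 16 per NUMA node); only the layers `c`, `s`, and `N` are covered.

```python
import re

# PUs contained in one instance of each layer on the example server:
# 8 cores x 2 PUs per socket, and one socket per NUMA node.
PUS_PER_LAYER = {"c": 2, "s": 16, "N": 16}


def parse_bind(spec):
    """Split a bind value such as '3c' into (count, layer)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)", spec)
    if match is None:
        raise ValueError(f"unrecognized bind value: {spec!r}")
    return int(match.group(1)), match.group(2)


def binding_width_in_pus(spec, pus_per_layer=PUS_PER_LAYER):
    """Number of PUs a process may run on under the given bind value."""
    count, layer = parse_bind(spec)
    return count * pus_per_layer[layer]


if __name__ == "__main__":
    for spec in ("3c", "2s", "1N"):
        print(spec, "->", binding_width_in_pus(spec), "PUs")
```

On this server `2s` spans the whole machine, so it restricts nothing in practice, while `1N` and `1s` would cover the same 16 PUs.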
  32. rmaps_lama_order (order)
      • Select which ranks are assigned to processes in MCW
        - Natural order for map-by-node (default)
        - Sequential order for any mapping
      • There are other possible orderings, but no one has asked for them yet…
  33. rmaps_lama_mppr (mppr)
      • mppr (mip-per) sets the Maximum number of allowable Processes Per Resource
        - User-specified definition of oversubscription
      • Comma-delimited list of <#:resource>
        - 1:c → At most one process per core
        - 1:c,2:s → At most one process per core, and at most two processes per socket
  34. MPPR
      • 1:c → At most one process per core
      [Figure: hwloc topology diagram of Socket P#0 (8 cores, 2 PUs each)]
  35. MPPR
      • 1:c,2:s → At most one process per core and two processes per socket
      [Figure: hwloc topology diagram of Socket P#0 (8 cores, 2 PUs each)]
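The mppr string acts as a user-supplied definition of oversubscription, which the following sketch tries to make concrete: it parses the comma-delimited `<#:resource>` list and checks a set of placements against the limits. Everything here is hypothetical illustration rather than Open MPI code, and it models only three levels (`n` for node is an assumption; `s` and `c` are socket and core as in the slides).

```python
from collections import Counter

# Simplified hierarchy, outermost first: node > socket > core.
HIERARCHY = ["n", "s", "c"]


def parse_mppr(spec):
    """Turn '1:c,2:s' into {'c': 1, 's': 2}."""
    limits = {}
    for item in spec.split(","):
        count, resource = item.split(":")
        limits[resource] = int(count)
    return limits


def resource_id(placement, level):
    """Identify one resource instance by its path, e.g. (node, socket, core)."""
    depth = HIERARCHY.index(level) + 1
    return tuple(placement[key] for key in HIERARCHY[:depth])


def within_mppr(placements, limits):
    """True if no resource holds more processes than its limit allows."""
    for level, cap in limits.items():
        usage = Counter(resource_id(p, level) for p in placements)
        if any(used > cap for used in usage.values()):
            return False
    return True


if __name__ == "__main__":
    limits = parse_mppr("1:c,2:s")
    two_on_one_socket = [{"n": 0, "s": 0, "c": 0}, {"n": 0, "s": 0, "c": 1}]
    print(within_mppr(two_on_one_socket, limits))
```

A mapper built on this idea would skip any candidate slot whose addition makes `within_mppr` false, which is how a multithreaded application could reserve a whole core or socket per process.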
  36. CLI: Level 4 (rankfile)
      • Complete specification of processor-to-resource mapping description
        - Bypasses LAMA
      • Not described in the paper
  37. Level 2 to Level 3 Chart
  38. Remember the prior example?
      • -np 24 -mppr 2:c -map scbnh
  39. Same example, different mapping
      • -np 24 -mppr 2:c -map nbsch
  40. Report Bindings
      • Displays prettyprint representation of the binding actually used for each process.
        - Visual feedback = quite helpful when exploring

      mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c --report-bindings hello_world

      MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
      MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
      MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
      MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
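When comparing many mappings, it can help to read these lines programmatically. The sketch below is a hypothetical parser written against the four sample lines on this slide only; the real output format may vary between Open MPI versions. It interprets each bracketed group as one socket, each slash-separated field as one core, and a `B` as a hardware thread the process is bound to.

```python
import re


def parse_binding_line(line):
    """Return (rank, [(socket, core_within_socket), ...]) for one report line.

    Assumes the layout seen on the slide: 'MCW rank N bound to ...: [..][..]'.
    """
    rank = int(re.match(r"MCW rank (\d+)", line).group(1))
    mask = line.rsplit(": ", 1)[1]
    bound = []
    # Each [...] group is one socket; cores inside it are separated by '/'.
    for socket, cores in enumerate(re.findall(r"\[([^\[\]]*)\]", mask)):
        for core, hw_threads in enumerate(cores.split("/")):
            if "B" in hw_threads:
                bound.append((socket, core))
    return rank, bound


if __name__ == "__main__":
    sample = ("MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: "
              "[../../../../../../../..][BB/../../../../../../..]")
    print(parse_binding_line(sample))
```

The core index returned here counts within the socket, so the first core of socket 1 comes back as `(1, 0)` even though the textual part of the line calls it core 8.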
  41. Future Work
      • Available in Open MPI v1.7.2 (and later)
      • Open questions to users:
        - Are more flexible ordering options useful?
        - What common mapping patterns are useful?
        - What additional features would you like to see?
  42. Thank You
