The document discusses Open MPI's Locality-Aware Mapping Algorithm (LAMA) interface for controlling process placement on parallel machines. It describes how LAMA allows users to specify regular mapping patterns of processes to resources. It also outlines the three main steps in MPI process placement with LAMA: 1) mapping processes to resources, 2) ordering processes, and 3) binding processes during launch according to the mapping. The goal is to provide a mechanism for exploring different process placements to minimize communication costs.
Open MPI Explorations in Process Affinity (EuroMPI'13 presentation)
1. Advancing Application Process Affinity Experimentation:
Open MPI's LAMA-Based Affinity Interface
Jeff Squyres
Joshua Hursey
September 18, 2013
2. Locality Matters
• Multiple talks here at EuroMPI’13 about
network locality
• Goals:
Minimize data transfer distance
Reduce network congestion and contention
• …this also matters inside the server, too!
3. [Figure: hwloc (lstopo) topology of the example server]
Intel Xeon E5-2690 (“Sandy Bridge”): 2 sockets, 8 cores and 64GB per socket
Machine (128GB)
NUMANode P#0 (64GB), Socket P#0: shared L3 (20MB); Cores P#0-7, each with its own L2 (256KB), L1d (32KB), L1i (32KB) and two PUs (PU P#0-7 plus PU P#16-23)
PCI 8086:1521: eth0, eth1, eth2, eth3 (1G NICs); PCI 1137:0043: eth4, eth5 (10G NICs); PCI 102b:0522
NUMANode P#1 (64GB), Socket P#1: shared L3 (20MB); Cores P#0-7, each with its own L2 (256KB), L1d (32KB), L1i (32KB) and two PUs (PU P#8-15 plus PU P#24-31)
PCI 1000:005b: sda, sdb; PCI 1137:0043: eth6, eth7 (10G NICs)
Hyperthreading enabled
4. A User’s Playground
The intent of this work is to provide a mechanism that
allows users to explore the process-placement space
within the scope of their own applications.
5. LAMA
• Locality-Aware Mapping Algorithm (LAMA)
Supports a wide range of regular mapping
patterns.
• Adapts at runtime to available hardware
Supports homogeneous and heterogeneous
systems.
• Extensible to any depth of server topology
Naturally supports potentially deeper
topologies of future server architectures.
6. LAMA Inspiration
• Drawn from much prior work
• Most notably, heavily inspired by
BlueGene/P and /Q mapping systems
LAMA’s mapping specification is similar
7. Launching MPI Applications
• Three steps in MPI process placement
1. Mapping
2. Ordering
3. Binding
• Let's discuss how these work in Open MPI
9. Mapping
• MPI's runtime must create a map, pairing
processes-to-processors (and memory).
• Basic technique:
Gather hwloc topologies from allocated nodes.
Mapping agent then makes a plan for which
resources are assigned to processes
10. Mapping Agent
• Act of planning mappings:
Specify which process will be launched on
each server
Identify if any hardware resource will be
oversubscribed
• Processes are mapped to the resolution of
a single processing unit (PU)
Smallest unit of allocation: hardware thread
In HPC, usually the same as a processor core
11. Oversubscription
• Common / usual definition:
When a single PU is assigned more than one
process
• Complicating the definition:
Some applications may need more than one
PU per process (multithreaded applications)
• How can the user express what their
application means by “oversubscription”?
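The two definitions above can be made concrete with a minimal Python sketch. It is illustrative only and is not Open MPI code: the function name `oversubscribed_pus` and the dictionary layout (process id mapped to the list of PUs it uses) are assumptions made for this example.

```python
from collections import Counter

def oversubscribed_pus(assignments):
    """Return the PUs that host more than one process.

    assignments: dict mapping a process id to the list of PU ids it uses.
    (Hypothetical data layout, chosen only for this illustration.)
    """
    load = Counter(pu for pus in assignments.values() for pu in pus)
    return sorted(pu for pu, count in load.items() if count > 1)

# Single-threaded processes: one PU each, so no PU is shared.
single = {0: [0], 1: [1], 2: [2]}

# Two multithreaded processes that each want two PUs; they overlap on PU 1,
# so PU 1 is oversubscribed under the common definition.
threaded = {0: [0, 1], 1: [1, 2]}

print(oversubscribed_pus(single))    # []
print(oversubscribed_pus(threaded))  # [1]
```

A per-PU count like this is only the common definition; a user whose processes need several PUs each may want the limit expressed per core or per socket instead.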
14. Ordering
• Each process must be assigned a unique
rank in MPI_COMM_WORLD (MCW)
• Two common types of ordering:
natural
• The order in which processes are mapped
determines their rank in MCW
sequential
• The processes are sequentially numbered starting
at the first processing unit, and continuing until the
last processing unit
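The difference between the two orderings can be sketched in a few lines of Python. This is a toy model, not Open MPI's implementation: the `(node, socket, core)` tuples, the round-robin mapping order, and the function names are assumptions chosen for the illustration.

```python
# Each mapped process is described by the location it landed on, as a
# (node, socket, core) tuple. The list is in mapping order; here the
# processes were mapped round-robin across two sockets (hypothetical).
mapping_order = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]

def natural_ranks(mapped):
    """Natural: rank = position in the order the processes were mapped."""
    return {slot: rank for rank, slot in enumerate(mapped)}

def sequential_ranks(mapped):
    """Sequential: rank = position when walking locations first to last."""
    return {slot: rank for rank, slot in enumerate(sorted(mapped))}

print(natural_ranks(mapping_order))
print(sequential_ranks(mapping_order))
```

With this input, the process on socket 1, core 0 is rank 1 under natural ordering but rank 2 under sequential ordering, because sequential numbering finishes socket 0 before moving to socket 1.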
15. 3. Binding
• Launch processes and enforce the layout
[Figure: hwloc (lstopo) topology diagrams, indexes: physical, of several servers with the same layout as the example server (Machine 128GB; 2 NUMA nodes/sockets; 8 cores per socket; 2 PUs per core). The servers are annotated with numbered grids: 0-7 and 8-15 on the first server, 16-23 and 24-31 on the second, and truncated rows "32 33 3…" and "40 41 4…" on the third.]
16. Binding
• Process-launching agent working with the
OS to limit where each process can run:
1. No restrictions
2. Limited set of restrictions
3. Specific resource restrictions
• “Binding width”
The number of PUs to which a process is
bound
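As a rough illustration of binding width, the Python sketch below computes the set of PUs a process would be restricted to at three binding levels. It only models the example server shown in slide 3 (2 sockets, 8 cores per socket, 2 PUs per core, physical PU numbering as in the lstopo figure); the level names `hwthread`, `core`, and `socket` and both function names are assumptions, and no OS affinity call is made.

```python
CORES_PER_SOCKET = 8
SOCKETS = 2

def core_pus(socket, core):
    """Physical PU ids of one core, following the lstopo figure's numbering."""
    first = socket * CORES_PER_SOCKET + core
    return [first, first + SOCKETS * CORES_PER_SOCKET]

def binding_set(socket, core, level):
    """PUs a process mapped to (socket, core) may run on at a given level."""
    if level == "hwthread":
        return core_pus(socket, core)[:1]
    if level == "core":
        return core_pus(socket, core)
    if level == "socket":
        return sorted(pu for c in range(CORES_PER_SOCKET)
                      for pu in core_pus(socket, c))
    raise ValueError(f"unknown binding level: {level}")

for level in ("hwthread", "core", "socket"):
    pus = binding_set(1, 0, level)
    print(level, "width", len(pus), "PUs", pus)
```

For socket 1, core 0 the widths are 1, 2, and 16 PUs respectively; a launcher would hand such a set to the operating system's affinity interface.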
17. Command Line Interface (CLI)
• 4 levels of abstraction for the user
Level 1: None
Level 2: Simple, common patterns
Level 3: LAMA process layout regular patterns
Level 4: Irregular patterns
18. CLI: Level 1 (none)
• No mapping or binding options specified
May or may not specify the number of
processes to launch (-np)
If not specified, default to the number of cores
available in the allocation
One process is mapped to each core in the
system in a "by-core" style
Processes are not bound
• …for backwards compatibility reasons
19. CLI: Level 2 (common)
• Simple, common patterns for mapping and
binding
Specify mapping pattern with
• --map-by X (e.g., --map-by socket)
Specify binding option with:
• --bind-to Y (e.g., --bind-to core)
All of these options are translated to Level 3
options for processing by LAMA
(full list of X / Y values shown later)
20. CLI: Level 3 (regular patterns)
• LAMA process layout regular patterns
Power users wanting something unique for
their application
Four MCA run-time parameters
• rmaps_lama_map: Mapping process layout
• rmaps_lama_bind: Binding width
• rmaps_lama_order: Ordering of MCW ranks
• rmaps_lama_mppr: Maximum allowable number of
processes per resource (oversubscription)
21. rmaps_lama_map (map)
• Takes as an argument the "process layout"
A series of nine tokens
• allowing 9! (362,880) mapping permutation options.
Preferred iteration order for LAMA
• innermost iteration specified first
• outermost iteration specified last
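The iteration order can be pictured as nested loops, with the first token as the innermost loop. The Python sketch below is a simplified model and not LAMA itself. It uses the layout string `nbsch` from the `mpirun` example in this deck and assumes the letters stand for node, board, socket, core, and hardware thread; the function name and the resource counts are also assumptions.

```python
from itertools import islice, product
from math import factorial

def iterate_layout(layout, sizes, nprocs):
    """Return the first nprocs locations visited by a layout.

    layout: sequence of level names, innermost iteration first
            (a plain string works for single-letter tokens).
    sizes:  number of resources at each level (assumed values).
    """
    outer_first = list(reversed(layout))
    # itertools.product varies its LAST argument fastest, so the
    # innermost level must be passed last.
    ranges = [range(sizes[level]) for level in outer_first]
    combos = product(*ranges)
    return [dict(zip(outer_first, combo)) for combo in islice(combos, nprocs)]

# One node, one board, two sockets, eight cores per socket, two threads per core.
sizes = {"n": 1, "b": 1, "s": 2, "c": 8, "h": 2}
for rank, loc in enumerate(iterate_layout("nbsch", sizes, 4)):
    print(rank, "socket", loc["s"], "core", loc["c"], "hwt", loc["h"])

print(factorial(9))  # 362880 orderings of nine distinct tokens
```

Because the node and board levels have a single entry here, the socket is the first level that actually changes, so consecutive processes alternate between the two sockets before moving to the next core.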
32. rmaps_lama_order (order)
• Select which ranks are assigned to
processes in MCW
• There are other possible orderings, but no
one has asked for them yet…
Natural order for
map-by-node (default)
Sequential order for
any mapping
33. rmaps_lama_mppr (mppr)
• mppr (mip-per) sets the Maximum number
of allowable Processes Per Resource
User-specified definition of oversubscription
• Comma-delimited list of <#:resource>
1:c At most one process per core
1:c,2:s At most one process per core, and
at most two processes per socket
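A small Python sketch can show how such a list might be parsed and applied. It is a toy model under stated assumptions, not Open MPI's code: the slides give only the `c` (core) and `s` (socket) abbreviations, so the `h` level, the function names, and the policy of skipping any location that would exceed a limit are all assumptions.

```python
from collections import Counter

# Levels from outermost to innermost. Only "s" (socket) and "c" (core)
# appear in the slides; "h" (hardware thread) is assumed.
HIERARCHY = ("s", "c", "h")

def parse_mppr(spec):
    """Turn a string such as "1:c,2:s" into {"c": 1, "s": 2}."""
    limits = {}
    for item in spec.split(","):
        count, resource = item.split(":")
        limits[resource] = int(count)
    return limits

def resource_id(slot, resource):
    """Identify the enclosing resource of a slot, e.g. core 3 of socket 1."""
    depth = HIERARCHY.index(resource) + 1
    return (resource,) + tuple(slot[level] for level in HIERARCHY[:depth])

def apply_mppr(candidates, limits):
    """Keep candidate slots, in order, while every limit is respected."""
    used = Counter()
    accepted = []
    for slot in candidates:
        ids = [resource_id(slot, res) for res in limits]
        if all(used[i] < limits[i[0]] for i in ids):
            used.update(ids)
            accepted.append(slot)
    return accepted

# Every hardware thread of a 2-socket, 8-core, 2-thread server.
candidates = [{"s": s, "c": c, "h": h}
              for h in range(2) for s in range(2) for c in range(8)]
print(len(apply_mppr(candidates, parse_mppr("1:c"))))      # 16
print(len(apply_mppr(candidates, parse_mppr("1:c,2:s"))))  # 4
```

On this server model, `1:c` leaves room for 16 processes (one per core) and `1:c,2:s` leaves room for 4 (two per socket).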
34. MPPR
1:c At most one process per core
[Figure: hwloc topology of Socket P#0 (NUMANode P#0, 64GB; shared L3, 20MB): Cores P#0-7, each with L2 (256KB), L1d (32KB), L1i (32KB) and two PUs (PU P#0-7 plus PU P#16-23)]
35. MPPR
1:c,2:s At most one process per core and
two processes per socket
[Figure: hwloc topology of Socket P#0 (NUMANode P#0, 64GB; shared L3, 20MB): Cores P#0-7, each with L2 (256KB), L1d (32KB), L1i (32KB) and two PUs (PU P#0-7 plus PU P#16-23)]
36. CLI: Level 4 (rankfile)
• Complete specification of the
process-to-resource mapping
Bypasses LAMA
• Not described in the paper
40. Report Bindings
• --report-bindings displays a pretty-printed representation of the
binding actually used for each process.
Visual feedback = quite helpful when exploring
mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c \
    --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c \
    --report-bindings hello_world
MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
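When exploring many placements it can help to read these masks programmatically. The Python sketch below parses one report line, assuming the format shown above: one bracketed group per socket, cores separated by `/`, one character per hardware thread, and `B` marking a bound thread. The function name is hypothetical and the parser is not part of Open MPI.

```python
import re

def parse_binding_mask(mask):
    """Return (socket, core, hwt) triples marked 'B' in a mask string.

    Core indexes are relative to their socket, not the global core
    numbers printed in the text part of the report line.
    """
    bound = []
    for s, group in enumerate(re.findall(r"\[([^\]]*)\]", mask)):
        for c, core in enumerate(group.split("/")):
            for h, flag in enumerate(core):
                if flag == "B":
                    bound.append((s, c, h))
    return bound

line = ("MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: "
        "[../../../../../../../..][BB/../../../../../../..]")
# The mask is everything after the first ": " on the line.
mask = line.split(": ", 1)[1]
print(parse_binding_mask(mask))  # [(1, 0, 0), (1, 0, 1)]
```

The result says rank 1 is bound to both hardware threads of the first core on socket 1, which matches the "socket 1[core 8[hwt 0-1]]" text on the same line.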
41. Future Work
• Available in Open MPI v1.7.2 (and later)
• Open questions to users:
Are more flexible ordering options useful?
What common mapping patterns are useful?
What additional features would you like to
see?