McMPI
Managed-code MPI library in Pure C#
Dr D Holmes, EPCC
dholmes@epcc.ed.ac.uk
Outline
• Yet another MPI library?
• Managed-code, C#, Windows
• McMPI, design and implementation details
• Object-orientation, design patterns,
communication performance results
• Threads and the MPI Standard
• Pre-“End Points proposal” ideas
Why Implement MPI Again?
• Parallel program, distributed memory => MPI library
• Most (all?) MPI libraries written in C
• MPI Standard provides C and FORTRAN bindings
• C++ can use the C functions
• Other languages can follow the C++ model
• Use the C functions
• Alternatively, MPI can be implemented in that language
• Removes inter-language function call overheads but …
• May not be possible to achieve comparable performance
Why Did I Choose C#?
• Experience and knowledge I gained from my career in
software development
• My impression of the popularity of C# in commercial
software development
• My desire to bridge the gap between high-performance
programming and high-productivity programming
• One of the UK research councils offered me funding for a
PhD that proposed to use C# to implement MPI
C# Myths
• C# only runs on Windows
• Not such a bad thing – 3 of the Top500 machines use Windows
• Not actually true – Mono works on multiple operating systems
• C# is a Microsoft language
• Not such a bad thing – resources, commitment, support, training
• Not actually true – C# follows ECMA and ISO standards
• C# is slow like Java
• Not such a bad thing – expressivity, readability, re-usability
• Not actually true – no easy way to prove this conclusively
• C# and its ilk are not things we need to care about
• Not such a bad thing – they will survive/thrive, or not, without us
• Not actually true – popularity trumps utility
McMPI Design & Implementation
• Desirable features of code
• Isolation of concerns -> easier to understand
• Human readability -> easier to maintain
• Compiler readability -> easier to get good performance
• Object-orientation can help with isolation of concerns
• So can modularisation and judiciously reducing LOC per code file
• Design patterns can help with human readability
• So can documentation and useful in-code comments
• Choice of language & compiler can help with performance
• So can coding style and detailed examination of compiler output
• What is the best compromise?
Communication Layer
• Abstract factory design pattern (sketched below)
• Similar to plug-ins
• Enables addition of new functionality without re-compilation of the
rest of the library
• All communication modules:
• Implement the same Abstract Device Interface (ADI)
• Isolate the details of their implementation from other layers
• Provide the same semantics and capabilities
• Reliable delivery
• Ordering of delivery
• Preservation of message boundaries
• Message = fixed size envelope information and variable size user data
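To make the plug-in idea concrete, here is a minimal C# sketch of an abstract-factory communication layer; every name in it (Envelope, ICommunicationModule, CommunicationModuleFactory, LoopbackModule) is an illustrative stand-in, not one of McMPI's actual types.

```csharp
using System.Collections.Generic;

// Illustrative sketch only: these names are hypothetical, not McMPI's actual types.
// Every communication module implements the same Abstract Device Interface (ADI)
// and promises reliable, ordered delivery that preserves message boundaries.
public struct Envelope                      // fixed-size matching information
{
    public int Source, Destination, Tag, Context, PayloadLength;
}

public interface ICommunicationModule       // the ADI seen by the protocol layer
{
    void Send(Envelope envelope, byte[] userData);
    bool TryReceive(out Envelope envelope, out byte[] userData);
}

public abstract class CommunicationModuleFactory
{
    public abstract ICommunicationModule Create();
}

// Adding a new transport means supplying one module + factory pair; the rest of
// the library is compiled against the interface only, so nothing else changes.
public sealed class LoopbackModule : ICommunicationModule
{
    private struct Pending { public Envelope Envelope; public byte[] Data; }
    private readonly Queue<Pending> _inbox = new Queue<Pending>();

    public void Send(Envelope envelope, byte[] userData) =>
        _inbox.Enqueue(new Pending { Envelope = envelope, Data = userData });

    public bool TryReceive(out Envelope envelope, out byte[] userData)
    {
        if (_inbox.Count == 0) { envelope = default(Envelope); userData = null; return false; }
        Pending p = _inbox.Dequeue();
        envelope = p.Envelope; userData = p.Data;
        return true;
    }
}

public sealed class LoopbackModuleFactory : CommunicationModuleFactory
{
    public override ICommunicationModule Create() => new LoopbackModule();
}
```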
Communication Layer – UML
Protocol Layer
• Bridge design pattern
• Enables addition of new functionality without re-compilation of the
rest of the library
• All protocol messages:
• Inherit from the same base class
• Isolate the details of their implementation from other layers
• Modify state of internal shared data structures independently
• Shared data structures (message ‘queues’)
• Unexpected queue – message envelope at receiver before receive
• Request queue – receive called before message envelope arrival
• Matched queue – at receiver waiting for message data to arrive
• Pending queue – message data waiting at sender
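A minimal sketch of how these queues could interact is shown below; the types (MatchingEngine, PostedReceive) and the wildcard convention are assumptions for illustration, not McMPI's actual protocol-layer code, and the matched and pending queues are only hinted at in comments.

```csharp
using System.Collections.Generic;

// Illustrative matching sketch: type and member names are hypothetical, not
// McMPI's actual code. It shows how the unexpected and request (posted-receive)
// queues interact; a matched entry then waits for the message data to arrive.
public struct Envelope { public int Source, Tag, Context; }

public sealed class PostedReceive
{
    public int Source, Tag, Context;        // -1 acts as a wildcard in this sketch
    public Envelope? MatchedEnvelope;       // set once matched

    public bool Matches(Envelope e) =>
        Context == e.Context
        && (Source == -1 || Source == e.Source)
        && (Tag == -1 || Tag == e.Tag);
}

public sealed class MatchingEngine
{
    private readonly List<Envelope> _unexpected = new List<Envelope>();          // envelope arrived before receive
    private readonly List<PostedReceive> _requests = new List<PostedReceive>();  // receive posted before arrival

    // Called by the communication layer when an envelope arrives.
    public void OnEnvelopeArrived(Envelope env)
    {
        int i = _requests.FindIndex(r => r.Matches(env));
        if (i >= 0)
        {
            PostedReceive recv = _requests[i];
            _requests.RemoveAt(i);
            recv.MatchedEnvelope = env;      // conceptually moves to the matched queue
        }
        else
        {
            _unexpected.Add(env);            // no receive posted yet: unexpected queue
        }
    }

    // Called by the protocol layer when the user posts a receive.
    public void OnReceivePosted(PostedReceive recv)
    {
        int i = _unexpected.FindIndex(recv.Matches);
        if (i >= 0)
        {
            recv.MatchedEnvelope = _unexpected[i];
            _unexpected.RemoveAt(i);
        }
        else
        {
            _requests.Add(recv);             // request queue
        }
    }
}
```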
Protocol Layer – UML
Interface Layer
• Simple façade design pattern
• Translates MPI Standard-like syntax into protocol layer syntax
• Will become adapter design pattern
• For example, when custom data-types are implemented
• Current version of McMPI covers parts of MPI 1 only
• Initialisation and finalisation
• Administration functions, e.g. to get rank and size of communicator
• Point-to-point communication functions
• ready, synchronous, standard (not buffered)
• blocking, non-blocking, persistent
• Previous version had collectives
• Implemented on top of point-to-point
• Using hypercube or binary tree algorithms
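A minimal façade sketch follows, under assumed names (Communicator, ProtocolEngine, Request); the protocol layer is stubbed just enough to compile, and none of this is McMPI's real interface code, only the shape of the translation it performs.

```csharp
// Illustrative façade sketch: all names here are assumptions, not McMPI's API.
public sealed class Request
{
    public bool IsComplete { get; internal set; }
}

// Minimal stand-in for the protocol layer, stubbed just enough to compile.
public sealed class ProtocolEngine
{
    public void StandardSend(byte[] buffer, int dest, int tag, int context) { /* ... */ }
    public Request PostReceive(byte[] buffer, int source, int tag, int context) =>
        new Request { IsComplete = true };          // stub: completes immediately
    public void Progress() { /* drive outstanding communication */ }
}

// The façade: translates MPI-Standard-like calls into protocol-layer calls,
// adding no logic of its own.
public sealed class Communicator
{
    private readonly ProtocolEngine _protocol;
    public int Rank { get; }
    public int Size { get; }

    internal Communicator(ProtocolEngine protocol, int rank, int size)
    {
        _protocol = protocol; Rank = rank; Size = size;
    }

    // Blocking standard-mode send (MPI_Send-like).
    public void Send(byte[] buffer, int dest, int tag) =>
        _protocol.StandardSend(buffer, dest, tag, context: 0);

    // Blocking receive (MPI_Recv-like): post, then drive progress until complete.
    public void Recv(byte[] buffer, int source, int tag)
    {
        Request request = _protocol.PostReceive(buffer, source, tag, context: 0);
        while (!request.IsComplete)
            _protocol.Progress();
    }
}
```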
McMPI Implementation Overview
Performance Results – Introduction 1
• Shared-memory results – hardware details
• Number of Nodes: 1 (Armari Magnetar server)
• CPUs per Node: 2 (Intel Xeon E5420)
• Threads per CPU: 4 (quad-core, no hyper-threading)
• Core Clock Speed: 2.5GHz (front-side bus 1333MHz)
• Level 1 Cache: 4x2x32KB (data & instruction caches, per core)
• Level 2 Cache: 2x6MB (one per pair of cores)
• Memory per Node: 16GB (DDR2 667MHz)
• Network Hardware: 2xNIC (Intel 82575EB Gigabit Ethernet)
• Operating System: WinXP Pro 64-bit with SP3 (version 5.2.3790)
Performance Results – Introduction 2
• Distributed-memory results – hardware details
• Number of Nodes: 18 (Dell PowerEdge 2900)
• CPUs per Node: 2 (Intel Xeon 5130, family 6, model 15, stepping 6)
• Threads per CPU: 2 (dual-core, no hyper-threading)
• Core Clock Speed: 2.0GHz (front-side bus 1333MHz)
• Level 1 Cache: 2x2x32KB (data & instruction caches, per core)
• Level 2 Cache: 1x4MB (one per CPU)
• Memory per Node: 4GB (DDR2 533MHz)
• Network Hardware: 2xNIC (BCM5708C NetXtreme II GigE)
• Operating System: Windows Server 2008 x64 SP2 (version 6.0.6002)
Shared-memory – Latency
[Chart: point-to-point latency (µs) against message size (1 to 32,768 bytes) for MPICH2 shared memory, MS-MPI shared memory, and McMPI thread-to-thread]
Shared-memory – Bandwidth
[Chart: point-to-point bandwidth (Mbit/s) against message size (4,096 to 1,048,576 bytes) for McMPI thread-to-thread, MPICH2 shared memory, and MS-MPI shared memory]
Distributed-memory – Latency
[Chart: point-to-point latency (µs) against message size (1 to 32,768 bytes) for McMPI eager and MS-MPI]
Distributed-memory – Bandwidth
[Chart: point-to-point bandwidth (Mbit/s) against message size (4,096 to 1,048,576 bytes) for McMPI rendezvous, McMPI eager, and MS-MPI]
Thread-as-rank – Threading Level
• McMPI allows MPI_THREAD_AS_RANK as input for the
MPI_INIT_THREAD function
• McMPI creates new threads during initialisation
• Not strictly needed – each application thread could call MPI_INIT_THREAD itself (once per rank)
• McMPI uses thread-local storage to store ‘rank’
• Not needed – each communicator handle can encode ‘rank’
• Thread-to-thread message delivery is zero-copy
• Direct copy from user send buffer to user receive buffer
• Any thread can progress MPI messages
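For concreteness, a hedged usage sketch of thread-as-rank appears below; only the MPI_THREAD_AS_RANK level and the MPI_INIT_THREAD call come from the slides, while the C# names (Mpi.InitThread, ThreadingLevel, Communicator) and their stubbed behaviour are invented for illustration. The point of the sketch is that each application thread obtains its own rank by calling the initialisation function itself, so the library neither creates threads nor needs thread-local storage for the rank.

```csharp
using System.Threading;

// Hypothetical usage sketch of thread-as-rank. Only the MPI_THREAD_AS_RANK /
// MPI_INIT_THREAD concepts come from the presentation; the C# names and the
// stubbed behaviour below are invented for illustration.
enum ThreadingLevel { Single, Funneled, Serialized, Multiple, ThreadAsRank }

sealed class Communicator
{
    public int Rank { get; }
    public int Size { get; }
    public Communicator(int rank, int size) { Rank = rank; Size = size; }
    public void Send(byte[] buffer, int dest, int tag) { /* stub */ }
    public void Recv(byte[] buffer, int source, int tag) { /* stub */ }
}

static class Mpi
{
    private static int _nextRank = -1;

    // Each application thread calls this itself, so the library never has to
    // create threads; the rank can live in the returned communicator handle
    // instead of thread-local storage.
    public static Communicator InitThread(ThreadingLevel level) =>
        new Communicator(Interlocked.Increment(ref _nextRank), size: 4);

    public static void Finalise() { }
}

static class ThreadAsRankExample
{
    static void Main()
    {
        const int threadsPerProcess = 4;
        var workers = new Thread[threadsPerProcess];

        for (int t = 0; t < threadsPerProcess; t++)
        {
            workers[t] = new Thread(() =>
            {
                Communicator world = Mpi.InitThread(ThreadingLevel.ThreadAsRank);

                byte[] message = new byte[1024];
                if (world.Rank == 0)
                    world.Send(message, dest: 1, tag: 0);    // thread-to-thread delivery
                else if (world.Rank == 1)                    // can be a single copy from
                    world.Recv(message, source: 0, tag: 0);  // send buffer to receive buffer

                Mpi.Finalise();
            });
            workers[t].Start();
        }

        foreach (Thread w in workers) w.Join();
    }
}
```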
Thread-as-rank – MPI Process
Diagram created by Gaurav Saxena, MSc, 2013
Thread-as-rank – MPI Standard
• Is thread-as-rank compliant with the MPI Standard?
• Does the MPI Standard allow/support thread-as-rank?
• Ambiguous/debatable at best
• The MPI Standard assumes MPI process = OS process
• Call MPI_INIT or MPI_INIT_THREAD twice in one OS process
• Erroneous by definition or results in two MPI processes?
• MPI Standard “thread compliant” prohibits thread-as-rank
• To maintain a POSIX-process-like interface for MPI process
• End-points proposal violates this principle in exactly the same way
• Other possible interfaces exist
Thread-as-rank – End-points
• Similarities
• Multiple threads can communicate reliably without using tags
• Thread ‘rank’ can be stored in thread-local storage or handles
• Most common use-case likely requires MPI_THREAD_MULTIPLE
• Differences
• Thread-as-rank part of initialisation and active until finalisation
• End-points created after initialisation and can be destroyed
• Thread-as-rank has all possible ranks in MPI_COMM_WORLD
• End-points only has some ranks in MPI_COMM_WORLD
• Thread-as-rank cannot create ranks but may need to merge ranks
• End-points can create ranks and does not need to merge ranks
Thread-as-rank – MPI Forum Proposal?
• Short answer: no
• Long answer: not yet, it’s complicated
• More likely to be suggested amendments to end-points proposal
• Thread-as-rank is a special case of end-points
• Standard MPI_COMM_WORLD replaced with an end-points
communicator during MPI_INIT_THREAD
• Thread-safety implications are similar (possibly identical?)
• Advantages/opportunities similar
• Thread-to-thread delivery rather than process-to-process delivery
• Work-stealing MPI progress engine or per-thread message queues
Questions?
