MPI.NET Talk

Using MPI.NET for high performance computing in .NET

Speaker notes:
  • Good evening. My name is David Ross and tonight I will be talking about the Message Passing Interface, which is a high-speed messaging framework used in the development of software on supercomputers. Due to the commoditisation of hardware and ...
  • Before I discuss MPI, it is very important when discussing concurrency or parallel programming that we understand the problem space the technology is trying to solve. Herb Sutter, who runs the C++ team at Microsoft, has broken the multi-core concurrency problem into three aspects: isolated components, concurrent collections and safety. MPI falls into the first bucket and is focussed on high-speed data transfer between compute nodes.
  • Assuming we have a problem that is too slow to run in a single process or on a single machine, we then have the problem of orchestrating data transfers and commands between the nodes.
  • MPI is an API standard, not a product, and there are many different implementations of the standard. MPI.NET, meanwhile, uses knowledge of the .NET type system to remove the complexity of using MPI from a language like C. In C you must explicitly state the size of collections and the width of the complex types being transferred. In MPI.NET the serialisation is far easier.
  • As we will see in the code examples later, MPI.NET is a ...
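  (To make the contrast in these notes concrete, here is a minimal MPI.NET sketch; the Trade class and its fields are invented for illustration. MPI.NET serialises the whole object graph automatically, where C MPI would need the array lengths and struct layout spelled out by hand:)

      using System;
      using MPI;

      [Serializable]
      class Trade {
          public string Symbol;
          public double[] Prices;   // length is inferred, never declared to MPI
      }

      class SerialisationDemo {
          static void Main(string[] args) {
              using (new MPI.Environment(ref args)) {
                  if (Communicator.world.Rank == 0) {
                      var trade = new Trade { Symbol = "VOD.L",
                                              Prices = new[] { 1.0, 2.0, 3.0 } };
                      // One call sends the whole object graph
                      Communicator.world.Send(trade, 1, 0);
                  } else if (Communicator.world.Rank == 1) {
                      Trade t = Communicator.world.Receive<Trade>(0, 0);
                      Console.WriteLine("Received {0} with {1} prices",
                                        t.Symbol, t.Prices.Length);
                  }
              }
          }
      }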

    1. MPI.NET
       Supercomputing in .NET using the Message Passing Interface
       David Ross
       Email: willmation@gmail.com
       Blog: www.pebblesteps.com
    2. Computationally complex problems in enterprise software
       • ETL load into the Data Warehouse takes too long: use compute clusters to quickly provide a summary report
       • Analyse massive database tables by processing chunks in parallel on the compute cluster
       • Increasing the speed of Monte Carlo analysis problems
       • Filtering/Analysis of massive log files:
         • Click-through analysis from IIS logs
         • Firewall logs
    3. Three Pillars of Concurrency
       Herb Sutter/David Callahan break parallel computing techniques into:
       • Responsiveness and Isolation via Asynchronous Agents: Active Objects, GUIs, Web Services, MPI
       • Throughput and Scalability via Concurrent Collections: Parallel LINQ, Work Stealing, OpenMP
       • Consistency via Safely Shared Resources: mutable shared objects, Transactional Memory
       Source: Dr. Dobb's Journal, http://www.ddj.com/hpc-high-performance-computing/200001985
    4. The Logical Supercomputer
       Supercomputer:
       • Massively parallel machine / workstation cluster
       • Batch orientated: a big problem goes in, some time later the result is found...
       Single System Image:
       • It doesn't matter how the supercomputer is implemented in hardware/software; it appears to its users as a SINGLE machine
       • Deployment of a program onto 1000 machines MUST be automated
    5. Message Passing Interface
       C based API for messaging
       A specification, not an implementation (standardised by the MPI Forum)
       Different vendors (including open source projects) provide implementations of the specification
       MS-MPI is a fork (of MPICH2) by Microsoft to run on their HPC servers:
       • Includes Active Directory support
       • Fast access to the MS network stack
    6. MPI Implementation
       The standard defines:
       • The coding interface (C header files)
       An MPI implementation is responsible for:
       • Communication with the OS & hardware (network cards, pipes, NUMA etc...)
       • Data transport/buffering
    7. MPI
       Fork-Join parallelism:
       • Work is segmented off to worker nodes
       • Results are collated back to the root node
       No memory is shared:
       • Separate machines or processes
       • Hence data locking is unnecessary/impossible
       Speed critical:
       • Throughput over development time
       Large data orientated problems:
       • Numerical analysis (matrices) is easily parallelised
    8. MPI.NET
       MPI.NET is a wrapper around MS-MPI
       MPI is complex, as the C runtime cannot infer:
       • Array lengths
       • The size of complex types
       MPI.NET is far simpler:
       • The size of collections etc. is inferred from the type system automatically
       • IDisposable is used to set up/tear down the MPI session
       • MPI.NET uses "unsafe" handcrafted IL for very fast marshalling of .NET objects to the unmanaged MPI API
    9. Single Program Multiple Node
       The same application is deployed to each node
       The node Id is used to drive application/orchestration logic
       Fork-Join/Map-Reduce are the core paradigms
    10. Hello World in MPI

        using System;
        using MPI;

        public class FrameworkSetup {
            static void Main(string[] args) {
                using (new MPI.Environment(ref args)) {
                    string s = String.Format(
                        "My processor is {0}. My rank is {1}",
                        MPI.Environment.ProcessorName,
                        Communicator.world.Rank);
                    Console.WriteLine(s);
                }
            }
        }
    11. Executing
        MPI.NET is designed to be hosted in Windows HPC Server
        MPI.NET has recently been ported to Mono/Linux - still under development and not recommended
        Windows HPC Pack SDK:

        mpiexec -n 4 SkillsMatter.MIP.Net.FrameworkSetup.exe
        My processor is LPDellDevSL.digiterre.com. My rank is 0
        My processor is LPDellDevSL.digiterre.com. My rank is 3
        My processor is LPDellDevSL.digiterre.com. My rank is 2
        My processor is LPDellDevSL.digiterre.com. My rank is 1
    12. Send/Receive
        Logical Topology: rank drives parallelism

        static void Main(string[] args) {
            using (new MPI.Environment(ref args)) {
                if (Communicator.world.Size != 2)
                    throw new Exception("This application must be run with MPI Size == 2");

                const int NumberOfPings = 5;
                for (int i = 0; i < NumberOfPings; i++) {
                    if (Communicator.world.Rank == 0) {
                        string send = "Hello Msg:" + i;
                        Console.WriteLine(
                            "Rank " + Communicator.world.Rank + " is sending: " + send);
                        // Blocking send: (data, destination, message tag)
                        Communicator.world.Send<string>(send, 1, 0);
                    }
    13. Send/Receive

                    else {
                        // Blocking receive: (source, message tag)
                        string s = Communicator.world.Receive<string>(0, 0);
                        Console.WriteLine(
                            "Rank " + Communicator.world.Rank + " received: " + s);
                    }
                }
            }
        }

        Result:
        Rank 0 is sending: Hello Msg:0
        Rank 0 is sending: Hello Msg:1
        Rank 0 is sending: Hello Msg:2
        Rank 0 is sending: Hello Msg:3
        Rank 0 is sending: Hello Msg:4
        Rank 1 received: Hello Msg:0
        Rank 1 received: Hello Msg:1
        Rank 1 received: Hello Msg:2
        Rank 1 received: Hello Msg:3
        Rank 1 received: Hello Msg:4
    14. Send/Receive/Barrier
        Send/Receive:
        • Blocking point-to-point messaging
        Immediate Send/Immediate Receive:
        • Asynchronous point-to-point messaging
        • The Request object has flags to indicate whether the operation is complete
        Barrier:
        • Global block
        • All programs halt until the statement has executed on every node
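    (A minimal sketch of the asynchronous pattern above, assuming MPI.NET's ImmediateSend/ImmediateReceive, Request.Test() and Barrier members; the busy-wait loop and the "ping" payload are illustration only, not part of the original deck:)

        using System;
        using MPI;

        class ImmediateDemo {
            static void Main(string[] args) {
                using (new MPI.Environment(ref args)) {
                    Intracommunicator world = Communicator.world;
                    if (world.Rank == 0) {
                        // Asynchronous send: returns a Request handle immediately
                        Request send = world.ImmediateSend("ping", 1, 0);
                        send.Wait();              // block until the send has completed
                    } else if (world.Rank == 1) {
                        ReceiveRequest recv = world.ImmediateReceive<string>(0, 0);
                        while (recv.Test() == null) {
                            // message still in flight: do useful work here
                        }
                        Console.WriteLine("Got: " + recv.GetValue());
                    }
                    world.Barrier();              // every rank halts here until all arrive
                }
            }
        }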
    15. Broadcast/Scatter/Gather/Reduce
        Broadcast:
        • Send data from one node to all other nodes
        • For a many-node system, as soon as a node receives the shared data it passes it on
        Scatter:
        • Split an array into Communicator.world.Size chunks and send a chunk to each node
        • Typically used for sharing the rows of a matrix
    16. Broadcast/Scatter/Gather/Reduce
        Gather:
        • Each node sends a chunk of data to the root node
        • The inverse of the Scatter operation
        Reduce:
        • Calculate a result on each node
        • Combine the results into a single value through a reduction (Min, Max, Add, a custom delegate etc...)
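    (A minimal sketch of Scatter, Gather and Reduce working together; Broadcast already appears in the worked example on the next slides. The Scatter/Gather signatures are assumptions based on the MPI.NET documentation, and the squared-integers data set is invented for illustration:)

        using System;
        using System.Linq;
        using MPI;

        class CollectivesDemo {
            static void Main(string[] args) {
                using (new MPI.Environment(ref args)) {
                    Intracommunicator world = Communicator.world;

                    // Scatter: root supplies world.Size values, one lands on each rank
                    int[] all = (world.Rank == 0)
                        ? Enumerable.Range(1, world.Size).ToArray()
                        : null;
                    int mine = world.Scatter(all, 0);

                    // Each rank computes locally...
                    int squared = mine * mine;

                    // Gather: the root collects every per-rank result (null elsewhere)
                    int[] squares = world.Gather(squared, 0);

                    // Reduce: combine the per-rank values into one on the root
                    int total = world.Reduce(squared, Operation<int>.Add, 0);
                    if (world.Rank == 0)
                        Console.WriteLine("Gathered {0} values, sum of squares: {1}",
                                          squares.Length, total);
                }
            }
        }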
    17. Data orientated problem

        static void Main(string[] args) {
            using (new MPI.Environment(ref args)) {
                const int RANK_0 = 0;

                // Load: only the root reads the grades (LoadStudentGrades is elided)
                int numberOfGrades = 0;
                double[] allGrades = null;
                if (Communicator.world.Rank == RANK_0) {
                    allGrades = LoadStudentGrades();
                    numberOfGrades = allGrades.Length;
                }
                // Share: populates numberOfGrades on every node
                Communicator.world.Broadcast(ref numberOfGrades, 0);
    18. (continued)

                // Root splits up the array and sends a chunk to each compute node;
                // each chunk is deserialised into grades
                double[] grades = null;
                int pageSize = numberOfGrades / Communicator.world.Size;
                if (Communicator.world.Rank == RANK_0) {
                    Communicator.world.ScatterFromFlattened(
                        allGrades, pageSize, 0, ref grades);
                } else {
                    Communicator.world.ScatterFromFlattened(
                        null, pageSize, 0, ref grades);
                }
    19. (continued)

                // Summarise: calculate the sum on each node and reduce to the root
                double sumOfMarks = Communicator.world.Reduce<double>(
                    grades.Sum(), Operation<double>.Add, 0);

                // Calculate and publish the average mark
                double averageMark = 0.0;
                if (Communicator.world.Rank == RANK_0) {
                    averageMark = sumOfMarks / numberOfGrades;
                }
                // Share: every node receives the average
                Communicator.world.Broadcast(ref averageMark, 0);
                ...
    20. Result

        Rank: 3, Sum of Marks:0, Average:50.7409948765608, stddev:0
        Rank: 2, Sum of Marks:0, Average:50.7409948765608, stddev:0
        Rank: 0, Sum of Marks:202963.979506243, Average:50.7409948765608, stddev:28.9402362588477
        Rank: 1, Sum of Marks:0, Average:50.7409948765608, stddev:0
    21. Fork-Join Parallelism
        • Load the problem parameters
        • Share the problem with the compute nodes
        • Wait and gather the results
        • Repeat
        Best Practice:
        • Each Fork-Join block should be treated as a separate Unit of Work
        • Preferably as an individual module, otherwise spaghetti code can ensue (see the sketch below)
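    (A sketch of one such unit of work in the load / share / compute / gather shape described above; the input data and the doubling "computation" are placeholders, and GatherFlattened is assumed from the MPI.NET documentation:)

        using System;
        using System.Linq;
        using MPI;

        class ForkJoinBlock {
            const int Root = 0;

            static void Main(string[] args) {
                using (new MPI.Environment(ref args)) {
                    Intracommunicator world = Communicator.world;

                    // 1. Load: the root owns the problem parameters
                    double[] input = null;
                    int chunkSize = 0;
                    if (world.Rank == Root) {
                        input = Enumerable.Range(0, 1000).Select(i => (double)i).ToArray();
                        chunkSize = input.Length / world.Size;
                    }

                    // 2. Share: every node learns the chunk size, then receives its chunk
                    world.Broadcast(ref chunkSize, Root);
                    double[] chunk = null;
                    world.ScatterFromFlattened(input, chunkSize, Root, ref chunk);

                    // 3. Compute: each node works on its own chunk (fork)
                    double[] partial = chunk.Select(x => x * 2).ToArray();

                    // 4. Gather: the root collates the results (join)
                    double[] result = world.GatherFlattened(partial, Root);
                    if (world.Rank == Root)
                        Console.WriteLine("Gathered {0} results", result.Length);
                }
            }
        }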
    22. When to use
        PLINQ or the Task Parallel Library (1st choice):
        • Map-Reduce operations that utilise all the cores on a box
        Web Services / WCF (2nd choice):
        • No data sharing between nodes
        • A load balancer in front of a Web Farm is far easier development
        MPI:
        • Lots of sharing of intermediate results
        • Huge data sets
        • Project appetite to invest in a cluster or to deploy to a cloud
        MPI + PLINQ Hybrid (3rd choice):
        • MPI moves the data, PLINQ utilises the cores (see the sketch below)
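    (A hypothetical hybrid sketch combining the two: ScatterFromFlattened and Reduce move data between machines, as in the grades example earlier, while AsParallel() spreads each local chunk across the cores of one machine; the data set is invented:)

        using System;
        using System.Linq;
        using MPI;

        class HybridDemo {
            static void Main(string[] args) {
                using (new MPI.Environment(ref args)) {
                    Intracommunicator world = Communicator.world;

                    // The root owns the full data set
                    const int N = 1 << 20;
                    double[] all = (world.Rank == 0)
                        ? Enumerable.Range(0, N).Select(i => (double)i).ToArray()
                        : null;
                    int pageSize = N / world.Size;

                    // MPI moves the data between machines...
                    double[] chunk = null;
                    world.ScatterFromFlattened(all, pageSize, 0, ref chunk);

                    // ...PLINQ spreads the local chunk across the local cores
                    double localSum = chunk.AsParallel().Sum();

                    // MPI combines the per-machine results
                    double grandTotal = world.Reduce(localSum, Operation<double>.Add, 0);
                    if (world.Rank == 0)
                        Console.WriteLine("Total: " + grandTotal);
                }
            }
        }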
    23. More Information
        MPI.Net: http://www.osl.iu.edu/research/mpi.net/software/
        Google: Windows HPC Pack 2008 SP1
        MPI Forum: http://www.mpi-forum.org/
        Slides and Source: http://www.pebblesteps.com

        Thanks for listening...
