[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)



  1. 1. Massively Parallel Computing CS 264 / CSCI E-292Lecture #7: GPU Cluster Programming | March 8th, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  2. 2. Administrivia• Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11• Project info: http://www.cs264.org/projects/projects.html• Project ideas: http://forum.cs264.org/index.php?board=6.0• Project proposal deadline: Fri 3/25/11 (but you should submit well before then so you can start working on it asap)• Need a private repo for your project? Let us know! Poll on the forum: http://forum.cs264.org/index.php?topic=228.0
  3. 3. Goodies• Guest Lectures: 14 distinguished speakers• Schedule updated (see website)
  4. 4. Goodies (cont’d)• Amazon AWS free credits coming soon (only for students who completed HW0+1)• It’s a donation of more than $14,000 to the class!• Special thanks: Kurt Messersmith @ Amazon
  5. 5. Goodies (cont’d)• Best Project Prize: Tesla C2070 (Fermi) Board• It’s a donation of more than $4,000 to the class!• Special thanks: David Luebke & Chandra Cheij @ NVIDIA
  6. 6. During this course, we’ll try to reuse existing material (marked “adapted for CS264”) ;-)
  7. 7. Today... yey!!
  8. 8. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  9. 9. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  10. 10. The Problem Many computational problems too big for single CPU Lack of RAM Lack of CPU cycles Want to distribute work between many CPUsslide by Richard Edgar
  11. 11. Types of Parallelism Some computations are ‘embarrassingly parallel’ Can do a lot of computation on minimal data RC5 DES, SETI@HOME etc. Solution is to distribute across the Internet Use TCP/IP or similarslide by Richard Edgar
  12. 12. Types of Parallelism Some computations very tightly coupled Have to communicate a lot of data at each step e.g. hydrodynamics Internet latencies much too high Need a dedicated machineslide by Richard Edgar
  13. 13. Tightly Coupled Computing Two basic approaches Shared memory Distributed memory Each has advantages and disadvantagesslide by Richard Edgar
  14. 14. Some terminology — distributed memory: private memory for each processor, accessible only by that processor, so no synchronization is needed for memory accesses; information is exchanged by sending data from one processor to another over an interconnection network using explicit communication operations. (diagram: “distributed memory” vs. “shared memory” organizations — processors P with private memories M on an interconnection network vs. processors sharing memory banks; now: mostly hybrid)
  15. 15. (terminology slide repeated: “distributed memory” vs. “shared memory” diagrams; now: mostly hybrid)
  16. 16. Shared Memory Machines Have lots of CPUs share the same memory banks Spawn lots of threads Each writes to globally shared memory Multicore CPUs now ubiquitous Most computers now ‘shared memory machines’slide by Richard Edgar
  17. 17. Shared Memory Machines NASA ‘Columbia’ Computer Up to 2048 cores in single systemslide by Richard Edgar
  18. 18. Shared Memory Machines Spawning lots of threads (relatively) easy pthreads, OpenMP Don’t have to worry about data location Disadvantage is memory performance scaling Frontside bus saturates rapidly Can use Non-Uniform Memory Architecture (NUMA) Silicon Graphics Origin & Altix series Gets expensive very fastslide by Richard Edgar
  19. 19. (terminology slide repeated: “distributed memory” vs. “shared memory” diagrams; now: mostly hybrid)
  20. 20. Distributed Memory Clusters Alternative is a lot of cheap machines High-speed network between individual nodes Network can cost as much as the CPUs! How do nodes communicate?slide by Richard Edgar
  21. 21. Distributed Memory Clusters NASA ‘Pleiades’ Cluster 51,200 coresslide by Richard Edgar
  22. 22. Distributed Memory Model Communication is key issue Each node has its own address space (exclusive access, no global memory) Could use TCP/IP Painfully low level Solution: a communication protocol like message-passing (e.g. MPI)slide by Richard Edgar
  23. 23. Distributed Memory Model All data must be explicitly partitioned Exchange of data by explicit communicationslide by Richard Edgar
  24. 24. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  25. 25. Message Passing Interface MPI is a communication protocol for parallel programs Language independent Open standard Originally created by working group at SC92 Bindings for C, C++, Fortran, Python, etc. http://www.mcs.anl.gov/research/projects/mpi/ http://www.mpi-forum.org/slide by Richard Edgar
  26. 26. Message Passing Interface MPI processes have independent address spaces Communicate by sending messages Means of sending messages invisible Use shared memory if available! (i.e. shared memory can be used behind the scenes on shared-memory architectures) On Level 5 (Session) and higher of OSI modelslide by Richard Edgar
  27. 27. OSI Model ?
  28. 28. Message Passing Interface MPI is a standard, a specification, for message-passing libraries Two major implementations of MPI MPICH OpenMPI Programs should work with eitherslide by Richard Edgar
  29. 29. Basic Idea • Usually programmed with SPMD model (single program, multiple data) • In MPI-1 number of tasks is static - cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2. • No assumptions on type of interconnection network; all processors can send a message to any other processor. • All parallelism explicit - programmer responsible for correctly identifying parallelism and implementing parallel algorithmsadapted from Berger & Klöckner (NYU 2010)
  30. 30. Credits: James Carr (OCI)
  31. 31. Hello World #include <mpi.h> #include <stdio.h> int main(int argc, char** argv) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("Hello world from %d of %d\n", rank, size); MPI_Finalize(); return 0; }adapted from Berger & Klöckner (NYU 2010)
  32. 32. Hello World To compile: need to load “MPI” wrapper modules in addition to the compiler modules (OpenMPI, MPICH, ...) module load mpi/openmpi/1.2.8/gnu module load openmpi/intel/1.3.3 To compile: mpicc hello.c To run: need to tell how many processes you are requesting mpiexec -n 10 a.out (mpirun -np 10 a.out) adapted from Berger & Klöckner (NYU 2010)
  33. 33. The beauty of data visualization http://www.youtube.com/watch?v=pLqjQ55tz-U
  34. 34. The beauty of data visualization http://www.youtube.com/watch?v=pLqjQ55tz-U
  35. 35. Example: gprof2dot
  36. 36. “They’ve done studies, you know. 60% of the time, it works every time...” - Brian Fantana (Anchorman, 2004)
  37. 37. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  38. 38. Basic MPI MPI is a library of routines Bindings exist for many languages Principal languages are C, C++ and Fortran Python: mpi4py We will discuss C++ bindings from now on http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node287.htmslide by Richard Edgar
  39. 39. Basic MPI MPI allows processes to exchange messages Processes are members of communicators Communicator shared by all is MPI::COMM_WORLD In C++ API, communicators are objects Within a communicator, each process has unique IDslide by Richard Edgar
  40. 40. A Minimal MPI Program #include <iostream> #include <cstdlib> using namespace std; #include “mpi.h” int main( int argc, char** argv ) { MPI::Init( argc, argv ); cout << “Hello World!” << endl; MPI::Finalize(); return( EXIT_SUCCESS ); } Very much a minimal program; no actual communication occurs slide by Richard Edgar
  41. 41. A Minimal MPI Program To compile MPI programs use mpic++ mpic++ -o MyProg myprog.cpp The mpic++ command is a wrapper for default compiler Adds in libraries Use mpic++ --show to see what it does Will also find mpicc, mpif77 and mpif90 (usually)slide by Richard Edgar
  42. 42. A Minimal MPI Program To run the program, use mpirun mpirun -np 2 ./MyProg The -np 2 option launches two processes Check documentation for your cluster Number of processes might be implicit Program should print “Hello World” twiceslide by Richard Edgar
  43. 43. Communicators Processes are members of communicators A process can Find the size of a given communicator Determine its ID (or rank) within it Default communicator is MPI::COMM_WORLDslide by Richard Edgar
  44. 44. Communicators int nProcs, iMyProc; MPI::Init( argc, argv ); nProcs = MPI::COMM_WORLD.Get_size(); iMyProc = MPI::COMM_WORLD.Get_rank(); cout << “Hello from process ”; cout << iMyProc << “ of ”; cout << nProcs << endl; MPI::Finalize(); Queries the COMM_WORLD communicator for the number of processes and the current process rank (ID), then prints these out; process ranks count from zero. slide by Richard Edgar
  45. 45. Communicators By convention, process with rank 0 is master const int iMasterProc = 0; Can have more than one communicator Process may have different rank within eachslide by Richard Edgar
  46. 46. Messages Haven’t sent any data yet Communicators have Send and Recv methods for this One process posts a Send Must be matched by Recv in the target processslide by Richard Edgar
  47. 47. Sending Messages A sample send is as follows: int a[10]; MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag ); The method prototype is void Comm::Send( const void* buf, int count, const Datatype& datatype, int dest, int tag) const MPI copies the buffer into a system buffer and returns No delivery notificationslide by Richard Edgar
  48. 48. Receiving Messages Similar call to receive int a[10]; MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag ); Function prototype is void Comm::Recv( void* buf, int count, const Datatype& datatype, int source, int tag) const Blocks until data arrives The wildcards MPI::ANY_SOURCE and MPI::ANY_TAG match any sender or tagslide by Richard Edgar
  49. 49. MPI Datatypes MPI datatypes are independent of language and endianness. Most common (MPI datatype = C/C++ type): MPI::CHAR = signed char, MPI::SHORT = signed short, MPI::INT = signed int, MPI::LONG = signed long, MPI::FLOAT = float, MPI::DOUBLE = double, MPI::BYTE = untyped byte data. slide by Richard Edgar
  50. 50. MPI Send & Receive if( iMyProc == iMasterProc ) { for( int i=1; i<nProcs; i++ ) { int iMessage = 2 * i + 1; cout << “Sending ” << iMessage << “ to process ” << i << endl; MPI::COMM_WORLD.Send( &iMessage, 1, MPI::INT, i, iTag ); } } else { int iMessage; MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, iMasterProc, iTag ); cout << “Process ” << iMyProc << “ received ” << iMessage << endl; } The master process sends out numbers; worker processes print out the number received. slide by Richard Edgar
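The master/worker exchange on this slide can be sketched in plain Python, with per-rank mailboxes standing in for MPI's buffered point-to-point messages (no MPI runtime needed; `run_spmd` is a hypothetical helper, not part of MPI):

```python
from collections import deque

def run_spmd(n_procs):
    # One receive queue per rank, standing in for MPI's system buffers.
    mailbox = {rank: deque() for rank in range(n_procs)}
    received = {}

    # Master (rank 0) posts one Send per worker, carrying 2*i + 1.
    for i in range(1, n_procs):
        mailbox[i].append(2 * i + 1)  # like MPI::COMM_WORLD.Send(&msg, 1, MPI::INT, i, iTag)

    # Each worker posts the matching Recv (would block until data arrives in real MPI).
    for rank in range(1, n_procs):
        received[rank] = mailbox[rank].popleft()

    return received

print(run_spmd(4))  # {1: 3, 2: 5, 3: 7}
```

In real MPI every rank runs the same binary and branches on its rank, which is exactly the if/else structure shown on the slide.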
  51. 51. Six Basic MPI Routines Have now encountered six MPI routines MPI::Init(), MPI::Finalize() MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(), MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv() These are enough to get started ;-) More sophisticated routines available...slide by Richard Edgar
  52. 52. Collective Communications Send and Recv are point-to-point Communicate between specific processes Sometimes we want all processes to exchange data These are called collective communicationsslide by Richard Edgar
  53. 53. Barriers Barriers require all processes to synchronise MPI::COMM_WORLD.Barrier(); Processes wait until all processes arrive at barrier Potential for deadlock Bad for performance Only use if necessaryslide by Richard Edgar
  54. 54. Broadcasts Suppose one process has array to be shared with all int a[10]; MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc ); If process has rank iSrcProc, it will send the array Other processes will receive it All will have a[10] identical to iSrcProc on completionslide by Richard Edgar
  55. 55. MPI Broadcast (diagram: before, only P0 holds A; after the broadcast, P0–P3 all hold A) MPI_Bcast(&buf, count, datatype, root, comm) All processors must call MPI_Bcast with the same root value.adapted from Berger & Klöckner (NYU 2010)
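The broadcast semantics can be written down as a tiny simulation (plain Python, no MPI; `buffers` holds one list per rank):

```python
# Sketch of MPI_Bcast semantics: after the call, every rank's buffer
# holds a copy of the root's buffer.
def bcast(buffers, root):
    data = list(buffers[root])       # root's send buffer
    for rank in range(len(buffers)):
        buffers[rank][:] = data      # every rank (root included) ends with a copy
    return buffers

buffers = [[7, 7], [0, 0], [0, 0], [0, 0]]
bcast(buffers, root=0)
print(buffers)  # [[7, 7], [7, 7], [7, 7], [7, 7]]
```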
  56. 56. Reductions Suppose we have a large array split across processes We want to sum all the elements Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM Also MPI::COMM_WORLD.Allreduce() variant Can perform MAX, MIN, MAXLOC, MINLOC tooslide by Richard Edgar
  57. 57. MPI Reduce (diagram: P0–P3 hold A, B, C, D; after the reduce, the root holds A⊕B⊕C⊕D) Reduction operators can be min, max, sum, multiply, logical ops, max value and location ... Must be associative (commutative optional)adapted from Berger & Klöckner (NYU 2010)
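A minimal sketch of what a reduce computes, folding one contribution per rank with an associative operator (plain Python, no MPI; `mpi_reduce` is an illustrative stand-in, not the real API):

```python
from functools import reduce as fold
import operator

# Only the root receives the combined result; other ranks get None here,
# mirroring how MPI_Reduce leaves non-root receive buffers untouched.
def mpi_reduce(contributions, op, root):
    result = fold(op, contributions)
    return [result if rank == root else None
            for rank in range(len(contributions))]

print(mpi_reduce([1, 2, 3, 4], operator.add, root=0))  # [10, None, None, None]
print(mpi_reduce([5, 2, 9], max, root=1))              # [None, 9, None]
```

An Allreduce variant would simply give every rank the result instead of only the root.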
  58. 58. Scatter and Gather Split a large array between processes Use MPI::COMM_WORLD.Scatter() Each process receives part of the array Combine small arrays into one large one Use MPI::COMM_WORLD.Gather() Designated process will construct entire array Has MPI::COMM_WORLD.Allgather() variantslide by Richard Edgar
  59. 59. MPI Scatter/Gather (diagram: Scatter splits P0’s A B C D so that P0–P3 receive A, B, C, D respectively; Gather is the inverse)adapted from Berger & Klöckner (NYU 2010)
  60. 60. MPI Allgather (diagram: P0–P3 hold A, B, C, D; after the allgather, every process holds A B C D)adapted from Berger & Klöckner (NYU 2010)
  61. 61. MPI Alltoall (diagram: P0–P3 start with blocks A0–A3, B0–B3, C0–C3, D0–D3; after the alltoall, Pi holds Ai Bi Ci Di — a transpose of the per-process blocks)adapted from Berger & Klöckner (NYU 2010)
  62. 62. Asynchronous Messages An asynchronous API exists too Have to allocate buffers Have to check if send or receive has completed Will give better performance Trickier to useslide by Richard Edgar
  63. 63. User-Defined Datatypes Usually have complex data structures Require means of distributing these Can pack & unpack manually MPI allows us to define own datatypes for thisslide by Richard Edgar
  64. 64. MPI-2 • One-sided RMA (remote memory access) communication • potential for greater efficiency, easier programming. • Use ”windows” into memory to expose regions for access • Race conditions now possible. • Parallel I/O like message passing but to file system not other processes. • Allows for dynamic number of processes and inter-communicators (as opposed to intra-communicators) • Cleaned up MPI-1adapted from Berger & Klöckner (NYU 2010)
  65. 65. RMA • Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate). • RMA window objects created by collective window-creation functions (MPI_Win_create must be called by all participants) • Before accessing, call MPI_Win_fence (or other synchronization mechanisms) to start an RMA access epoch; a fence (like a barrier) separates local ops on the window from remote ops • RMA operations are non-blocking; separate synchronization is needed to check completion: call MPI_Win_fence again. (diagram: Put from P0’s local memory into the RMA window in P1’s local memory)adapted from Berger & Klöckner (NYU 2010)
  66. 66. Some MPI Bugs
  67. 67. Sample MPI Bugs (code on slide) Only works for an even number of processors. What’s wrong?adapted from Berger & Klöckner (NYU 2010)
  68. 68. Sample MPI Bugs (code on slide) Only works for an even number of processors.adapted from Berger & Klöckner (NYU 2010)
  69. 69. Sample MPI Bugs Suppose you have a local variable “energy” and you want to sum all the processors’ “energy” to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). What’s wrong with using the same variable for both buffers, as in MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)? (Aliasing the send and receive buffers is not allowed; use MPI_IN_PLACE at the root instead.)adapted from Berger & Klöckner (NYU 2010)
  70. 70. Communication Topologies
  71. 71. Communication Topologies Some topologies very common Grid, hypercube etc. API provided to set up communicators following theseslide by Richard Edgar
  72. 72. Parallel Performance Recall Amdahl’s law: if T1 = serial cost + parallel cost, then Tp = serial cost + parallel cost/p. But really Tp = serial cost + parallel cost/p + Tcommunication. How expensive is it?adapted from Berger & Klöckner (NYU 2010)
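The model on this slide is easy to plug numbers into (the timings below are illustrative, not measurements):

```python
# Speedup under the slide's model: Tp = serial + parallel/p + T_comm.
def speedup(serial, parallel, p, t_comm=0.0):
    t1 = serial + parallel
    tp = serial + parallel / p + t_comm
    return t1 / tp

print(round(speedup(1.0, 99.0, 100), 2))              # 50.25: serial 1% already halves ideal speedup
print(round(speedup(1.0, 99.0, 100, t_comm=1.0), 2))  # 33.44: communication cost bites further
```

Even a communication term comparable to the serial fraction visibly erodes scaling, which is why the next slides focus on network characteristics.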
  73. 73. Network Characteristics Interconnection network connects nodes, transfers data Important qualities: • Topology - the structure used to connect the nodes • Routing algorithm - how messages are transmitted between processors, along which path (= nodes along which message transferred). • Switching strategy = how message is cut into pieces and assigned a path • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell source to halt, discard, etc.adapted from Berger & Klöckner (NYU 2010)
  74. 74. Interconnection Network Represent as graph G = (V , E), V = set of nodes to be connected, E = direct links between the nodes. Links usually bidirectional - transfer msg in both directions at same time. Characterize network by: • diameter - maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission) • degree - number of direct links for a node (number of direct neighbors) • bisection bandwidth - minimum number of edges that must be removed to partition network into two parts of equal size with no connection between them. (measures network capacity for transmitting messages simultaneously) • node/edge connectivity - numbers of node/edges that must fail to disconnect the network (measure of reliability)adapted from Berger & Klöckner (NYU 2010)
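The graph metrics just defined can be computed directly, which is a handy way to check the numbers on the topology slides that follow (plain-Python sketch; topologies given as adjacency dicts):

```python
from collections import deque

def bfs_dists(adj, src):
    # Shortest hop counts from src to every reachable node.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter(adj):
    # Max over all pairs of the shortest path between them.
    return max(max(bfs_dists(adj, s).values()) for s in adj)

def linear_array(p):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

print(diameter(linear_array(8)))  # 7  (= p - 1)
print(diameter(ring(8)))          # 4  (= p / 2)
```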
  75. 75. Linear Array • p vertices, p − 1 links • Diameter = p − 1 • Degree = 2 • Bisection bandwidth = 1 • Node connectivity = 1, edge connectivity = 1adapted from Berger & Klöckner (NYU 2010)
  76. 76. Ring topology • diameter = p/2 • degree = 2 • bisection bandwidth = 2 • node connectivity = 2 edge connectivity = 2adapted from Berger & Klöckner (NYU 2010)
  77. 77. Mesh topology • diameter = 2(√p − 1) (3d mesh: 3(∛p − 1)) • degree = 4 (6 in 3d) • bisection bandwidth = √p • node connectivity 2, edge connectivity 2 Route along each dimension in turnadapted from Berger & Klöckner (NYU 2010)
  78. 78. Torus topology Diameter halved, Bisection bandwidth doubled, Edge and Node connectivity doubled over meshadapted from Berger & Klöckner (NYU 2010)
  79. 79. Hypercube topology (diagram: 1d–4d hypercubes with binary node labels) • p = 2^k processors labelled with binary numbers of length k • k-dimensional cube constructed from two (k − 1)-cubes • Connect corresponding procs if labels differ in 1 bit (Hamming distance d between 2 k-bit binary words = path of length d between 2 nodes)adapted from Berger & Klöckner (NYU 2010)
  80. 80. Hypercube topology (diagram repeated) • diameter = k (= log2 p) • degree = k • bisection bandwidth = p/2 • node connectivity k, edge connectivity kadapted from Berger & Klöckner (NYU 2010)
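The Hamming-distance property suggests the routing algorithm directly: flip the differing bits one at a time, giving a path whose length equals the Hamming distance (and hence diameter k between opposite corners). A small sketch:

```python
def hamming(a, b):
    # Number of bit positions in which the two labels differ.
    return bin(a ^ b).count('1')

def hypercube_route(src, dst):
    # Fix differing bits from the lowest upward; each flip is one hop.
    path, cur = [src], src
    for bit in range(max(src, dst).bit_length()):
        if (src ^ dst) >> bit & 1:
            cur ^= 1 << bit
            path.append(cur)
    return path

k = 4
p = 2 ** k
# Every single-bit flip is a direct link (degree = k).
assert all(hamming(u, u ^ (1 << b)) == 1 for u in range(p) for b in range(k))
print(hypercube_route(0b0000, 0b1011))     # [0, 1, 3, 11]
print(len(hypercube_route(0, p - 1)) - 1)  # 4 hops between opposite corners = diameter k
```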
  81. 81. Dynamic Networks Above networks were direct, or static interconnection networks = processors connected directly with each through fixed physical links. Indirect or dynamic networks = contain switches which provide an indirect connection between the nodes. Switches configured dynamically to establish a connection. • bus • crossbar • multistage network - e.g. butterfly, omega, baselineadapted from Berger & Klöckner (NYU 2010)
  82. 82. Crossbar (diagram: processors P1..Pn and memories M1..Mm on a switch grid) • Connecting n inputs and m outputs takes nm switches. (Typically only for small numbers of processors) • At each switch can either go straight or change direction. • Diameter = 1, bisection bandwidth = padapted from Berger & Klöckner (NYU 2010)
  83. 83. Butterfly (diagram: 16 × 16 butterfly network; stages 0–3, rows 000–111) For p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switchesadapted from Berger & Klöckner (NYU 2010)
  84. 84. Fat tree • Complete binary tree • Processors at leaves • Increase links for higher bandwidth near rootadapted from Berger & Klöckner (NYU 2010)
  85. 85. Current picture • Old style: mapped algorithms to topologies • New style: avoid topology-specific optimizations • Want code that runs on next year’s machines too. • Topology awareness in vendor MPI libraries? • Software topology - ease of programming, but not used for performance?adapted from Berger & Klöckner (NYU 2010)
  86. 86. Should we care ?• Old school: map algorithms to specific topologies• New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure....)• Meta-programming / Auto-tuning ?
  87. 87. (chart: Top500 interconnect family statistics, in table format from the top500.org statistics page; a direct link to the statistics is also available)adapted from Berger & Klöckner (NYU 2010)
  88. 88. MPI References • Lawrence Livermore tutorial https://computing.llnl.gov/tutorials/mpi/ • Using MPI: Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, Skjellum • Using MPI-2: Advanced Features of the Message-Passing Interface by Gropp, Lusk, Thakur • Lots of other on-line tutorials, books, etc.adapted from Berger & Klöckner (NYU 2010)
  89. 89. Ignite: Google Trends http://www.youtube.com/watch?v=m0b-QX0JDXc
  90. 90. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  91. 91. MPI with CUDA MPI and CUDA almost orthogonal Each node simply becomes faster Problem matching MPI processes to GPUs Use compute-exclusive mode on GPUs Tell cluster environment to limit processes per node Have to know your cluster documentationslide by Richard Edgar
  92. 92. Data Movement Communication now very expensive GPUs can only communicate via their hosts Very laborious Again: need to minimize communicationslide by Richard Edgar
  93. 93. MPI Summary MPI provides cross-platform interprocess communication Invariably available on computer clusters Only need six basic commands to get started Much more sophistication availableslide by Richard Edgar
  94. 94. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  95. 95. ZeroMQ • ‘messaging middleware’ ‘TCP on steroids’ ‘new layer on the networking stack’ • not a complete messaging system • just a simple messaging library to be used programmatically • a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
  96. 96. ZeroMQ • Fastest. Messaging. Ever. • Excellent documentation: • examples • white papers for everything • Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
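A minimal sketch of what "a simple messaging library used programmatically" looks like in practice, assuming the Python binding (pyzmq) is installed; the inproc:// transport keeps the request-reply pair inside one process so the example is self-contained:

```python
import zmq

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind("inproc://demo")      # for inproc transports, bind before connect
req = ctx.socket(zmq.REQ)
req.connect("inproc://demo")

req.send(b"ping")              # REQ/REP sockets enforce strict send/recv alternation
request = rep.recv()
rep.send(b"pong:" + request)
reply = req.recv()
print(reply)                   # b'pong:ping'

for s in (req, rep):
    s.close()
ctx.term()
```

The same code works over tcp:// between machines by changing only the endpoint string, which is the library's main selling point.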
  97. 97. Message Patterns http://nichol.as/zeromq-an-introduction http://zguide.zeromq.org/page:all
  98. 98. Demo: Why ZeroMQ ? http://www.youtube.com/watch?v=_JCBphyciAs
  99. 99. MPI vs ZeroMQ ?• MPI is a specification, ZeroMQ is an implementation.• Design: • MPI is designed for tightly-coupled compute clusters with fast and reliable networks. • ZeroMQ is designed for large distributed systems (web-like).• Fault tolerance: • MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!). • ZeroMQ is resilient to faults and network instability.• ZeroMQ could be a good transport layer for an MPI-like implementation. http://stackoverflow.com/questions/35490/spread-vs-mpi-vs-zeromq
  100. 100. Fast Forward: CUDASA
  101. 101. CUDASA: Compute Unified Device and Systems Architecture (Universität Stuttgart). (Slides 101–117 are largely unreadable in this transcript. They extend the CUDA programming model with additional abstraction layers — an application layer, network layer, and bus layer above the GPU layer — to program multi-GPU systems and GPU clusters, with MPI used beneath the network layer; topics include language extensions and a source-to-source compiler on top of CUDA, distributed shared-memory management, and scaling results where communication overhead limits cluster-level speedup.)
  118. 118. Fast Forward: Multi-GPU MapReduce
  119. 119. MapReduce http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
  120. 120. Why MapReduce?• Simple programming model• Parallel programming model• Scalable• Previous GPU work: neither multi-GPU nor out-of-core
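The "simple programming model" claim is easiest to see in code. Below is a plain-Python sketch of the three phases (map, shuffle, reduce) for a word-occurrence count — the same logic as the WO benchmark on the next slides, which GPMR distributes across GPUs:

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit one (word, 1) pair per word; chunks could be mapped in parallel.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group values by key, as the runtime does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each key's values; reducers could also run in parallel.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the quick fox", "the lazy dog the"]
pairs = [kv for c in chunks for kv in map_phase(c)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
```

The programmer only writes the map and reduce functions; partitioning, grouping, and scheduling are the runtime's job — which is what makes the model scale.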
  121. 121. Benchmarks—Which• Matrix Multiplication (MM)• Word Occurrence (WO)• Sparse-Integer Occurrence (SIO)• Linear Regression (LR)• K-Means Clustering (KMC)• (Volume Renderer—presented 90 minutes ago @ MapReduce ’10)
  122. 122. Benchmarks—Why• Needed to stress aspects of GPMR • Unbalanced work (WO) • Multiple emits/Non-uniform number of emits (LR, KMC, WO) • Sparsity of keys (SIO) • Accumulation (WO, LR, KMC) • Many key-value pairs (SIO) • Compute Bound Scalability (MM)
  123. 123. Benchmarks—Results
  124. 124. Benchmarks—Results (vs. CPU and vs. GPU) Table 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices). MM / KMC / LR / SIO / WO — 1-GPU speedup: 162.712 / 2.991 / 1.296 / 1.450 / 11.080; 4-GPU speedup: 559.209 / 11.726 / 4.085 / 2.322 / 18.441. Table 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence (the largest problems that can meet the in-core memory requirements of Mars). MM / KMC / WO — 1-GPU speedup: 2.695 / 37.344 / 3.098; 4-GPU speedup: 10.760 / 129.425 / 11.709.
  125. 125. Benchmarks - Results Good
  126. 126. Benchmarks - Results Good
  127. 127. Benchmarks - Results Good
  128. 128. one more thing or two...
  129. 129. Life/Code Hacking #3 The Pomodoro Technique
  130. 130. Life/Code Hacking #3 The Pomodoro Techniquehttp://lifehacker.com/#!5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
  131. 131. http://www.youtube.com/watch?v=QYyJZOHgpco