[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)


  • 1. Massively Parallel Computing (CS 264 / CSCI E-292). Lecture #7: GPU Cluster Programming | March 8th, 2011. Nicolas Pinto (MIT, Harvard)
  • 2. Administrivia • Homeworks: HW2 due Mon 3/14/11, HW3 out Fri 3/11/11 • Project info: • Project ideas: • Project proposal deadline: Fri 3/25/11 (but you should submit way before, to start working on it asap) • Need a private private repo for your project? Let us know! Poll on the forum:
  • 3. Goodies• Guest Lectures: 14 distinguished speakers• Schedule updated (see website)
  • 4. Goodies (cont’d)• Amazon AWS free credits coming soon (only for students who completed HW0+1)• It’s more than $14,000 donation for the class!• Special thanks: Kurt Messersmith @ Amazon
  • 5. Goodies (cont’d)• Best Project Prize: Tesla C2070 (Fermi) Board• It’s more than $4,000 donation for the class!• Special thanks: David Luebke & Chandra Cheij @ NVIDIA
  • 6. During this course, we’ll try to adapt (“adapted for CS264”) and use existing material ;-)
  • 7. Today... yey!!
  • 8. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 9. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 10. The Problem Many computational problems too big for single CPU Lack of RAM Lack of CPU cycles Want to distribute work between many CPUsslide by Richard Edgar
  • 11. Types of Parallelism Some computations are ‘embarrassingly parallel’ Can do a lot of computation on minimal data RC5 DES, SETI@HOME etc. Solution is to distribute across the Internet Use TCP/IP or similarslide by Richard Edgar
  • 12. Types of Parallelism Some computations very tightly coupled Have to communicate a lot of data at each step e.g. hydrodynamics Internet latencies much too high Need a dedicated machineslide by Richard Edgar
  • 13. Tightly Coupled Computing Two basic approaches Shared memory Distributed memory Each has advantages and disadvantagesslide by Richard Edgar
  • 14. Some terminology. Distributed memory: private memory for each processor, accessible only by that processor, so no synchronization for memory accesses is needed; information is exchanged by sending data from one processor to another across an interconnection network using explicit communication operations. Shared memory: processors access a common memory through the interconnection network. (Diagrams: processors P with memories M linked by an interconnection network.) Both approaches increasingly common; now: mostly hybrid
  • 15. Some terminology (cont’d: same diagrams, animation step) “distributed memory” vs “shared memory”; now: mostly hybrid
  • 16. Shared Memory Machines Have lots of CPUs share the same memory banks Spawn lots of threads Each writes to globally shared memory Multicore CPUs now ubiquitous Most computers now ‘shared memory machines’slide by Richard Edgar
  • 17. Shared Memory Machines NASA ‘Columbia’ Computer Up to 2048 cores in single systemslide by Richard Edgar
  • 18. Shared Memory Machines Spawning lots of threads (relatively) easy pthreads, OpenMP Don’t have to worry about data location Disadvantage is memory performance scaling Frontside bus saturates rapidly Can use Non-Uniform Memory Architecture (NUMA) Silicon Graphics Origin & Altix series Gets expensive very fastslide by Richard Edgar
  • 19. Some terminology (diagrams repeated, transitioning to distributed memory) “distributed memory” vs “shared memory”; now: mostly hybrid
  • 20. Distributed Memory Clusters Alternative is a lot of cheap machines High-speed network between individual nodes Network can cost as much as the CPUs! How do nodes communicate?slide by Richard Edgar
  • 21. Distributed Memory Clusters NASA ‘Pleiades’ Cluster 51,200 coresslide by Richard Edgar
  • 22. Distributed Memory Model Communication is key issue Each node has its own address space (exclusive access, no global memory?) Could use TCP/IP Painfully low level Solution: a communication protocol like message-passing (e.g. MPI)slide by Richard Edgar
  • 23. Distributed Memory Model All data must be explicitly partitioned Exchange of data by explicit communicationslide by Richard Edgar
  • 24. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 25. Message Passing Interface MPI is a communication protocol for parallel programs Language independent Open standard Originally created by working group at SC92 Bindings for C, C++, Fortran, Python, etc. by Richard Edgar
  • 26. Message Passing Interface MPI processes have independent address spaces Communicate by sending messages Means of sending messages invisible Use shared memory if available! (i.e. it can be used behind the scenes on shared-memory architectures) On Level 5 (Session) and higher of OSI modelslide by Richard Edgar
  • 27. OSI Model ?
  • 28. Message Passing Interface MPI is a standard, a specification, for message-passing libraries Two major implementations of MPI MPICH OpenMPI Programs should work with eitherslide by Richard Edgar
  • 29. Basic Idea • Usually programmed with SPMD model (single program, multiple data) • In MPI-1 number of tasks is static - cannot dynamically spawn new tasks at runtime. Enhanced in MPI-2. • No assumptions on type of interconnection network; all processors can send a message to any other processor. • All parallelism explicit - programmer responsible for correctly identifying parallelism and implementing parallel algorithmsadapted from Berger & Klöckner (NYU 2010)
  • 30. Credits: James Carr (OCI)
  • 31. Hello World #include <mpi.h> #include <stdio.h> int main(int argc, char** argv) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); printf("Hello world from %d of %d\n", rank, size); MPI_Finalize(); return 0; }adapted from Berger & Klöckner (NYU 2010)
  • 32. Hello World To compile: need to load the MPI wrappers in addition to the compiler modules (OpenMPI, MPICH, ...) module load mpi/openmpi/1.2.8/gnu module load openmpi/intel/1.3.3 then: mpicc hello.c To run: need to tell how many processes you are requesting mpiexec -n 10 a.out (or: mpirun -np 10 a.out) adapted from Berger & Klöckner (NYU 2010)
  • 33. The beauty of data visualization
  • 34. The beauty of data visualization
  • 35. Example: gprof2dot
  • 36. “ They’ve done studies, you know. 60% of the time, it works every time... ” - Brian Fantana (Anchorman, 2004)
  • 37. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 38. Basic MPI MPI is a library of routines Bindings exist for many languages Principal languages are C, C++ and Fortran Python: mpi4py We will discuss C++ bindings from now on by Richard Edgar
  • 39. Basic MPI MPI allows processes to exchange messages Processes are members of communicators Communicator shared by all is MPI::COMM_WORLD In C++ API, communicators are objects Within a communicator, each process has unique IDslide by Richard Edgar
  • 40. A Minimal MPI Program #include <iostream> #include <cstdlib> using namespace std; #include “mpi.h” int main( int argc, char* argv[] ) { MPI::Init( argc, argv ); cout << “Hello World!” << endl; MPI::Finalize(); return( EXIT_SUCCESS ); } Very much a minimal program: no actual communication occursslide by Richard Edgar
  • 41. A Minimal MPI Program To compile MPI programs use mpic++ mpic++ -o MyProg myprog.cpp The mpic++ command is a wrapper for default compiler Adds in libraries Use mpic++ --show to see what it does Will also find mpicc, mpif77 and mpif90 (usually)slide by Richard Edgar
  • 42. A Minimal MPI Program To run the program, use mpirun mpirun -np 2 ./MyProg The -np 2 option launches two processes Check documentation for your cluster Number of processes might be implicit Program should print “Hello World” twiceslide by Richard Edgar
  • 43. Communicators Processes are members of communicators A process can Find the size of a given communicator Determine its ID (or rank) within it Default communicator is MPI::COMM_WORLDslide by Richard Edgar
  • 44. Communicators int nProcs, iMyProc; MPI::Init( argc, argv ); nProcs = MPI::COMM_WORLD.Get_size(); iMyProc = MPI::COMM_WORLD.Get_rank(); cout << “Hello from process ” << iMyProc << “ of ” << nProcs << endl; MPI::Finalize(); Queries the COMM_WORLD communicator for the number of processes and the current process rank (ID), then prints these out. Process rank counts from zero.slide by Richard Edgar
  • 45. Communicators By convention, process with rank 0 is master const int iMasterProc = 0; Can have more than one communicator Process may have different rank within eachslide by Richard Edgar
  • 46. Messages Haven’t sent any data yet Communicators have Send and Recv methods for this One process posts a Send Must be matched by Recv in the target processslide by Richard Edgar
  • 47. Sending Messages A sample send is as follows: int a[10]; MPI::COMM_WORLD.Send( a, 10, MPI::INT, iTargetProc, iTag ); The method prototype is void Comm::Send( const void* buf, int count, const Datatype& datatype, int dest, int tag) const MPI copies the buffer into a system buffer and returns No delivery notificationslide by Richard Edgar
  • 48. Receiving Messages Similar call to receive: int a[10]; MPI::COMM_WORLD.Recv( a, 10, MPI::INT, iSrcProc, iMyTag ); Function prototype is void Comm::Recv( void* buf, int count, const Datatype& datatype, int source, int tag) const Wildcards MPI::ANY_SOURCE and MPI::ANY_TAG can be used for the source and tag Blocks until data arrivesslide by Richard Edgar
  • 49. MPI Datatypes MPI datatypes are independent of language and endianness. Most common (MPI datatype = C/C++ type): MPI::CHAR = signed char; MPI::SHORT = signed short; MPI::INT = signed int; MPI::LONG = signed long; MPI::FLOAT = float; MPI::DOUBLE = double; MPI::BYTE = untyped byte dataslide by Richard Edgar
  • 50. MPI Send & Receive if( iMyProc == iMasterProc ) { for( int i=1; i<nProcs; i++ ) { int iMessage = 2 * i + 1; cout << “Sending ” << iMessage << “ to process ” << i << endl; MPI::COMM_WORLD.Send( &iMessage, 1, Master process sends MPI::INT, i, iTag ); out numbers } } else { int iMessage; Worker processes print MPI::COMM_WORLD.Recv( &iMessage, 1, MPI::INT, out number received iMasterProc, iTag ); cout << “Process ” << iMyProc << “ received ” << iMessage << endl; }slide by Richard Edgar
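The master/worker exchange above can be sketched in plain Python, with one thread per worker rank and a Queue per rank standing in for the matching blocking Recv. This is purely illustrative (names like N_PROCS are not MPI API); real ranks are separate processes, possibly on separate nodes.

```python
import threading
from queue import Queue

N_PROCS = 4
inbox = [Queue() for _ in range(N_PROCS)]  # one "receive buffer" per rank
received = {}

def worker(rank):
    # .get() blocks until the master's message arrives, like a blocking Recv
    received[rank] = inbox[rank].get()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(1, N_PROCS)]
for t in threads:
    t.start()

# Master (rank 0) sends 2*i + 1 to each worker, mirroring the slide's loop
for i in range(1, N_PROCS):
    inbox[i].put(2 * i + 1)

for t in threads:
    t.join()

print(sorted(received.items()))  # [(1, 3), (2, 5), (3, 7)]
```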
  • 51. Six Basic MPI Routines Have now encountered six MPI routines MPI::Init(), MPI::Finalize() MPI::COMM_WORLD.Get_size(), MPI::COMM_WORLD.Get_rank(), MPI::COMM_WORLD.Send(), MPI::COMM_WORLD.Recv() These are enough to get started ;-) More sophisticated routines available...slide by Richard Edgar
  • 52. Collective Communications Send and Recv are point-to-point Communicate between specific processes Sometimes we want all processes to exchange data These are called collective communicationsslide by Richard Edgar
  • 53. Barriers Barriers require all processes to synchronise MPI::COMM_WORLD.Barrier(); Processes wait until all processes arrive at barrier Potential for deadlock Bad for performance Only use if necessaryslide by Richard Edgar
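Barrier semantics can be illustrated on a single node with Python's threading.Barrier: no thread proceeds past the barrier until all parties have arrived. A minimal sketch (thread count and bookkeeping names are illustrative):

```python
import threading

N = 4
barrier = threading.Barrier(N)
order = []                      # event log, guarded by a lock
lock = threading.Lock()

def work(rank):
    with lock:
        order.append(("before", rank))
    barrier.wait()              # analogous to MPI::COMM_WORLD.Barrier()
    with lock:
        order.append(("after", rank))

threads = [threading.Thread(target=work, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every "before" event precedes every "after" event: the barrier ordered them
assert all(e == "before" for e, _ in order[:N])
assert all(e == "after" for e, _ in order[N:])
```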
  • 54. Broadcasts Suppose one process has array to be shared with all int a[10]; MPI::COMM_WORLD.Bcast( a, 10, MPI::INT, iSrcProc ); If process has rank iSrcProc, it will send the array Other processes will receive it All will have a[10] identical to iSrcProc on completionslide by Richard Edgar
  • 55. MPI Broadcast (diagram: before, P0 holds A; after Broadcast, all of P0–P3 hold A) MPI_Bcast(&buf, count, datatype, root, comm) All processors must call MPI_Bcast with the same root value.adapted from Berger & Klöckner (NYU 2010)
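One reason broadcast is a library call rather than a loop of Sends is that it can complete in ceil(log2 p) communication rounds via a binomial tree: in round k, every rank that already has the data forwards it to rank + 2^k. A toy sketch of that schedule (illustrative only; real MPI_Bcast implementations choose among several algorithms):

```python
def tree_bcast(p, root=0):
    """Count the rounds a binomial-tree broadcast needs for p ranks (root 0)."""
    has_data = {root}
    steps = 0
    while len(has_data) < p:
        for r in list(has_data):
            partner = r + 2 ** steps   # each holder forwards to one new rank
            if partner < p:
                has_data.add(partner)
        steps += 1
    return steps

assert tree_bcast(8) == 3   # log2(8) rounds instead of 7 sequential sends
assert tree_bcast(5) == 3   # ceil(log2 5)
```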
  • 56. Reductions Suppose we have a large array split across processes We want to sum all the elements Use MPI::COMM_WORLD.Reduce() with MPI::Op SUM Also MPI::COMM_WORLD.Allreduce() variant Can perform MAX, MIN, MAXLOC, MINLOC tooslide by Richard Edgar
  • 57. MPI Reduce (diagram: P0–P3 hold A, B, C, D; after Reduce, the root P0 holds A⊕B⊕C⊕D) Reduction operators can be min, max, sum, multiply, logical ops, max value and location ... Must be associative (commutative optional)adapted from Berger & Klöckner (NYU 2010)
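The semantics fit in a few lines of Python: MPI::SUM corresponds to folding an associative operator over one value per rank, and MAXLOC to tracking a value together with the rank that holds it (variable names here are illustrative, not MPI API):

```python
from functools import reduce
import operator

rank_values = [1.0, 2.0, 3.0, 4.0]          # one partial result per "rank"

# SUM reduction: fold an associative operator down to one result at the root
total = reduce(operator.add, rank_values)
assert total == 10.0

# MAXLOC-style reduction: max value together with the rank holding it
max_val, max_rank = max((v, r) for r, v in enumerate(rank_values))
assert (max_val, max_rank) == (4.0, 3)
```

Associativity is what lets the library evaluate the fold as a tree across ranks instead of sequentially.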
  • 58. Scatter and Gather Split a large array between processes Use MPI::COMM_WORLD.Scatter() Each process receives part of the array Combine small arrays into one large one Use MPI::COMM_WORLD.Gather() Designated process will construct entire array Has MPI::COMM_WORLD.Allgather() variantslide by Richard Edgar
  • 59. MPI Scatter/Gather (diagram: Scatter distributes P0’s pieces A, B, C, D so that Pi receives the i-th piece; Gather is the inverse, collecting all pieces back at P0)adapted from Berger & Klöckner (NYU 2010)
  • 60. MPI Allgather (diagram: each Pi holds one piece; after Allgather, every process holds all of A, B, C, D)adapted from Berger & Klöckner (NYU 2010)
  • 61. MPI Alltoall (diagram: P0 starts with A0..A3, P1 with B0..B3, etc.; after Alltoall, P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, etc.: a transpose of the data layout)adapted from Berger & Klöckner (NYU 2010)
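As the diagram suggests, Alltoall is a transpose of the data layout: afterwards, rank i holds the i-th piece from every rank (recv[i][j] == send[j][i]). A pure-Python sketch of the semantics:

```python
def alltoall(send):
    """Simulate Alltoall: rank i receives the i-th chunk from every rank."""
    p = len(send)
    return [[send[j][i] for j in range(p)] for i in range(p)]

send = [["A0", "A1", "A2", "A3"],
        ["B0", "B1", "B2", "B3"],
        ["C0", "C1", "C2", "C3"],
        ["D0", "D1", "D2", "D3"]]
recv = alltoall(send)
assert recv[0] == ["A0", "B0", "C0", "D0"]
assert recv[2] == ["A2", "B2", "C2", "D2"]
```

Applying it twice restores the original layout, as a transpose should.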
  • 62. Asynchronous Messages An asynchronous API exists too Have to allocate buffers Have to check if send or receive has completed Will give better performance Trickier to useslide by Richard Edgar
  • 63. User-Defined Datatypes Usually have complex data structures Require means of distributing these Can pack & unpack manually MPI allows us to define own datatypes for thisslide by Richard Edgar
  • 64. MPI-2 • One-sided RMA (remote memory access) communication • potential for greater efficiency, easier programming. • Use ”windows” into memory to expose regions for access • Race conditions now possible. • Parallel I/O like message passing but to file system not other processes. • Allows for dynamic number of processes and inter-communicators (as opposed to intra-communicators) • Cleaned up MPI-1adapted from Berger & Klöckner (NYU 2010)
  • 65. RMA • Processors can designate portions of their address space as available to other processors for read/write operations (MPI_Get, MPI_Put, MPI_Accumulate). • RMA window objects created by collective window-creation fns. (MPI_Win_create must be called by all participants) • Before accessing, call MPI_Win_fence (or other synchr. mechanisms) to start RMA access epoch; fence (like a barrier) separates local ops on window from remote ops • RMA operations are non-blocking; separate synchronization needed to check completion: call MPI_Win_fence again. (Figure: Put from P0 local memory into an RMA window in P1 local memory.)adapted from Berger & Klöckner (NYU 2010)
  • 66. Some MPI Bugs
  • 67. Sample MPI Bugs (bug code shown on slide) Only works for even number of processors. What’s wrong?adapted from Berger & Klöckner (NYU 2010)
  • 68. Sample MPI Bugs (cont’d: corrected code shown on slide) Only works for even number of processors.adapted from Berger & Klöckner (NYU 2010)
  • 69. Sample MPI Bugs Suppose you have a local variable “energy” and you want to sum all the processors’ “energy” to find the total energy of the system. Recall MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). What’s wrong with using the same variable for both buffers, as in MPI_Reduce(energy, energy, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD)? (MPI forbids aliasing the send and receive buffers.)adapted from Berger & Klöckner (NYU 2010)
  • 70. Communication Topologies
  • 71. Communication Topologies Some topologies very common Grid, hypercube etc. API provided to set up communicators following theseslide by Richard Edgar
  • 72. Parallel Performance Recall Amdahl’s law: if T1 = serial cost + parallel cost then Tp = serial cost + parallel cost/p But really Tp = serial cost + parallel cost/p + Tcommunication How expensive is it?adapted from Berger & Klöckner (NYU 2010)
  • 73. Network Characteristics Interconnection network connects nodes, transfers data Important qualities: • Topology - the structure used to connect the nodes • Routing algorithm - how messages are transmitted between processors, along which path (= nodes along which message transferred). • Switching strategy = how message is cut into pieces and assigned a path • Flow control (for dealing with congestion) - stall, store data in buffers, re-route data, tell source to halt, discard, etc.adapted from Berger & Klöckner (NYU 2010)
  • 74. Interconnection Network Represent as graph G = (V , E), V = set of nodes to be connected, E = direct links between the nodes. Links usually bidirectional - transfer msg in both directions at same time. Characterize network by: • diameter - maximum over all pairs of nodes of the shortest path between the nodes (length of path in message transmission) • degree - number of direct links for a node (number of direct neighbors) • bisection bandwidth - minimum number of edges that must be removed to partition network into two parts of equal size with no connection between them. (measures network capacity for transmitting messages simultaneously) • node/edge connectivity - numbers of node/edges that must fail to disconnect the network (measure of reliability)adapted from Berger & Klöckner (NYU 2010)
  • 75. Linear Array • p vertices, p − 1 links • Diameter = p − 1 • Degree = 2 • Bisection bandwidth = 1 • Node connectivity = 1, edge connectivity = 1adapted from Berger & Klöckner (NYU 2010)
  • 76. Ring topology • diameter = p/2 • degree = 2 • bisection bandwidth = 2 • node connectivity = 2 edge connectivity = 2adapted from Berger & Klöckner (NYU 2010)
  • 77. Mesh topology • diameter = 2(√p − 1) (3D mesh: 3(p^(1/3) − 1)) • degree = 4 (6 in 3D) • bisection bandwidth = √p • node connectivity 2, edge connectivity 2 Route along each dimension in turnadapted from Berger & Klöckner (NYU 2010)
  • 78. Torus topology Diameter halved, Bisection bandwidth doubled, Edge and Node connectivity doubled over meshadapted from Berger & Klöckner (NYU 2010)
  • 79. Hypercube topology (figure: hypercubes of dimension 1–4 with binary node labels) • p = 2^k processors labelled with binary numbers of length k • k-dimensional cube constructed from two (k − 1)-cubes • Connect corresponding procs if labels differ in 1 bit (Hamming distance d between 2 k-bit binary words = path of length d between 2 nodes)adapted from Berger & Klöckner (NYU 2010)
  • 80. Hypercube topology • diameter = k (= log2 p) • degree = k • bisection bandwidth = p/2 • node connectivity k, edge connectivity kadapted from Berger & Klöckner (NYU 2010)
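These numbers are easy to check computationally: in a hypercube the graph distance between two labels equals their Hamming distance, so the diameter is k and every node has degree k. A short verification sketch for k = 4:

```python
from collections import deque

def hamming(a, b):
    return bin(a ^ b).count("1")

k = 4
p = 2 ** k
# Neighbors of node v: flip each of its k bits
neighbors = {v: [v ^ (1 << i) for i in range(k)] for v in range(p)}

# BFS distances from node 0
dist = {0: 0}
q = deque([0])
while q:
    v = q.popleft()
    for w in neighbors[v]:
        if w not in dist:
            dist[w] = dist[v] + 1
            q.append(w)

assert all(dist[v] == hamming(0, v) for v in range(p))  # graph dist = Hamming dist
assert max(dist.values()) == k                          # diameter = k = log2 p
assert all(len(ns) == k for ns in neighbors.values())   # degree = k
```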
  • 81. Dynamic Networks Above networks were direct, or static interconnection networks = processors connected directly with each other through fixed physical links. Indirect or dynamic networks = contain switches which provide an indirect connection between the nodes. Switches configured dynamically to establish a connection. • bus • crossbar • multistage network - e.g. butterfly, omega, baselineadapted from Berger & Klöckner (NYU 2010)
  • 82. Crossbar P1 P2 Pn M1 M2 Mm • Connecting n inputs and m outputs takes nm switches. (Typically only for small numbers of processors) • At each switch can either go straight or change dir. • Diameter = 1, bisection bandwidth = padapted from Berger & Klöckner (NYU 2010)
  • 83. Butterfly (figure: 16 × 16 butterfly network, stages 0–3, rows 000–111) for p = 2^(k+1) processors: k + 1 stages, 2^k switches per stage, 2 × 2 switchesadapted from Berger & Klöckner (NYU 2010)
  • 84. Fat tree • Complete binary tree • Processors at leaves • Increase links for higher bandwidth near rootadapted from Berger & Klöckner (NYU 2010)
  • 85. Current picture • Old style: mapped algorithms to topologies • New style: avoid topology-specific optimizations • Want code that runs on next year’s machines too. • Topology awareness in vendor MPI libraries? • Software topology - ease of programming, but not used for performance?adapted from Berger & Klöckner (NYU 2010)
  • 86. Should we care ?• Old school: map algorithms to specific topologies• New school: avoid topology-specific optimizations (the code should be optimal on next year’s infrastructure....)• Meta-programming / Auto-tuning ?
  • 87. Top500 Interconnects (statistics chart from the Top500 list; the chart is also available in table format via the statistics page, with a direct link to the statistics) adapted from Berger & Klöckner (NYU 2010)
  • 88. MPI References • Lawrence Livermore tutorial • Using MPI: Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, Skjellum • Using MPI-2: Advanced Features of the Message Passing Interface by Gropp, Lusk, Thakur • Lots of other on-line tutorials, books, etc.adapted from Berger & Klöckner (NYU 2010)
  • 89. Ignite: Google Trends
  • 90. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 91. MPI with CUDA MPI and CUDA almost orthogonal Each node simply becomes faster Problem matching MPI processes to GPUs Use compute-exclusive mode on GPUs Tell cluster environment to limit processes per node Have to know your cluster documentationslide by Richard Edgar
  • 92. Data Movement Communication now very expensive GPUs can only communicate via their hosts Very laborious Again: need to minimize communicationslide by Richard Edgar
  • 93. MPI Summary MPI provides cross-platform interprocess communication Invariably available on computer clusters Only need six basic commands to get started Much more sophistication availableslide by Richard Edgar
  • 94. Outline1. The problem2. Intro to MPI3. MPI Basics4. MPI+CUDA5. Other approaches
  • 95. ZeroMQ • ‘messaging middleware’ ‘TCP on steroids’ ‘new layer on the networking stack’ • not a complete messaging system • just a simple messaging library to be used programmatically. • a “pimped” socket interface allowing you to quickly design / build a complex communication system without much effort
  • 96. ZeroMQ • Fastest. Messaging. Ever. • Excellent documentation: • examples • white papers for everything • Bindings for Ada, Basic, C, Chicken Scheme, Common Lisp, C#, C++, D, Erlang*, Go*, Haskell*, Java, Lua, node.js, Objective-C, ooc, Perl, PHP, Python, Racket, Ruby, Tcl
  • 97. Message Patterns
  • 98. Demo: Why ZeroMQ ?
  • 99. MPI vs ZeroMQ ?• MPI is a specification, ZeroMQ is an implementation.• Design: • MPI is designed for tightly-coupled compute clusters with fast and reliable networks. • ZeroMQ is designed for large distributed systems (web-like).• Fault tolerance: • MPI has very limited facilities for fault tolerance (the default error handling behavior in most implementations is a system-wide fail, ouch!). • ZeroMQ is resilient to faults and network instability.• ZeroMQ could be a good transport layer for an MPI-like implementation.
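To make the earlier "TCP is painfully low level" point concrete: below is the request-reply exchange that a ZeroMQ REQ/REP pair gives you almost for free, done instead with raw sockets and a thread. Note what must be handled by hand (connection setup, teardown, and, for anything beyond tiny messages, framing and retries). Port choice and message contents are arbitrary for this sketch:

```python
import socket
import threading

def server(sock):
    conn, _ = sock.accept()
    msg = conn.recv(1024)        # no message framing: only safe for tiny messages
    assert msg == b"Hello"
    conn.sendall(b"World")
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=server, args=(srv,))
t.start()

cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"Hello")
reply = cli.recv(1024)
cli.close()
t.join()
srv.close()

assert reply == b"World"
```

With ZeroMQ the same pattern is a socket pair plus send/recv calls, and the library handles framing, reconnection, and queueing.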
  • 100. Fast Forward: CUDASA
  • 101.–117. CUDASA (Compute Unified Device Systems Architecture): motivation; system overview (application, network, bus, and GPU layers); language extensions; implementation (compiler, bus and network layers, distributed shared memory via MPI remote memory access); results on bus- and network-level parallelism (matrix multiplication benchmarks); conclusion. [Slide text garbled in extraction.]
  • 118. Fast Forward: MultiGPU MapReduce
  • 119. MapReduce
  • 120. Why MapReduce?• Simple programming model• Parallel programming model• Scalable• Previous GPU work: neither multi-GPU nor out-of-core
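The programming model is small enough to sketch in pure Python: map emits key-value pairs, a shuffle groups them by key, and reduce folds each group. GPMR keeps this structure but runs the phases across multiple GPUs; this sketch is illustrative of the model, not of GPMR's API:

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit (word, 1) for every word in this input chunk
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group emitted values by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Fold each group down to one value per key
    return {k: sum(vs) for k, vs in groups.items()}

chunks = ["the quick brown fox", "the lazy dog", "the end"]
pairs = [kv for c in chunks for kv in map_phase(c)]   # maps could run in parallel
counts = reduce_phase(shuffle(pairs))
assert counts["the"] == 3
assert counts["dog"] == 1
```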
  • 121. Benchmarks—Which• Matrix Multiplication (MM)• Word Occurrence (WO)• Sparse-Integer Occurrence (SIO)• Linear Regression (LR)• K-Means Clustering (KMC)• (Volume Renderer—presented 90 minutes ago @ MapReduce ’10)
  • 122. Benchmarks—Why• Needed to stress aspects of GPMR • Unbalanced work (WO) • Multiple emits/Non-uniform number of emits (LR, KMC, WO) • Sparsity of keys (SIO) • Accumulation (WO, LR, KMC) • Many key-value pairs (SIO) • Compute Bound Scalability (MM)
  • 123. Benchmarks—Results
  • 124. Benchmarks—Results
  vs. CPU. Table 2: Speedup for GPMR over Phoenix on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices).
                   MM       KMC     LR     SIO    WO
    1-GPU Speedup  162.712  2.991   1.296  1.450  11.080
    4-GPU Speedup  559.209  11.726  4.085  2.322  18.441
  vs. GPU. Table 3: Speedup for GPMR over Mars on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.
                   MM      KMC      WO
    1-GPU Speedup  2.695   37.344   3.098
    4-GPU Speedup  10.760  129.425  11.709
  • 125. Benchmarks - Results Good
  • 126. Benchmarks - Results Good
  • 127. Benchmarks - Results Good
  • 128. one more thing or two...
  • 129. Life/Code Hacking #3 The Pomodoro Technique
  • 130. Life/Code Hacking #3 The Pomodoro Technique .../5554725/the-pomodoro-technique-trains-your-brain-away-from-distractions
  • 131.
  • 132. COME