
Brasil Ross 2011


Basic Resource Aggregate Interface Layer. ACM ROSS 2011 Workshop presentation.


  1. BRASIL: Basic Resource Aggregation Interface Layer
     Eric Van Hensbergen (bergevan@us.ibm.com), Pravin Shinde (now at ETH Zurich), Noah Evans (now at Bell-Labs)
     IBM Research Austin. ROSS Workshop, 05/31/2011. © 2011 IBM Corporation
  2. Motivation: Data Flow Oriented HPC Problems
     • Figure 7: The data dependency graph for a portion of the Hartree-Fock procedure using a traditional formulation.
     • 64,000 Node Torus
  3. PUSH Pipelines
     • UNIX Model: a | b | c
     • PUSH Model: a |< b >| c
     • For more detail, refer to the PODC09 Short Paper on PUSH Dataflow Shell.
  4. PUSH Implementation
     [Figure: diagram with elements labeled shell, command, Pipe, Multiplexor, and Demultiplexor]
  5. Back to Motivation: 64,000 Nodes
     [Figure: nodes labeled t, L, I1, I2, c1, c2, c3, c4]
  6. Related Work/Background Work
     • MPI
     • “A Compositional Environment for MPI Programs”
     • Hadoop & Other Map/Reduce Solutions
     • cpu (from Plan 9) - http://plan9.bell-labs.com/magic/man2html/1/cpu
     • xcpu (from LANL) - http://www.xcpu.org/
     • xcpu2 (from LANL)
  7. File System Interfaces
     Control File Syntax
     • reserve [n] [os] [arch]
     • dir [wdir]
     • exec [command] [args]
     • kill
     • killonclose
     • nice [n]
     • splice [path]
     Environment Syntax
     • [key] = [value]
     Namespace Syntax
     • mount [-abcC] [server] [mnt]
     • bind [-abcC] [new] [old]
     • import [-abc] [host] [path]
     • cd [dir]
     • unmount [new] [old]
     • clear
     • . [path]
     [Figure: file system hierarchy diagram; labels not recoverable]
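The control file syntax above is a verb followed by optional arguments. As a minimal sketch of how such a line could be tokenized (not the BRASIL implementation; the `ctl_cmd` struct, `parse_ctl` function, and the size limits are hypothetical names and values chosen for illustration):

```c
#include <string.h>

#define MAX_ARGS 8

/* Hypothetical container for one parsed control-file command. */
struct ctl_cmd {
    char buf[128];              /* private copy of the line, split in place */
    const char *verb;           /* e.g. "reserve", "exec", "splice" */
    const char *args[MAX_ARGS]; /* pointers into buf */
    int nargs;
};

/*
 * Split "verb arg1 arg2 ..." on spaces, tabs, and newlines.
 * Returns 0 on success, -1 if the line is empty, blank, or too long.
 * Uses strtok, so it is not reentrant; arguments beyond MAX_ARGS are dropped.
 */
int parse_ctl(const char *line, struct ctl_cmd *out)
{
    size_t len = strlen(line);
    char *tok;

    if (len == 0 || len >= sizeof out->buf)
        return -1;
    memcpy(out->buf, line, len + 1);

    out->nargs = 0;
    out->verb = strtok(out->buf, " \t\n");
    if (out->verb == NULL)
        return -1;

    while (out->nargs < MAX_ARGS && (tok = strtok(NULL, " \t\n")) != NULL)
        out->args[out->nargs++] = tok;
    return 0;
}
```

For example, `parse_ctl("reserve 4 linux x86", &c)` would leave `c.verb` as `reserve` with three arguments; what each argument means is up to the server that owns the ctl file.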
  8. Distribution and Aggregation
     [Figure: distribution and aggregation diagram; labels not recoverable]
  9. Command Line Interface(s)
     • Direct File Interaction
         echo "res 2" > ./mpoint/csrv/local/0/ctl
         echo "exec date" > ./mpoint/csrv/local/0/0/ctl
         echo "exec wc" > ./mpoint/csrv/local/0/1/ctl
         echo "xsplice 0 1" > ./mpoint/csrv/0/local/0/ctl
         cat ./mpoint/csrv/0/local/0/stdio
     • Python Command Line
         runJob.py -n 4 /bin/date
     • PUSH
         ORS=blm find . -type f |< xargs chasen | sort | uniq -c >| sort -rn
  10. Evaluation Setup
  12. Simple Micro-benchmark: Job start times
      [Figure: chart; axis labels and legend not recoverable]
  13. Job Completion Times with I/O
      [Figure: chart; axis labels and legend not recoverable]
  14. Discussion
      • Non-buffered communication channels are great for synchronization, but bad for performance, because aggregation points become serialization points.
      • Pushing multiplexing into PUSH was a mistake:
          • it adds additional copies of data and additional context switches for I/O to do record separation
          • by pushing them into the infrastructure we can go from 6 copies and context switches down to 2 for each pipeline stage
          • “beltway buffers” and other techniques may reduce this further or eliminate copies altogether
      • Transitive mounts of the namespace are an elegant way to bypass network segmentation, but they also incur a lot of overhead.
      • Allowing splice inside a reservation has dubious usefulness. It might be more useful to allow individual elements to be spawned outside a reservation.
      • Fan-in and fan-out are too limiting a use case. What about deterministic delivery? What about many-to-many pipeline components? What about collectives and barriers?
      • Fault tolerance and fault debugging properties were poor: there was no channel for out-of-band communication of error or logging information. When transitive mounts went down, the system wedged, and there was no integrated method of figuring out where it went down.
      • File systems and communication are outside the scope of this work, but are still a sore point for performance with the Plan 9 kernel on Blue Gene. Both issues are actively being worked on, with a release planned later this summer.
  15. Future Work: Cutting Out The Middle Man
      [Figure: App, Brasil, and Kernel layers on the Initiating Terminal or Node, the Parent or Gateway Node(s), and the Compute Node]
  16. Future Work: Splice Optimizations via Direct Communication
      • Splicing through the name space is elegant, but transitive mounting of the name space means that data has to flow through the hierarchy in order to be spliced from one element to another.
      • A direct communication path would be much more desirable, particularly on BG/P, where the hierarchy is constructed on the tree network but the torus is the preferred node-to-node data transport.
      • Solution: add a layer of indirection to splice communication
          • add a file to the session directories containing a list of “locations” for this node, which essentially gives recipes for connecting directly to the channel, in order of performance. If the source node cannot use any of these recipes, it defaults to the name space path.
          • example:
              % cat ./mpoint/csrv/parent/c4/local/0/location
              torus!1.3.4!564
              tree!0.23!564
              tcp!10.1.3.4!564
          • in the simplest embodiment we mount the namespace of the node in order to access its interfaces, but we also want to experiment with direct connections for data and/or potentially hiding everything inside the csrv interface, so that the mpipe splice code (and end-users) can be more or less ignorant of how the resources are accessed
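The location file above lists `proto!addr!port` recipes in order of preference, with the name space path as the fallback. A minimal sketch of that selection logic follows, assuming one recipe per line as in the slide's example; `struct location`, `parse_location`, and `pick_location` are hypothetical names, not part of BRASIL, and the list of usable transports is supplied by the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of one "proto!addr!port" recipe. */
struct location {
    char proto[16];
    char addr[64];
    char port[16];
};

/* Parse "proto!addr!port". Returns 0 on success, -1 if malformed or too long. */
int parse_location(const char *s, struct location *loc)
{
    const char *b1 = strchr(s, '!');
    const char *b2 = b1 ? strchr(b1 + 1, '!') : NULL;
    size_t plen, alen, tlen;

    if (b1 == NULL || b2 == NULL)
        return -1;
    plen = (size_t)(b1 - s);
    alen = (size_t)(b2 - b1 - 1);
    tlen = strlen(b2 + 1);
    if (plen == 0 || alen == 0 || tlen == 0)
        return -1;
    if (plen >= sizeof loc->proto || alen >= sizeof loc->addr || tlen >= sizeof loc->port)
        return -1;

    memcpy(loc->proto, s, plen);
    loc->proto[plen] = '\0';
    memcpy(loc->addr, b1 + 1, alen);
    loc->addr[alen] = '\0';
    memcpy(loc->port, b2 + 1, tlen + 1);
    return 0;
}

/*
 * Walk the recipes in file order (best first) and return the index of the
 * first one whose transport the caller can use, filling *out.
 * Returns -1 when none applies, meaning: fall back to the name space path.
 */
int pick_location(const char *const entries[], size_t n,
                  const char *const usable[], size_t nusable,
                  struct location *out)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        struct location loc;
        if (parse_location(entries[i], &loc) != 0)
            continue;   /* skip malformed recipes */
        for (j = 0; j < nusable; j++) {
            if (strcmp(loc.proto, usable[j]) == 0) {
                *out = loc;
                return (int)i;
            }
        }
    }
    return -1;
}
```

With the three recipes from the slide and a caller that can only use `tcp` and `tree`, the sketch selects the `tree` entry, since it appears earlier in the file than `tcp`.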
  17. Future Work:
      • Turn fan-out/fan-in/many-to-many pipeline operations into system primitives with syntax similar to existing pipe primitives but with the semantics of a multipipe
          • respect record boundaries
          • allow broadcast
          • allow enumerated specification of destination
          • support splice operations
      • Build out the rest of the brasil infrastructure based on this new primitive
          • All stdio channels are multipipes
          • allow record buffers for decoupled performance
          • All ctl channels implemented on top of multipipes
          • etc.
      • The same model could potentially be used for implementation of collective operations and barriers within a distributed file system namespace
      [Figure: header fields TYPE, SIZE, DESTINATION, PARAMETERS; pwrite(pipefd, buf, sz, ~(0));]
  18. Brasil Code Available: http://www.bitbucket.org/ericvh/hare/usr/brasil
      New Version In Progress: http://www.bitbucket.org/ericvh/hare/sys/src/cmd/uem
      http://goo.gl/5eFB
      http://www.research.ibm.com/austin
      This project is supported in part by the U.S. Department of Energy under Award Number DE-FG02-08ER25851.
  19. Core Concept: BRASIL (Basic Resource Aggregate System Inferno Layer)
      • Stripped down Inferno: no GUI or anything we can live without, minimal footprint
      • Runs as a daemon (no console); all interaction is via 9p mounts of its namespace
      • Different modes
          • default (exports /srv/brasil or on tcp!127.0.0.1!5670)
          • gateway (exports over standard I/O, to be used by ssh initialization)
          • terminal (initiates ssh connection and starts a gateway)
      • Runs EVERYWHERE
          • User’s workstation
          • Surveyor Login Nodes
          • I/O Nodes
          • Compute Nodes
  20. Preferred Embodiment: BRASIL Desktop Extension Model
      [Figure: workstation connected by ssh-duct to a login node, with I/O and CPU nodes behind it]
      • Setup
          • User starts brasild on the workstation
          • brasild ssh’s to the login node and starts another brasil, hooking the two together with 27b-6 and mounting resources in /csrv
          • User mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p or ACME-sac)
      • Boot
          • User runs the anl/run script on the workstation
          • the script interacts with taskfs on the login node to start cobalt qsub
          • when I/O nodes boot, they connect their csrv to the login csrv
          • when CPU nodes boot, they connect to the csrv on their I/O node
      • Task Execution
          • User runs the anl/exec script on the workstation to run the app
          • the script reserves x nodes for the app using taskfs
          • taskfs on the workstation aggregates execution by using the taskfs running on the I/O nodes
  21. Core Concept: Central Services
      • Establish hierarchical namespace of cluster services
      • Automount remote servers based on reference (ie. cd /csrv/criswell)
      • Export local services for use elsewhere within the network
      [Figure: /csrv trees on nodes t, L, l1, l2, c1, c2, c3, c4, each with a /local entry and entries for neighboring nodes]
  22. Our Approach: Workload Optimized Distribution
      • Desktop Extension
      • PUSH Pipeline Model
      • Aggregation Via Dynamic Namespace and Distributed Service Model
      • Scaling
      • Reliability
      [Figure: diagram with elements labeled local service, proxy service, aggregate service, and remote services; other labels not recoverable]
  28. “Old” Performance Graph (from USENIX)
      Evaluations: Deployment and aggregation time
      [Figure: graph including an XCPU3 series; other labels not recoverable]
  29. Problem: Limitations of Traditional Pipes
      [Figure: pipe diagrams with data labeled A and B, including an interleaved ABABAB stream]
      © 2010 IBM Corporation
  30. Long Packet Pipes
      [Figure: pipe diagrams with data labeled A and B, including grouped AAA BBB records]
  31. Enumerated Pipes
      [Figure: pipe diagram with input AB delivered as 1:A and 2:B]
  32. Collective Pipes
      • Broadcast
      • Reduce(+)
      • Allreduce(+)
      [Figure: pipe diagrams over inputs A, B, C, with outputs labeled A, (B+C), and (A+B+C)]
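The three collective patterns named on this slide can be illustrated on plain in-memory arrays. This is only a sketch of the semantics under the assumption that each writer contributes one integer and the operator is `+`; it does not model mpipefs itself, and `broadcast`, `reduce_sum`, and `allreduce_sum` are hypothetical names.

```c
#include <stddef.h>

/* Broadcast: every reader receives a copy of the single value written. */
void broadcast(long value, long out[], size_t nreaders)
{
    size_t i;
    for (i = 0; i < nreaders; i++)
        out[i] = value;
}

/* Reduce(+): the values from all writers are folded into one result. */
long reduce_sum(const long in[], size_t nwriters)
{
    long acc = 0;
    size_t i;
    for (i = 0; i < nwriters; i++)
        acc += in[i];
    return acc;
}

/* Allreduce(+): reduce, then broadcast the result to every participant. */
void allreduce_sum(const long in[], long out[], size_t n)
{
    broadcast(reduce_sum(in, n), out, n);
}
```

Expressing allreduce as reduce followed by broadcast keeps the sketch short; a real implementation would be free to combine the two steps.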
  33. Splicing Pipes
      • spliceto(b)
      • splicefrom(b)
      [Figure: before and after diagrams of pipes a and b for each operation]
  34. Example Simple Invocation
      • mpipefs
      • mount /srv/mpipe /n/testpipe
      • ls -l /n/testpipe
          --rw-rw-rw- M 24 ericvh ericvh 0 Oct 10 18:10 /n/testpipe/data
      • echo hello > /n/testpipe/data
      • cat /n/testpipe/data
          hello
  35. Passing Arguments via aname
      • mount /srv/mpipe /n/test othername
      • ls -l /n/test
          --rw-rw-rw- M 26 ericvh ericvh 0 Oct 10 18:12 /n/test/otherpipe
      • mount /srv/mpipe /n/test2 -b bcastpipe
      • mount /srv/mpipe /n/test3 -e 5 enumpipe
      • ...you get the idea; read the man page for more details
  36. Example for writing control blocks
      int
      pipewrite(int fd, char *data, ulong size, ulong which)
      {
          int n;
          char hdr[255];
          ulong tag = ~0;
          char pkttype = 'p';

          /* header byte is at offset ~0 */
          n = snprint(hdr, 31, "%c\n%lud\n%lud\n\n", pkttype, size, which);
          n = pwrite(fd, hdr, n+1, tag);
          if(n <= 0)
              return n;
          /* the payload itself goes out as an ordinary write */
          return write(fd, data, size);
      }
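The slide's code uses Plan 9 C (`snprint`, `ulong`, the `%lud` verb). A portable sketch of just the header-formatting step is shown below. It assumes the header is the packet type, the payload size, and the destination, each terminated by a newline, with an extra blank line at the end; that layout is inferred from the slide's format string, whose quotes and backslashes were lost in the transcript. `format_hdr` is a hypothetical helper, not part of mpipefs.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a control-block header into hdr (capacity cap bytes).
 * Assumed layout: "<type>\n<size>\n<which>\n\n".
 * Returns the header length excluding the terminating NUL,
 * or -1 if the buffer is too small or formatting fails.
 */
int format_hdr(char *hdr, size_t cap, char pkttype,
               unsigned long size, unsigned long which)
{
    int n = snprintf(hdr, cap, "%c\n%lu\n%lu\n\n", pkttype, size, which);

    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

A caller would then write `n + 1` bytes of the header at the special offset and follow it with the payload, as `pipewrite` does on the slide.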
  37. Larger Example (execfs)
      /proc
          /clone
          /###
              /stdin      mount -a /srv/mpipe /proc/### stdin
              /stdout     mount -a /srv/mpipe /proc/### stdout
              /stderr     mount -a /srv/mpipe /proc/### stderr
              /args /ctl /fd /fpregs /kregs /mem /note /noteid /notepg /ns /proc /profile /regs /segment /status /text /wait
  38. Really Large Example (gangfs)
      /proc
          /gclone
          /status
          /g###
              /stdin      mount -a /srv/mpipe /proc/### -b stdin
              /stdout     mount -a /srv/mpipe /proc/### stdout
              /stderr     mount -a /srv/mpipe /proc/### stderr
              /ctl /ns /status /wait
      • ...and then, post exec from gangfs clone, execfs stdins are splicedfrom g#/stdin
      • ...and then execfs stdouts and stderrs are splicedto g#/stdout and g#/stderr
      • ...and you can use -e # with stdin to get enumerated instead of broadcast pipes
