
Brasil ROSS 2011

Basic Resource Aggregation Interface Layer. ACM ROSS 2011 Workshop presentation.


Transcript

  • 1. BRASIL: Basic Resource Aggregation Interface Layer. Eric Van Hensbergen (bergevan@us.ibm.com), Pravin Shinde (now at ETH Zurich), Noah Evans (now at Bell Labs). IBM Research Austin. ROSS Workshop, 05/31/2011. © 2011 IBM Corporation
  • 2. Motivation: Data Flow Oriented HPC. [Figure: data dependency graph for a portion of the Hartree-Fock procedure using a traditional formulation; target platform is a 64,000-node torus.]
  • 3. PUSH Pipelines. UNIX model: a | b | c. PUSH model: a |< b >| c. For more detail, refer to the PODC09 short paper on the PUSH dataflow shell.
  • 4. PUSH Implementation. [Diagram: each command runs under its own shell connected by pipes; a multiplexor fans input out to the parallel shell/command instances and a demultiplexor fans their output back in.]
  • 5. Back to Motivation: 64,000 Nodes. [Diagram: tree of nodes t, L, I1, I2, c1-c4 (terminal, login, I/O, and compute nodes).]
  • 6. Related Work / Background • MPI • "A Compositional Environment for MPI Programs" • Hadoop & other Map/Reduce solutions • cpu (from Plan 9) - http://plan9.bell-labs.com/magic/man2html/1/cpu • xcpu (from LANL) - http://www.xcpu.org/ • xcpu2 (from LANL)
  • 7. File System Interfaces
      Control file syntax: reserve [n] [os] [arch]; dir [wdir]; exec [command] [args]; kill; killonclose; nice [n]; splice [path]
      Environment syntax: [key] = [value]
      Namespace syntax: mount [-abcC] [server] [mnt]; bind [-abcC] [new] [old]; import [-abc] [host] [path]; cd [dir]; unmount [new] [old]; clear; . [path]
      [Figure: session directory layout exported by the file server; labels garbled in the transcript.]
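  A minimal sketch of driving this control-file interface from a client program. The paths and the "res 2" / "exec date" commands are taken from the command-line examples on slide 9; the one-write-per-command behavior and the error handling are assumptions, not code from the deck.

      /* Sketch: issue Brasil control-file commands by writing them to
       * the session ctl files mounted under ./mpoint (paths as in the
       * CLI examples later in this deck). */
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      static int ctlwrite(const char *path, const char *cmd)
      {
              int fd = open(path, O_WRONLY);
              if (fd < 0)
                      return -1;
              ssize_t n = write(fd, cmd, strlen(cmd));
              close(fd);
              return n < 0 ? -1 : 0;
      }

      int main(void)
      {
              /* reserve two nodes on the local aggregation point */
              if (ctlwrite("./mpoint/csrv/local/0/ctl", "res 2") < 0)
                      perror("reserve");
              /* run a command on the first sub-session */
              if (ctlwrite("./mpoint/csrv/local/0/0/ctl", "exec date") < 0)
                      perror("exec");
              return 0;
      }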
  • 8. Distribution and Aggregation. [Diagram: per-node session directories are aggregated up the node hierarchy, with sessions on an aggregation point fanning out to sub-sessions on child nodes; labels garbled in the transcript.]
  • 9. Command Line Interface(s)
      • Direct file interaction:
          echo "res 2" > ./mpoint/csrv/local/0/ctl
          echo "exec date" > ./mpoint/csrv/local/0/0/ctl
          echo "exec wc" > ./mpoint/csrv/local/0/1/ctl
          echo "xsplice 0 1" > ./mpoint/csrv/0/local/0/ctl
          cat ./mpoint/csrv/0/local/0/stdio
      • Python command line: runJob.py -n 4 /bin/date
      • PUSH: ORS=blm find . -type f |< xargs chasen | sort | uniq -c >| sort -rn
  • 10. Evaluation Setup
  • 11. [Figure: evaluation setup; no recoverable text in the transcript.]
  • 12. Simple Micro-benchmark: Job Start Times. [Chart: deployment and startup time versus number of nodes, broken down by phase (reservation, aggregation, execution, housekeeping, termination); data garbled in the transcript.]
  • 13. Job Completion Times with I/O. [Chart: completion time versus number of nodes, broken down by phase (reservation, aggregation, input, execution, housekeeping, termination); data garbled in the transcript.]
  • 14. Discussion
      • Non-buffered communication channels are great for synchronization, but bad for performance: aggregation points become serialization points.
      • Pushing multiplexing out to PUSH was a mistake: it adds additional copies of data and additional context switches for I/O to do record separation. By pushing them into the infrastructure we can go from 6 copies and context switches down to 2 for each pipeline stage; "beltway buffers" and other techniques may reduce this further or eliminate copies altogether.
      • Transitive mounts of namespace are an elegant way to bypass network segmentation, but they also incur a lot of overhead.
      • Allowing splice inside a reservation has dubious usefulness. It seems like it might be more useful to allow individual elements to be spawned outside a reservation.
      • Fan-in and fan-out are too limiting a use case. What about deterministic delivery? What about many-to-many pipeline components? What about collectives and barriers?
      • Fault tolerance and fault debugging properties were poor: there is no channel for out-of-band communication of error or logging information. When transitive mounts went down, the system wedged, with no integrated method of figuring out where it went down.
      • File systems and communication are outside the scope of this work, but still a sore point for performance with the Plan 9 kernel on Blue Gene. Both issues are actively being worked on, with a release planned later this summer.
  • 15. Future Work: Cutting Out the Middle Man. [Diagram: today App sits on Brasil on the kernel at the initiating terminal or node, the parent or gateway node(s), and the compute node; the future-work version drops the intermediate Brasil layers so applications talk to the kernels directly.]
  • 16. Future Work: Splice Optimizations via Direct Communication
      • Splicing through the name space is elegant, but transitive mounting of name space means that data has to flow through the hierarchy in order to be spliced from one element to another.
      • A direct communication path would be much more desirable, particularly on BG/P where the hierarchy is constructed on the tree network but the torus is the preferred node-to-node data transport.
      • Solution: add a layer of indirection to splice communication. Add a file to the session directories which contains a list of "locations" for this node, essentially a recipe for connecting directly to the channel, in order of performance. If the source node cannot use any of these recipes it will default to the name space path (see the sketch after this list).
      • Example:
          % cat ./mpoint/csrv/parent/c4/local/0/location
          torus!1.3.4!564
          tree!0.23!564
          tcp!10.1.3.4!564
      • In the simplest embodiment we mount the namespace of the node in order to access its interfaces, but we also want to experiment with direct connections for data and/or potentially hiding everything inside the csrv interface so the mpipe splice code (and end users) can be more or less ignorant of how the resources are accessed.
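  A rough client-side sketch of the location-file fallback described above; this is not code from the deck. The usable() probe stands in for whatever torus/tree/tcp dialing the real csrv/mpipe code would perform, and the name-space fallback path is illustrative.

      /* Sketch: pick a splice destination from the per-node "location"
       * file (recipes listed in order of preference), falling back to
       * the name space path when none of the recipes is usable. */
      #include <stdio.h>
      #include <string.h>

      static int usable(const char *addr)
      {
              /* placeholder policy: pretend only tcp!... is reachable */
              return strncmp(addr, "tcp!", 4) == 0;
      }

      static void pick_location(const char *locpath, const char *nspath,
                                char *out, size_t outlen)
      {
              char line[128];
              FILE *f = fopen(locpath, "r");

              snprintf(out, outlen, "%s", nspath);   /* default: name space path */
              if (f == NULL)
                      return;
              while (fgets(line, sizeof line, f) != NULL) {
                      line[strcspn(line, "\n")] = '\0';
                      if (usable(line)) {            /* e.g. "torus!1.3.4!564" */
                              snprintf(out, outlen, "%s", line);
                              break;
                      }
              }
              fclose(f);
      }

      int main(void)
      {
              char chosen[128];
              pick_location("./mpoint/csrv/parent/c4/local/0/location",
                            "./mpoint/csrv/parent/c4/local/0/stdio",
                            chosen, sizeof chosen);
              printf("splice via %s\n", chosen);
              return 0;
      }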
  • 17. Future Work:
      • Turn fan-out/fan-in/many-to-many pipeline operations into a system primitive with syntax similar to the existing pipe primitives but with the semantics of a multipipe: respect record boundaries, allow broadcast, allow enumerated specification of destination, support splice operations.
      • Build out the rest of the Brasil infrastructure based on this new primitive: all stdio channels are multipipes; allow record buffers for decoupled performance; all ctl channels implemented on top of multipipes; etc.
      • The same model could potentially be used to implement collective operations and barriers within a distributed file system namespace.
      [Figure: multipipe record header fields TYPE, SIZE, DESTINATION, PARAMETERS; records are written with pwrite(pipefd, buf, sz, ~(0)).]
  • 18. Brasil code available: http://www.bitbucket.org/ericvh/hare/usr/brasil
      New version in progress: http://www.bitbucket.org/ericvh/hare/sys/src/cmd/uem
      http://goo.gl/5eFB
      http://www.research.ibm.com/austin
      This project is supported in part by the U.S. Department of Energy under Award Number DE-FG02-08ER25851.
  • 19. Core Concept: BRASIL (Basic Resource Aggregation Interface Layer)
      • Stripped-down Inferno: no GUI or anything we can live without; minimal footprint.
      • Runs as a daemon (no console); all interaction via 9P mounts of its namespace.
      • Different modes: default (exports /srv/brasil or on tcp!127.0.0.1!5670); gateway (exports over standard I/O, to be used by ssh initialization); terminal (initiates an ssh connection and starts a gateway).
      • Runs EVERYWHERE: user's workstation, Surveyor login nodes, I/O nodes, compute nodes.
  • 20. Preferred Embodiment: BRASIL Desktop Extension Model. [Diagram: workstation connected to a login node over an ssh duct; the login node reaches the I/O and CPU nodes.]
      • Setup: user starts brasild on the workstation; brasild ssh's to the login node and starts another brasil, hooking the two together with 27b-6 and mounting resources in /csrv; user mounts brasild on the workstation into the namespace using 9pfuse or v9fs (or can mount from a Plan 9 peer node, 9vx, p9p or ACME-sac).
      • Boot: user runs the anl/run script on the workstation; the script interacts with taskfs on the login node to start cobalt qsub; when the I/O nodes boot they connect their csrv to the login csrv; when the CPU nodes boot they connect to csrv on the I/O node.
      • Task execution: user runs the anl/exec script on the workstation to run the app; the script reserves x nodes for the app using taskfs; taskfs on the workstation aggregates execution by using taskfs running on the I/O nodes.
  • 21. Core Concept: Central Services
      • Establish a hierarchical namespace of cluster services in /csrv.
      • Automount remote servers based on reference (i.e. cd /csrv/criswell).
      • Export local services for use elsewhere within the network.
      [Diagram: /csrv trees on the terminal, login, I/O, and compute nodes (t, L, l1, l2, c1-c4), each containing /local plus entries for the other nodes.]
  • 22-27. Our Approach: Workload Optimized Distribution. [Progressive build of one figure across six slides, adding in turn: Desktop Extension; PUSH Pipeline Model; Aggregation via Dynamic Namespace and Distributed Service Model (local, proxy, aggregate, and remote services); Scaling; Reliability. Diagram labels are garbled in the transcript.]
  • 28. Evaluations: Deployment and Aggregation Time. ["Old" performance graph (from USENIX) comparing against XCPU3; data garbled in the transcript.]
  • 29. Problem: Limitations of Traditional Pipes. [Diagram: records from two writers A and B arrive interleaved and arbitrarily split at the reader, so record boundaries are lost.]
  • 30. Long Packet Pipes. [Diagram: writes from A and B are carried as whole records, so the reader sees each writer's records intact rather than interleaved bytes.]
  • 31. Enumerated Pipes. [Diagram: a writer addresses records to numbered endpoints, e.g. record A to destination 1 and record B to destination 2.]
  • 32. Collective Pipes. [Diagram: Broadcast delivers A to every reader; Reduce(+) combines B and C into (B+C); Allreduce(+) combines A, B, and C into (A+B+C) and delivers the result to everyone.]
  • 33. Splicing Pipes. [Diagram: spliceto(b) on pipe a redirects a's output into b; splicefrom(b) on pipe a feeds a's input from b.]
  • 34. Example: Simple Invocation
      • mpipefs
      • mount /srv/mpipe /n/testpipe
      • ls -l /n/testpipe
          --rw-rw-rw- M 24 ericvh ericvh 0 Oct 10 18:10 /n/testpipe/data
      • echo hello > /n/testpipe/data
      • cat /n/testpipe/data
          hello
  • 35. Passing Arguments via aname
      • mount /srv/mpipe /n/test othername
      • ls -l /n/test
          --rw-rw-rw- M 26 ericvh ericvh 0 Oct 10 18:12 /n/test/otherpipe
      • mount /srv/mpipe /n/test2 -b bcastpipe
      • mount /srv/mpipe /n/test3 -e 5 enumpipe
      • ...you get the idea; read the man page for more details
  • 36. Example for writing control blocks
      int
      pipewrite(int fd, char *data, ulong size, ulong which)
      {
              int n;
              char hdr[255];
              ulong tag = ~0;
              char pkttype = 'p';

              /* header block is written at offset ~0 */
              n = snprint(hdr, 31, "%c\n%lud\n%lud\n\n", pkttype, size, which);
              n = pwrite(fd, hdr, n+1, tag);
              if(n <= 0)
                      return n;
              return write(fd, data, size);
      }
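  A possible caller for the pipewrite() routine above, sketched against an enumerated mpipe like the /n/test3 mount on the previous slide. The data-file name and the destination numbers are assumptions for illustration, not taken from the deck.

      /* Plan 9 C sketch: send one record to each of two enumerated
       * destinations on an mpipe mounted as on slide 35 (-e 5 enumpipe).
       * The file name /n/test3/enumpipe follows the aname pattern and
       * is assumed; destination numbers are illustrative. */
      #include <u.h>
      #include <libc.h>

      int pipewrite(int fd, char *data, ulong size, ulong which);

      void
      main(void)
      {
              int fd;

              fd = open("/n/test3/enumpipe", OWRITE);
              if(fd < 0)
                      sysfatal("open: %r");
              pipewrite(fd, "hello", 5, 1);   /* record for destination 1 */
              pipewrite(fd, "world", 5, 2);   /* record for destination 2 */
              close(fd);
              exits(nil);
      }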
  • 37. Larger Example (execfs)
      /proc
        /clone
        /###
          /stdin   (mount -a /srv/mpipe /proc/### stdin)
          /stdout  (mount -a /srv/mpipe /proc/### stdout)
          /stderr  (mount -a /srv/mpipe /proc/### stderr)
          /args /ctl /fd /fpregs /kregs /mem /note /noteid /notepg /ns /proc /profile /regs /segment /status /text /wait
  • 38. Really Large Example (gangfs)
      /proc
        /gclone
        /status
        /g###
          /stdin   (mount -a /srv/mpipe /proc/### -b stdin)
          /stdout  (mount -a /srv/mpipe /proc/### stdout)
          /stderr  (mount -a /srv/mpipe /proc/### stderr)
          /ctl /ns /status /wait
      ...and then, post exec from a gangfs clone, execfs stdins are splicedfrom g#/stdin
      ...and then execfs stdouts and stderrs are splicedto g#/stdout and g#/stderr
      ...and you can use -e # with stdin to get enumerated instead of broadcast pipes