A DSEL for Addressing the Problems Posed by Parallel Architectures



Jason Mc Guiness, Colin Egan
CTCA, School of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire, UK
overload@hussar.demon.co.uk

1. INTRODUCTION

Computers with multiple pipelines have become increasingly prevalent, hence the rise in parallelism available to the programming community: from dual-core desktop workstations, through multiple-core, multiple-processor blade frames that may contain hundreds of pipelines in data centres, to state-of-the-art mainframes in the Top500 supercomputer list with thousands of cores, and the potential arrival of next-generation cellular architectures that may have millions of cores. This surfeit of hardware parallelism has apparently yet to be tamed in the software-architecture arena.

Various attempts to meet this challenge have been made over the decades, using languages, compilers or libraries to enable programmers to enhance the parallelism within their various problem domains. Yet the common folklore in computer science has remained that it is hard to program parallel algorithms correctly.

This paper examines what language features would need to be added to an existing imperative language that has little if any native support for implementing parallelism, apart from a simple library that exposes the OS-level threading primitives. The goal of the authors has been to create a minimal and orthogonal DSEL that would add the capabilities of parallelism to that target language. Moreover, the DSEL proposed will be demonstrated to have such useful guarantees as a correct, heuristically efficient schedule. In terms of correctness, the DSEL guarantees that it can provide deadlock-free and race-condition-free schedules. In terms of efficiency, the schedule produced will be shown to add no worse than a poly-logarithmic order to the algorithmic run-time of the schedule of the program on a CREW-PRAM (Concurrent-Read, Exclusive-Write Parallel Random-Access Machine [19]) or EREW-PRAM (Exclusive-Read, Exclusive-Write PRAM [19]) computation model. Furthermore, the DSEL described assists the user with regard to debugging the resultant parallel program. An implementation of the DSEL in C++ exists: further details may be found in [12].

2. RELATED WORK

From a hardware perspective, the evolution of computer architectures has been heavily influenced by the von Neumann model. This has meant that, given the relative increase of processor speed over memory speed, the introduction of memory hierarchies [3] and out-of-order instruction scheduling has been highly successful. However, these extra levels increase the penalty associated with a miss in the memory subsystem, due to memory-access times, limiting the ILP (Instruction-Level Parallelism). There may also be an increase in the design complexity and power consumption of the overall system. An approach to avoiding this problem may be to fetch sets of instructions from different memory banks, i.e. to introduce threads, which would allow an increase in ILP in proportion to the number of executing threads.

From a software perspective, the challenge that these parallel architectures have presented to programmers is the massive parallelism they expose. Much work has been done in the field of parallelizing software:

• Auto-parallelizing compilers: such as EARTH-C [17]. Much of the work on developing auto-parallelizing compilers has derived from the data-flow community [16].
• Language support: such as Erlang [20], UPC [5], or Intel's [18] and Microsoft's C++ compilers based upon OpenMP.
• Library support: such as POSIX threads (pthreads) or Win32, MPI, OpenMP, Boost, Intel's TBB [14], Cilk [10] or various libraries targeting C++ [6, 2]. Intel's TBB has higher-level threading constructs, but it has not supplied parallel algorithms, nor has it provided any guarantees regarding its library. It also suffers from mixing the code that generates the parallel schedule with the business logic, which makes testing more complex.

These have all had varying levels of success, as discussed in part in [11], with regard to addressing the issues of programming effectively for such parallel architectures.

3. MOTIVATION

The basic issues addressed by all of these approaches have been correctness or optimization. So far it has appeared
that the compiler- and language-based approaches have been the only approaches able to address both of those issues together. But the language-based approaches require that programmers re-implement their programs in a potentially novel language, a change that has been very hard for business to adopt, severely limiting the use of these approaches.

Amongst the criticisms raised regarding the use of libraries [11, 13] such as pthreads, Win32 or OpenMP have been:

• They have been too low-level, so using them to write correct multi-threaded programs has been very hard; they suffer from composition problems. This problem may be summarized as: atomic access to an object would be contained within each object (using classic OOD), thus when composing multiple objects, multiple separate locks, from the different objects, have to be manipulated to guarantee correct access. If this were done correctly, the usual outcome has been a serious reduction in scalability.
• A related issue has been that the programmer often intimately entangles the thread-safety, thread scheduling and business logic of their code. This means that each program would effectively be a bespoke program, requiring re-testing of each program for threading issues as well as business-logic issues.
• Also, debugging such code has been found to be very hard. Debuggers for multi-threaded code have been an open area of research for some time.

Given that the language has to be immutable, a DSEL defined by a library that attempts to support the correctness and optimality of the language and compiler approaches, and yet somehow overcomes the limitations of the usual library-based approaches, would seem to be ideal. This DSEL will now be presented.

4. THE DSEL TO ASSIST PARALLELISM

We chose to address these issues by defining a carefully crafted DSEL, then examining its properties to demonstrate that the DSEL achieved the goals. The DSEL should have the following properties:

• The DSEL shall target what may be termed general-purpose threading, which the authors define to be scheduling in which the conditions or loop-bounds may not be computed at compile-time, nor could they be represented as monads, so could not be memoized¹. In particular the DSEL shall support both data-flow and data-parallel constructs.
• By being implemented in an existing language it would avoid the necessity of re-implementing programs, so a more progressive approach to adoption could be taken.
• It shall be a reasonably small DSEL, but large enough to provide sufficient extensions to the host language to express parallel constructs in a manner that would be natural to a programmer using that language.
• It shall assist in debugging any use of a conforming implementation.
• It should provide guarantees regarding those banes of parallel programming: deadlocks and race-conditions.
• Moreover, it should provide guarantees regarding the algorithmic complexity of any parallel schedule it would generate.

¹ A compile- or run-time optimisation technique involving a space-time tradeoff: re-computation of pure functions, when provided with the same arguments, may be avoided by caching the result; the result will be the same for each call with the same arguments, if the function has no side-effects.

Initially a description of the grammar will be given, followed by a discussion of some of the properties of the DSEL. Finally some theoretical results derived from the grammar of the DSEL will be given.

4.1 Detailed Grammar of the DSEL

The various types, production rules and operations that define the DSEL will be given in this section. The basic types will be defined first, then the operations upon those types. C++ has been chosen as the target language in which to implement the DSEL, due to the rich ability within C++ to extend the type system at compile-time: primarily using templates, but also by overloading various operators. Hence the presentation of the grammar relies on the grammar of C++, so it would assist the reader to be familiar with that grammar, in particular Annex A of the ISO C++ Standard [8]. Although C++11 has some support for threading, this had not been widely implemented at the time of writing; moreover, that specification does not address the points of the DSEL in this paper.

Some clarifications:

• The subscript opt means that the keyword is optional.
• The subscript def means that the keyword is the default and specifies the default value for the optional keyword.

4.1.1 Types

The primary types used within the DSEL are derived from the thread-pool type.

1. Thread pools can be composed with various subtypes that could be used to fundamentally affect the implementation and performance of any client software:

   thread-pool-type:
       thread_pool work-policy size-policy pool-adaptor

   • A thread pool would contain a collection of threads that may be more, fewer or the same as the number of processors on the target architecture. This allows implementations to virtualize the multiple cores available or to make use of operating-system-provided thread implementations. An implementation may choose to enforce a synchronization of all threads within the pool once an instance of that pool is destroyed, to ensure that threads managed by the pool are appropriately destroyed and that work in the process of mutation is appropriately terminated.

   work-policy: one of
       worker_threads_get_work one_thread_distributes

   • The library should implement the classic work-stealing or master-slave work-sharing algorithms.
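A worker_threads_get_work policy implies an internal queue of unprocessed work inside the thread_pool from which the workers pull. As a sketch of how the addition of work can be made independent of its removal, consider a classic two-lock queue in the style of Michael and Scott, where push() takes only a tail lock and try_pop() only a head lock. This is purely illustrative and not part of the DSEL's specified API; all names are invented:

```cpp
#include <memory>
#include <mutex>

// Illustrative two-lock queue: producers contend only on tail_mtx_ and
// consumers only on head_mtx_, so adding work is independent of removing it.
template <typename T>
class two_lock_queue {
    struct node {
        T value{};
        std::unique_ptr<node> next;
    };
    std::unique_ptr<node> head_;  // owns a dummy node when empty
    node* tail_;
    std::mutex head_mtx_, tail_mtx_;

    node* get_tail() {
        std::lock_guard<std::mutex> g(tail_mtx_);
        return tail_;
    }

public:
    two_lock_queue() : head_(new node), tail_(head_.get()) {}

    // Called by the thread transferring work into the pool.
    void push(T v) {
        std::unique_ptr<node> n(new node);  // the new dummy tail
        node* new_tail = n.get();
        std::lock_guard<std::mutex> g(tail_mtx_);
        tail_->value = std::move(v);        // fill the old dummy
        tail_->next = std::move(n);
        tail_ = new_tail;
    }

    // Called by worker threads; returns false if no work is available.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> g(head_mtx_);
        if (head_.get() == get_tail()) return false;  // empty
        out = std::move(head_->value);
        head_ = std::move(head_->next);     // advance past the old head
        return true;
    }
};
```

Because the head and tail locks are distinct, a thread distributing work and a worker removing it rarely block each other, which is the property the work-policy discussion above relies upon.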
Clearly the specific implementation of these could affect the internal queue containing unprocessed work within the thread_pool. For example, a worker_threads_get_work queue might be implemented such that the addition of work would be independent of the removal of work.

   size-policy: one of
       fixed_size tracks_to_max infinite

   • The size-policy, when used in combination with the threading-model, could be used to make considerable simplifications in the implementation of the thread-pool-type, which could make it faster on certain architectures.
   • tracks_to_max would implement some model of the cost of re-creating and maintaining threads. If threads were cheap to create and destroy, with little overhead, then an infinite size might be a reasonable approximation; conversely, threads with the opposite characteristics might be better maintained in a fixed_size pool.

   pool-adaptor:
       joinability api-type threading-model priority-mode_opt comparator_opt GSS(k)-batch-size_opt

   joinability: one of
       joinable nonjoinable

   • The joinability has been provided to allow certain optimizations to be implementable. A thread-pool-type that is nonjoinable could have a number of simplifying details that would make it not only easier to implement but also faster in operation.

   api-type: one of
       no_api MS_Win32 posix_pthreads IBM_cyclops

   • Both MS_Win32 and posix_pthreads are examples of heavyweight_threading APIs, in which threading at the OS level would be made use of to implement the DSEL. IBM_cyclops would be an implementation of the DSEL using the lightweight_threading API implemented by IBM BlueGene/C Cyclops [1].

   threading-model: one of
       sequential_mode heavyweight_threading lightweight_threading

   • This specifier provides a coarse representation of the various implementations of threadable constructs in the multitude of architectures available. For example, Pthreads would be considered to be heavyweight_threading, whereas Cyclops would be lightweight_threading. Separating the threading model from the API allows for the possibility that there may be multiple threading APIs on the same platform, which may have different properties; for example, if there were a GPU available in a multi-core computer, there could be two different threading models within the same program.
   • The sequential_mode has been provided to allow implementations to remove all threading aspects of the implementing library, which would hugely reduce the burden on the programmer of identifying bugs within their code. If all threading is removed, then all remaining bugs should, in principle, reside in the user code which, once determined to be bug-free, could then be trivially parallelized by modifying this single specifier and re-compiling. Any further bugs introduced would then be due to bugs within the parallel aspects of their code, or within the library implementing this DSEL. If the user relies upon the library to provide threading, then there should be no further bugs in their code. We consider this feature of paramount importance, as it directly addresses the complex task of debugging parallel software, by separating the algorithm by which the parallelism should be implemented from the code implementing the mutations on the data.

   priority-mode: one of
       normal_fifo_def prioritized_queue

   • This is an optional parameter. The prioritized_queue would allow the user to specify whether specific instances of work to be mutated should be performed ahead of other instances of work, according to a user-specified comparator.

   comparator:
       std::less_def

   • A binary function-type that specifies a strict weak ordering on the elements within the prioritized_queue.

   GSS(k)-batch-size:
       1_def

   • A natural number specifying the batch-size to be used within the queue specified by the priority-mode. The default is 1, i.e. no batching would be performed. An implementation would be likely to use this to enable GSS(k) scheduling [9].

2. Adapted collections assist in providing thread-safety and also specify the memory-access model of the collection:

   safe-colln:
       safe_colln collection-type lock-type

   • This adaptor wraps the collection-type and an instance of lock-type in one object, and provides a few thread-safe operations upon that collection, plus access to the underlying collection. This access might seem surprising, but it has been allowed because locking the operations on collections has been shown not to be composable, and cross-cuts both object-orientated and functional-decomposition designs. This could be open to misuse, but otherwise excessive locking would have
to be done in user code. This has not been an ideal design decision, but a simple one, with scope for future work. Note that this design choice within the DSEL does not invalidate the rest of the grammar, as it would just affect the overloads of the data-parallel algorithms, described later.
   • The adaptor also provides access to both read-lock and write-lock types, which may be the same, but allow the user to specify the intent of their operations more clearly.

   lock-type: one of
       critical_section_lock_type read_write read_decaying_write

   (a) A critical_section_lock_type would be a single-reader, single-writer lock, a simulation of EREW semantics. The implementation of this type of lock could be more efficient on certain architectures.
   (b) A read_write lock is a multi-readers, single-writer lock, a simulation of CREW semantics.
   (c) A read_decaying_write lock would be a specialization of a read_write lock that also implements atomic transformation of a write-lock into a read-lock.
   (d) The lock should be used to govern the operations on the collection, and not operations on the items contained within the collection.

   • The lock-type parameter may be used to specify whether EREW or CREW operations upon the collection are allowed. For example, if only EREW operations are allowed, then overlapped dereferences of the execution_contexts resulting from parallel-algorithms operating upon the same instance of a safe-colln should be strictly ordered by an implementation to ensure that EREW semantics are maintained. Alternatively, if CREW semantics were specified, then an implementation may allow read-operations upon the same instance of the safe-colln to occur in parallel, assuming they were not blocked by a write operation.

   collection-type:
       A standard collection, such as an STL-style list or vector, etc.

3. The thread-pool-type defines further sub-types for convenience to the programmer:

   create_direct:
       This adaptor, parametrized by the type of work to be mutated, contains certain sub-types. The input data and the mutation operation combined are termed the work to be mutated, which would be a type of closure. If the mutation operation does not change the state of any data external to the closure, then this would be a type of monad. More specifically, this work to be mutated should also be a type of functor that either:
       (a) provides a type result_type to access the result of the mutation, and specifies the mutation member-function, or
       (b) implements the function process(result_type &), in which case the library may determine the actual type of result_type.

   The sub-types are:

   joinable:
       A method of transferring work to be mutated into an instance of thread-pool-types. If the work to be mutated were transferred using this modifier, then the return result of the transfer would be an execution_context, which may subsequently be used to obtain the result of the mutation. Note that this implies that the DSEL implements a form of data-flow operation.

   execution_context:
       This is the type of future that a transfer returns. It is also a type of proxy to the result_type that the mutation returns. Access via this proxy implicitly causes the calling thread to wait until the mutation has been completed. This is the other component of the DSEL that implements the data-flow model. Various sub-types of execution_context exist, specific to the result_types of the various operations that the DSEL supports. Note that the implementation of execution_context should specifically prohibit aliasing, copying and assigning instances of these types.

   nonjoinable:
       Another method of transferring work to be mutated into an instance of thread-pool-types. If the work to be mutated were transferred using this modifier, then the transfer would return nothing. The mutation within the pool would occur at some indeterminate time, the result of which would, for example, be detectable by any side effects of the mutation within the result_type of the work to be mutated.

   time_critical:
       This modifier ensures that when the work is mutated by a thread within an instance of thread-pool-type into which it has been transferred, it will be executed at an implementation-defined higher kernel priority. Other similar modifiers exist in the DSEL for other kernel priorities. This example demonstrates that specifying other modifiers, as extensions to the DSEL, would be possible.

   cliques(natural_number n):
       This modifier is used with data-parallel-algorithms. It causes the instance of thread-pool-type to allow the data-parallel-algorithm to operate with p/n threads, where p is the number of threads in the instance.

4. The DSEL specifies a number of other utility types, such as shared_pointer, various exception types and exception-management adaptors, amongst others. The details of these important, but ancillary, types have been omitted for brevity.
4.1.2 Operators on the thread-pool-type

The various operations that are defined in the DSEL will now be given. These operations tie together the types and express the restrictions upon the generation of the control-flow graph that the DSEL may create.

1. The transfer of work to be mutated into an instance of thread-pool-type is defined as follows:

   transfer-future:
       execution-context-result_opt thread-pool-type transfer-operation

   execution-context-result:
       execution_context <<

   • The token sequence "<<" is the transfer operation, and is also used in the definition of the transfer-modifier-operation, amongst other places.
   • Note how an execution_context can only be created via a transfer of work to be mutated into a suitably defined thread_pool. It is an error to transfer work into a thread_pool that has been defined using the nonjoinable subtype. There is no way to create an execution_context without transferring work to be mutated, so every execution_context is guaranteed to eventually contain the result of a mutation.

   transfer-operation:
       transfer-modifier-operation_opt transfer-data-operation

   transfer-modifier-operation:
       << transfer-modifier

   transfer-modifier: one of
       time_critical joinable nonjoinable cliques

   transfer-data-operation:
       << transfer-data

   transfer-data: one of
       work-to-be-mutated parallel-binary-operation data-parallel-algorithm

The details of the various parallel-binary-operations and data-parallel-algorithms will be given in the next section.

4.1.3 The Data-Parallel Operations and Algorithms

This section will describe the various parallel algorithms defined within the DSEL.

1. The parallel-binary-operations are defined as follows:

   parallel-binary-operation: one of
       binary_fun parallel-logical-operation

   parallel-logical-operation: one of
       logical_and logical_or

   • It is likely that an implementation would not implement the usual short-circuiting of the operands, to allow them to be transferred into the thread pool and executed in parallel.

2. The data-parallel-algorithms are defined as follows:

   data-parallel-algorithm: one of
       accumulate copy count count_if fill fill_n find find_if for_each min_element max_element reverse transform

   • The style and arguments of the data-parallel-algorithms are similar to those of the STL in the C++ ISO Standard. Specifically, they all take a safe-colln as the argument to specify the ranges, and functors as necessary, as specified within the STL. Note that these algorithms all use run-time-computed bounds; otherwise it would be more optimal to use techniques similar to those used in HPF, or described in [9], to parallelize such operations. Whether the DSEL supports loop-carried dependencies in the functor argument is undefined.
   • If the algorithms were implemented using the techniques described in [7] and [4], then they would be optimal, with O(log(p)) complexity in distributing the work to the thread pool. Given that there are no loop-carried dependencies, each thread may operate independently upon a sub-range within the safe-colln, for an optimal algorithmic complexity of O(n/p − 1 + log(p)), where n is the number of items to be computed, p is the number of threads, and the operation time of the mutations is ignored.

3. The binary_funs are defined as follows:

   binary_fun:
       work-to-be-mutated work-to-be-mutated binary_functor

   • A binary functor is just a functor that takes two arguments. The order of evaluation of the arguments is undefined. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4. Similarly, the logical operations are defined as follows:

   logical-operation:
       work-to-be-mutated work-to-be-mutated binary_functor

   • Note that no short-circuiting of the computation of the arguments occurs. The result of mutating the arguments must be boolean. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4.2 Properties of the DSEL

In this section some results will be presented that derive from the definitions; the first will demonstrate that the CFG (Control-Flow Graph) would be a tree, from which the other useful results directly derive.

Theorem 1. Using the DSEL described above, the parallel control-flow graph of any program that may use a conforming implementation of the DSEL must be an acyclic directed graph, comprised of at least one singly-rooted tree, but may contain multiple singly-rooted, independent trees.
Proof. From the definitions of the DSEL, the transfer of work to be mutated into the thread_pool may be done only once, according to the definition of transfer-future, the result of which returns a single execution_context, according to the definition of execution-context-result, which is the only defined way to create execution_contexts. This implies that, from a node in the CFG, each transfer to the thread-pool-type represents a single forward-edge connecting the execution_context with the child-node that contains the mutation. The back-edge from the mutation to the parent-node is the edge connecting the result of the mutation with the dereference of the execution_context. The execution_context and the dereference occur in the same node, because execution_contexts cannot be passed between nodes, by definition. In summary: the parent-node has an edge from the execution_context it contains to the mutation, and a back-edge to the dereference in that parent-node. Each node may perform none, one or more transfers, resulting in none, one or more child-nodes. A node with no children is a leaf-node, containing only a mutation. Now back-edges to multiple parent nodes cannot be created, according to the definition of execution_context, because execution_contexts can be neither aliased nor copied between nodes. So the only edges in this sub-graph are the forward and back edges between parent and children. Therefore the sub-graph is not only acyclic, but a tree. Due to the definitions of transfer-future and execution-context-result, the only way to generate mutations is via the above technique, so each child-node either returns via the back-edge immediately or generates a further sub-tree attached to the larger tree that contains its parent. Now, if the entry-point of the program is the single thread that runs main(), i.e. the single root, this can only generate a tree, and each node in the tree can only return or generate a tree, so the whole CFG must be a tree. If there were more entry-points, each one can only generate a tree per entry-point, as the execution_contexts can be neither aliased nor copied between nodes, by definition.

From the above theorem, one may appreciate that a conforming implementation of the DSEL would implement data-flow in software.

Theorem 2. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, then they can be guaranteed to have a schedule free of race-conditions.

Proof. A race-condition occurs when two threads attempt to access the same data at the same time. A race-condition in the CFG would be represented by a child node with two parent nodes, with forward-edges connecting the parents to the child. The CFG must be an acyclic tree according to theorem 1, so this sub-graph cannot occur in a tree, and therefore the schedule must be race-condition free.

Theorem 3. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of deadlocks.

Proof. A deadlock may be defined as: threads A and B wait on atomic objects C and D, such that A locks C and waits upon D to unlock C, while B locks D and waits upon C to unlock D. In terms of the DSEL, this implies that execution_contexts C and D are shared between two threads, i.e. that an execution_context has been passed from a node A to a sibling node B, and vice versa. But aliasing execution_contexts has been explicitly forbidden in the DSEL by definition 3.

Corollary 1. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of race-conditions and deadlocks.

Proof. It must be proven that theorems 2 and 3 are not mutually exclusive. Suppose that a CFG exists that satisfies 2 but not 3. Then there must be either an edge formed by aliasing an execution_context, or a back-edge from the result of a mutation back to a dereference of an execution_context. The former has been explicitly forbidden in the DSEL by the definition of the execution_context (3), the latter by the definition of transfer-future (1). Both are a contradiction, therefore such a CFG cannot exist, and any conforming CFG must satisfy both theorems 2 and 3.

Theorem 4. If the user refrains from using any other threading-related items or atomic objects other than those defined in the DSEL above, then the schedule of work to be mutated by a conforming implementation of the DSEL would be executed in time taking at least an algorithmic complexity of O(log(p)) and at most O(n) units of time to mutate the work, where n is the number of work items to be mutated on p processors. The algorithmic order of the minimal time would be poly-logarithmic, so within NC, therefore at least optimal.

Proof. The schedule must be a tree according to theorem 1, with at most n leaf-nodes, and each node takes at most O(n/p − 1 + log(p)) computations according to the definition of the parallel-algorithms. Also, it has been proven in [7] that distributing n items of work onto p processors may be performed with an algorithmic complexity of O(log(n)). The fastest computation time would occur if the schedule were a balanced tree, where the computation time would be the depth of the tree, i.e. O(log(n)) in the same units. If the n items of work were greater in number than the p processors, then O(log(p)) ≤ O(log(n)), so the computation time would be slower than O(log(p)). The slowest computation time would occur if the tree were a chain, i.e. O(n) time. In these cases this implies that a conforming implementation should add at most a constant order to the execution time of the schedule.

4.3 Some Example Usage

These are two toy examples, based upon an implementation in [12], of how the above DSEL might appear. The first example is a data-flow example showing how the DSEL could be used to mutate some work on a thread within the thread pool, effectively demonstrating how the future would be waited upon. Note how the execution_context has been created via the transfer of work into the thread_pool.

Listing 1: Data-flow example of a Thread Pool and Future.
Listing 1: Data-flow example of a Thread Pool and Future.

  struct res_t {
    int i;
  };
  struct work_type {
    void process(res_t &) {}
  };
  typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
      generic_traits::joinable,
      platform_api,
      heavyweight_threading
    >
  > pool_type;
  typedef pool_type::create_direct<work_type> creator_t;
  typedef creator_t::execution_context execution_context;
  typedef creator_t::joinable joinable;
  pool_type pool(2);
  execution_context context(pool<<joinable()<<work_type());
  context->i;

The typedefs in this example implementation of the grammar are complex, but the typedef for the thread-pool-type would only be needed once and, reasonably, could be held in a configuration trait in a header file.

The second example shows how a data-parallel version of the C++ accumulate algorithm might appear.

Listing 2: Example of a parallel version of an STL algorithm.

  typedef ppd::thread_pool<
    pool_traits::worker_threads_get_work,
    pool_traits::fixed_size,
    pool_adaptor<
      generic_traits::joinable,
      platform_api,
      heavyweight_threading,
      pool_traits::normal_fifo,
      std::less, 1
    >
  > pool_type;
  typedef ppd::safe_colln<
    vector<int>,
    lock_traits::critical_section_lock_type
  > vtr_colln_t;
  typedef pool_type::accumulate_t<
    vtr_colln_t
  >::execution_context execution_context;
  vtr_colln_t v;
  v.push_back(1); v.push_back(2);
  execution_context context(
    pool<<joinable()<<pool.accumulate(
      v, 1, std::plus<vtr_colln_t::value_type>()
    )
  );
  assert(*context==4);

All of the parameters have been specified in the thread-pool-type to demonstrate the appearance of the typedef. Note that the example illustrates a map-reduce operation; an implementation might:

1. take sub-ranges within the safe_colln,

2. which would be distributed across the threads within the thread_pool,

3. the mutations upon each element within each sub-range would be performed sequentially, their results combined via the accumulator functor, without locking any other thread's operation,

4. these sub-results would be combined with the final accumulation, in this case the implementation providing suitable locking to avoid any race-condition,

5. the total result would be made available via the execution_context.

Moreover the size of the input collection should be sufficiently large, or the time taken to execute the operation of the accumulator so long, that the cost of the above operations would be reasonably amortized.

5. CONCLUSIONS

The goals of the paper have been achieved: a DSEL has been formulated:

• that may be used to express general-purpose parallelism within a language,

• that ensures that there are no deadlocks or race-conditions within the program if the programmer restricts themselves to using the constructs of the DSEL,

• and that does not preclude implementing optimal schedules on a CREW-PRAM or EREW-PRAM computation model.

Intuition suggests that this result should have come as no surprise considering the work done relating to auto-parallelizing compilers, which work within the ASTs and CFGs of the parsed program [17].

It is interesting to note that the results presented here would be applicable to all programming languages, compiled or interpreted, and that one need not be forced to re-implement a compiler. Moreover the DSEL has been designed to directly address the issue of debugging any such parallel program, directly addressing this problematic issue. Further advantages of this DSEL are that programmers would not need to learn an entirely new programming language, nor would they have to change to a novel compiler implementing the target language, which may not be available, or, if it were, might be impossible to use for more prosaic business reasons.

6. FUTURE WORK

There are a number of avenues that arise which could be investigated; for example, a conforming implementation of the DSEL could be presented, such as [12]. The properties of such an implementation could then be investigated by reimplementing a benchmark suite, such as SPEC2006 [15], and comparing and contrasting the performance of that implementation versus the literature. The definition of safe_colln has not been an optimal design decision; a better approach would have been to define ranges that support locking upon the underlying collection. Extending the DSEL to admit memoization could also be investigated, such that a conforming implementation might implement not only inter- but intra-procedural analysis.

7. REFERENCES

[1] Almasi, G., Cascaval, C., Castanos, J. G., Denneau, M., Lieber, D., Moreira, J. E., and Henry S. Warren, J. Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News 31, 1 (2003), 26–38.

[2] Bischof, H., Gorlatch, S., Leshchinskiy, R., and Müller, J. Data Parallelism in C++ Template Programs: a Barnes-Hut Case Study. Parallel Processing Letters 15, 3 (2005), 257–272.

[3] Burger, D., Goodman, J. R., and Kagi, A. Memory Bandwidth Limitations of Future Microprocessors. In ISCA (1996), pp. 78–89.

[4] Casanova, H., Legrand, A., and Robert, Y. Parallel Algorithms. Chapman & Hall/CRC Press, 2008.
[5] El-Ghazawi, T. A., Carlson, W. W., and Draper, J. M. UPC language specifications v1.1.1. Tech. rep., 2003.

[6] Giacaman, N., and Sinnen, O. Parallel iterator for parallelising object oriented applications. In SEPADS'08: Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems (Stevens Point, Wisconsin, USA, 2008), World Scientific and Engineering Academy and Society (WSEAS), pp. 44–49.

[7] Gibbons, A., and Rytter, W. Efficient Parallel Algorithms. Cambridge University Press, New York, NY, USA, 1988.

[8] ISO. ISO/IEC 14882:2011 Information technology — Programming languages — C++. International Organization for Standardization, Geneva, Switzerland, Feb. 2012.

[9] Kennedy, K., and Allen, J. R. Optimizing Compilers for Modern Architectures: a Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.

[10] Leiserson, C. E. The Cilk++ concurrency platform. J. Supercomput. 51, 3 (Mar. 2010), 244–257.

[11] McGuiness, J. M. Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures. Master's thesis, University of Hertfordshire, Hatfield, Hertfordshire, UK, July 2006.

[12] McGuiness, J. M. libjmmcg - implementing PPD. libjmmcg.sourceforge.net, July 2009.

[13] McGuiness, J. M., Egan, C., Christianson, B., and Gao, G. The Challenges of Efficient Code-Generation for Massively Parallel Architectures. In Asia-Pacific Computer Systems Architecture Conference (2006), pp. 416–422.

[14] Pheatt, C. Intel® threading building blocks. J. Comput. Small Coll. 23, 4 (2008), 298–298.

[15] Reilly, J. Evolve or Die: Making SPEC's CPU Suite Relevant Today and Tomorrow. In IISWC (2006), p. 119.

[16] Snelling, D. F., and Egan, G. K. A Comparative Study of Data-Flow Architectures. Tech. Rep. UMCS-94-4-3, 1994.

[17] Tang, X. Compiling for Multithreaded Architectures. PhD thesis, University of Delaware, Delaware, USA, Fall 1999.

[18] Tian, X., Chen, Y.-K., Girkar, M., Ge, S., Lienhart, R., and Shah, S. Exploring the Use of Hyper-Threading Technology for Multimedia Applications with Intel® OpenMP Compiler. In IPDPS (2003), p. 36.

[19] Tvrdik, P. Topics in parallel computing - PRAM models. http://pages.cs.wisc.edu/~tvrdik/2/html/Section2.html, January 1999.

[20] Virding, R., Wikström, C., and Williams, M. Concurrent Programming in ERLANG (2nd ed.). Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1996.