Federation of clusters<br />Clusters: sets of geographically proximal Autonomous Systems (AS); O(10^3) nodes per cluster<br />A dynamic set of relatively capable nodes (supernodes) manages resources within a cluster (devices, users, etc.) and portions of the file system namespace<br />Clusters connected by a system-wide structured overlay<br />
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Smart Apps on Clouds" by D. Janakiram
Programming Abstractions for Smart Apps on Clouds<br />Prof. D. Janakiram,<br />Professor, Dept of CSE, <br />IIT, Madras<br />
Acknowledgements<br />Work on Deformable Mesh Abstractions is joint work with Geeta Iyer and Sriram Kailasam<br />Work on Edge Node File Systems is joint work with Kovendhan<br />Work on Deformable Mesh Abstractions is funded by Yahoo Research<br />
Introduction<br />Cloud computing: provides pay-per-use access to compute and storage resources over the Internet.<br />Smart applications: intelligence is embedded within the application (e.g. recommender systems).<br />Computation, data requirements and algorithms are becoming increasingly complex.<br />Popular programming models for the cloud: MapReduce, Dryad.<br />Are these the right abstractions for smart apps?<br />
MapReduce Origins<br />Primary motivation:<br />To facilitate indexing-, searching- and sorting-like operations on massive datasets across large collections of machines.<br />Inspired by the map and reduce primitives in LISP.<br />Requirement: perform computations on key-value pairs to generate intermediate key-value pairs, then reduce all values sharing the same key.<br />The runtime is responsible for parallelizing the map and reduce tasks, and handles other low-level details.<br />
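The key-value model above can be illustrated with a minimal word count in plain Python. This is only a sketch of the programming model (the shuffle step is done by the runtime in a real framework, not by user code):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one intermediate (key, value) pair per word.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group intermediate values by key (the runtime does this in real MR).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce all values sharing the same key to a single result.
    return key, sum(values)

docs = {1: "the cat sat", 2: "the dog sat"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```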
Limitations and Proposed Extensions<br />Limitations in the original MR model:<br />Input/output restricted to key-value pairs.<br />Jobs are loosely synchronized (no connected computation).<br />No support for iteration and recursion.<br />Does not directly support multiple inputs for a job.<br />Optimized for batch processing.<br />Different nodes are assumed to perform work at roughly the same rate.<br />Inherent assumption that all tasks require the same amount of time.<br />Extensions:<br />Iterative MR:<br />adds support for iterations<br />relies on long-running MapReduce tasks and streaming data between iterations<br />Spark:<br />supports iterations and interactive queries.<br />Each iteration is handled as a separate MapReduce job, incurring job-submission overheads.<br />Streaming makes fault tolerance difficult.<br />
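The per-iteration job-submission overhead mentioned above can be made concrete with a toy driver loop. The sketch below (names invented; the "job" is a stand-in for a full MapReduce submission) shows why iterative workloads pay a fixed cost every round when each iteration is a fresh job:

```python
def run_mapreduce_job(values):
    # Stand-in for one full job submission: map each value toward the mean,
    # then reduce to a convergence measure. A real framework pays scheduling,
    # task-launch and distributed I/O costs for every such submission.
    mean = sum(values) / len(values)
    new_values = [(v + mean) / 2 for v in values]                 # map step
    delta = max(abs(a - b) for a, b in zip(values, new_values))   # reduce step
    return new_values, delta

values, iterations = [1.0, 5.0, 9.0], 0
while True:
    values, delta = run_mapreduce_job(values)  # one job per iteration
    iterations += 1
    if delta < 1e-6:  # termination condition evaluated outside the framework
        break
```

The driver, not the framework, holds the loop state and tests termination, which is exactly what extensions such as Iterative MR try to push into the runtime.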
Basic Database Operations<br />Projection<br />Selection<br />Aggregation<br />Join, Cartesian product, set operations<br />Only the unary operations can be modeled directly with the original MapReduce framework.<br />There is no direct support for operations over multiple, possibly heterogeneous, input data sources.<br />These can be handled indirectly by chaining extra MapReduce steps.<br />
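The standard indirect workaround for a binary operation like join is to tag each record with its source relation so both inputs flow through one key-value stream. A minimal sketch of such a reduce-side join (relation names are invented):

```python
from collections import defaultdict

# Two heterogeneous inputs: users(id, name) and orders(user_id, item).
users  = [(1, "asha"), (2, "ravi")]
orders = [(1, "book"), (1, "pen"), (2, "lamp")]

def map_tagged(users, orders):
    # Tag each record with its source relation so a single key-value
    # stream can carry both inputs through the shuffle.
    for uid, name in users:
        yield uid, ("U", name)
    for uid, item in orders:
        yield uid, ("O", item)

groups = defaultdict(list)
for key, value in map_tagged(users, orders):
    groups[key].append(value)

def reduce_join(key, values):
    # Pair every user record with every order record sharing the key.
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    for name in names:
        for item in items:
            yield name, item

joined = sorted(pair for k, vs in groups.items() for pair in reduce_join(k, vs))
# joined == [("asha", "book"), ("asha", "pen"), ("ravi", "lamp")]
```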
Dryad & DryadLINQ<br />Motivated primarily by parallel databases.<br />Makes the communication graph explicit.<br />The execution graph is expressed as a Directed Acyclic Graph (DAG).<br />DryadLINQ allows computations to be expressed in terms of LINQ operators (similar to SQL operators).<br />Automatically parallelized by the Dryad execution engine.<br />Supports multiple datasets and runtime optimization of the complete execution graph.<br />
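The explicit-DAG idea can be sketched in a few lines: each vertex names its predecessors, and the runtime executes vertices in dependency order, feeding each one its predecessors' outputs. This is only an illustration of the model (vertex names and operators are invented), not Dryad's actual API:

```python
# A tiny operator DAG; each vertex lists its dependencies and an operator.
# The dict is written in topological order, which run() relies on.
dag = {
    "read":   ([],         lambda: list(range(10))),
    "filter": (["read"],   lambda xs: [x for x in xs if x % 2 == 0]),
    "square": (["filter"], lambda xs: [x * x for x in xs]),
    "sum":    (["square"], lambda xs: sum(xs)),
}

def run(dag):
    # Execute each vertex after its predecessors, passing their results in.
    results = {}
    for vertex, (deps, op) in dag.items():
        results[vertex] = op(*(results[d] for d in deps))
    return results

out = run(dag)["sum"]  # squares of the even numbers 0..8 summed: 120
```

Because the whole graph is known up front, a real engine can rewrite or parallelize it before execution; the acyclicity is also what makes recursively spawning new vertices mid-run awkward, as the next slide notes.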
Limitations<br />Lacks support for recursively spawning new tasks as computation proceeds.<br />Adaptive computations like AI planning, branch-and-bound cannot be supported directly.<br />
Smart Apps<br />Key aspects/ requirements<br /><ul><li>As computation proceeds, search space expands with newly generated data; requires support for spawning new tasks on-the-fly.
Different nodes executing in parallel need to communicate; requires support for a shared communication model.
Data partitioning changes, as the computation proceeds.
Efficient support for fixed number of iterations or condition based termination.
Real-world graphs may not be captured well by hash-based partitioning; requires support for alternative partitioning schemes.</li></ul>Classes of Applications<br />AI planning<br />Decision tree algorithms<br />Association rule mining<br />Recommender systems<br />Data mining<br />Graph algorithms<br />Clustering algorithms<br />
Deformable Mesh Abstraction<br />Focus: <br />New programming model targeted towards wider applications that cannot be modeled efficiently using existing frameworks.<br />At the same time, support MapReduce-like computations efficiently.<br />Bring out clear separation between programmer expressibility issues and runtime environment issues.<br />
Heuristic-guided Problem Solving<br />General methodology (e.g. AI planning)<br />A set of actions is evaluated in parallel on the problem state.<br />Newly generated states are inserted into the queue, based on a heuristic value.<br />The best state is selected from the queue for further processing.<br />Iteration continues until the goal state is reached.<br />Requirements<br />The state of the queue needs to be preserved across iterations.<br />On-the-fly evaluation of the termination condition to decide the number of iterations.<br />
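The methodology above is classic best-first search. A minimal sequential sketch (the toy problem and helper names are invented; in the parallel setting the expansion of successors would be farmed out to Solve tasks and the queue would be distributed):

```python
import heapq

def best_first_search(start, expand, heuristic, is_goal):
    # Priority queue keyed by heuristic value; its state persists across
    # iterations, and the loop terminates on a condition evaluated
    # on-the-fly rather than after a fixed number of iterations.
    queue = [(heuristic(start), start)]
    seen = {start}
    while queue:
        _, state = heapq.heappop(queue)   # dequeue the best state
        if is_goal(state):
            return state
        for nxt in expand(state):         # evaluate applicable actions
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(queue, (heuristic(nxt), nxt))  # enqueue
    return None

# Toy problem: reach 20 from 0 using +3 / +5 actions.
goal = 20
result = best_first_search(
    0,
    expand=lambda s: [s + 3, s + 5],
    heuristic=lambda s: abs(goal - s),
    is_goal=lambda s: s == goal,
)
# result == 20
```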
Case Study: 1. Sapa Planner*<br />[Figure: the current state is split into sub-tasks, one per applicable action (Solve1 … Solven); each Solve evaluates its action on the current state, computes a heuristic value, and enqueues the resulting state into a distributed priority queue sorted by heuristic value; Combine dequeues the best state, which is selected for the next iteration, and the cycle repeats.]<br />*M. B. Do and S. Kambhampati, “Sapa: a multi-objective metric temporal planner,” J. Artif. Int. Res., vol. 20, no. 1, pp. 155–194, 2003.<br />
Modeling Sapa Planner using DMA<br />Solve tasks are assigned to different machines, which evaluate actions on a particular state in parallel.<br />Recursive split is facilitated through an invokeSplit() call from within Combine.<br />Preliminary result:<br />Split, Solve and Combine operations are modeled with minimal modification of the sequential planner code.<br />Shared information required for the Split, Solve and Combine operations is loaded only once on the different machines, thus avoiding recursive-split overheads.<br />
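The Split/Solve/Combine cycle for the planner can be sketched as below. This is a hypothetical illustration of the pattern only: the function names, signatures and toy problem are invented, and the real DMA API is not shown in the talk. In DMA the Solve calls would run on different machines and Combine would trigger invokeSplit() on the selected state.

```python
import heapq

def split(state, actions):
    # Split: one sub-task per action applicable to the current state.
    return [(state, a) for a in actions]

def solve(state, action, heuristic):
    # Solve: evaluate the action on the state and score the successor.
    nxt = action(state)
    return (heuristic(nxt), nxt)

def combine(results, queue):
    # Combine: enqueue successors into the shared priority queue, then
    # dequeue the best state for the next iteration (in DMA, the recursive
    # split on that state would be invoked here).
    for item in results:
        heapq.heappush(queue, item)
    return heapq.heappop(queue)[1]

# Toy planning problem: reach 7 from 0 with +1 / +2 actions.
goal = 7
actions = [lambda s: s + 1, lambda s: s + 2]
heuristic = lambda s: abs(goal - s)

state, queue = 0, []
while state != goal:
    results = [solve(s, a, heuristic) for s, a in split(state, actions)]
    state = combine(results, queue)
```

Note how the priority queue `queue` outlives each iteration, matching the requirement that the queue's state be preserved across iterations.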
Case Study: 2. SGPlan4 Planner*<br />[Figure: the goals are first split into subgoals (Subgoal1 … Subgoaln) by constraint-based subgoal partitioning; each subgoal is split again based on landmark analysis, and then based on path optimization; Solve tasks evaluate the actions applicable on the current state, communicating through global constraints; Combine-and-evaluate steps combine the subplans, update the penalty values of the global constraints, check the producible resources, and repeat.]<br />*Y. X. Chen, B. W. Wah and C. W. Hsu, “Temporal planning using subgoal partitioning and resolution in SGPlan,” J. of Artificial Intelligence Research, 2006.<br />
Edge Node File System (ENFS)*<br />ENFS Architecture<br />Metadata management is distributed amongst supernodes, rather than centrally managed at a namenode.<br />*K. Ponnavaikko and D. Janakiram, “The edge node file system: A distributed file system for high performance computing,” Scalable Computing: Practice and Experience, vol. 10, pp. 111–114, 2009.<br />
Comparing execution time for Andrew Benchmark (AB) runs<br />
DMA Runtime<br /><ul><li>Supernode acts as the DMA coordinator, responsible for scheduling and monitoring the progress of submitted jobs.
Scheduling considers data locality and node capability information.
Underlying file system is responsible for providing fault tolerance support.
Supernode responsible for maintaining shared storage’s metadata, while the shared storage itself is distributed across cluster nodes.
Support for continuous aggregation during Combine() to minimize the effect of synchronization delay.</li></ul>Additional abstraction support<br />Multiple users simultaneously access the file, both to change its contents and to perform high-performance computations on them.<br />Up-to-date information about the changes is maintained without incurring overheads.<br />The compute-aggregate-recompute abstraction allows consistent computation on file contents.<br />recompute() is applied only on updated records, not on the entire file.<br />
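The compute-aggregate-recompute idea can be sketched as follows. This is an illustrative toy, assuming a per-record computation whose aggregate can be patched incrementally; the function names mirror the abstraction but their signatures are invented:

```python
def compute(record):
    # Some per-record computation (here: record length).
    return len(record)

records = {"r1": "hello", "r2": "world", "r3": "hadoop"}
partial = {rid: compute(text) for rid, text in records.items()}  # compute
total = sum(partial.values())                                    # aggregate

def recompute(updates):
    # Re-run the computation only for the updated records and patch the
    # aggregate, instead of scanning the entire file again.
    global total
    for rid, text in updates.items():
        total -= partial[rid]
        partial[rid] = compute(text)
        total += partial[rid]
        records[rid] = text

recompute({"r2": "clouds!"})
# total now reflects only the changed record: 5 + 7 + 6 = 18
```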
Extending DMA on Hadoop<br />The clear separation between expressibility issues and runtime issues facilitates extending DMA to the Hadoop environment.<br />Advantages<br />DMA interfaces can exploit the efficient runtime provided by Hadoop.<br />At the same time, a wider class of applications can be captured.<br />