2. Acknowledgements Work on Deformable Mesh Abstractions is joint work with Geeta Iyer and Sriram Kailasam. Work on Edge Node File Systems is joint work with Kovendhan. Work on Deformable Mesh Abstractions is funded by Yahoo Research.
3. Introduction Cloud computing provides pay-for-use access to compute and storage resources over the Internet. Smart applications embed intelligence within the application itself (e.g., recommender systems). Computation, data requirements, and algorithms are becoming increasingly complex. Popular programming models for the cloud: MapReduce, Dryad. Are these the right abstractions for smart apps?
4. MapReduce Origins Primary motivation: to facilitate indexing-, searching-, and sorting-like operations on massive datasets over large collections of resources. Inspired by the map and reduce primitives in LISP. Computations are performed on key-value pairs to generate intermediate key-value pairs, and all values with the same key are reduced together. The runtime is responsible for parallelizing the map and reduce tasks and handles the other low-level details.
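The key-value contract described above can be sketched in a few lines. This is an illustrative single-process simulation (not Google's or Hadoop's API): `map_fn` emits intermediate (key, value) pairs, the shuffle is simulated with a sort plus `groupby`, and `reduce_fn` folds all values sharing a key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Fold all counts that share the same key.
    yield (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = [kv for key, val in records for kv in map_fn(key, val)]
    intermediate.sort(key=itemgetter(0))            # simulated shuffle/sort phase
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return dict(output)

print(run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn))
# {'a': 2, 'b': 2, 'c': 1}
```

In the real frameworks the runtime distributes the map and reduce tasks and performs the shuffle over the network; only the two user functions remain.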
5. Limitations and Proposed Extensions Limitations of the original MR model: Input/output is restricted to key-value pairs. Jobs are loosely synchronized (no connected computation). No support for iteration or recursion. No direct support for multiple inputs to a job. Optimized for batch processing. Different nodes are assumed to perform work at roughly the same rate, with the inherent assumption that all tasks require the same amount of time. Extensions: Iterative MR: adds support for iterations; relies on long-running MapReduce tasks and on streaming data between iterations. Spark: supports iterations and interactive queries. Each iteration is handled as a separate MapReduce job, incurring job-submission overheads. Streaming makes fault tolerance difficult.
6. Basic Database Operations Projection, Selection, Aggregation; Join, Cartesian product, Set operations. Only the unary operations can be directly modeled with the original MapReduce framework. There is no direct support for operations over multiple, possibly heterogeneous input data sources; they can be handled indirectly by chaining extra MapReduce steps.
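The chaining workaround above can be illustrated with a reduce-side equi-join, the standard trick for a binary operation in MapReduce: tag each record with its source relation in an extra map step, then pair rows sharing the join key in the reducer. All names here are illustrative, not any framework's API.

```python
from collections import defaultdict

def map_tagged(source_tag, records, key_index):
    # Extra map step: tag each record with its source relation.
    for rec in records:
        yield (rec[key_index], (source_tag, rec))

def reduce_join(key, tagged_rows):
    # Pair every left row with every right row sharing the join key.
    left = [r for tag, r in tagged_rows if tag == "L"]
    right = [r for tag, r in tagged_rows if tag == "R"]
    for l in left:
        for r in right:
            yield (key, l, r)

def join(left_rows, right_rows):
    buckets = defaultdict(list)                 # simulated shuffle by join key
    for k, v in map_tagged("L", left_rows, 0):
        buckets[k].append(v)
    for k, v in map_tagged("R", right_rows, 0):
        buckets[k].append(v)
    out = []
    for k in sorted(buckets):
        out.extend(reduce_join(k, buckets[k]))
    return out

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(join(users, orders))
# [(1, (1, 'alice'), (1, 'book')), (1, (1, 'alice'), (1, 'pen'))]
```

The cost of the workaround is visible: an extra tagging pass and a full shuffle, which Dryad-style systems avoid by supporting multiple inputs natively.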
7. Dryad & DryadLINQ Motivated primarily by parallel databases. Makes the communication graph explicit. The execution graph is expressed as a Directed Acyclic Graph (DAG). DryadLINQ allows computations to be expressed in terms of LINQ operators (similar to SQL operators), automatically parallelized by the Dryad execution engine. Supports multiple datasets and runtime optimization of the complete execution graph.
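The explicit-DAG idea can be sketched as a tiny scheduler: vertices are functions, edges carry upstream outputs downstream, and execution follows a topological order. This is a toy illustration of the model, not Dryad's actual interface.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, edges):
    # Build each vertex's predecessor set from the explicit edge list.
    deps = {v: set() for v in vertices}
    for src, dst in edges:
        deps[dst].add(src)
    results = {}
    # Execute vertices in topological order; in Dryad, independent
    # vertices would run in parallel on different machines.
    for v in TopologicalSorter(deps).static_order():
        inputs = [results[u] for u in sorted(deps[v])]
        results[v] = vertices[v](*inputs)
    return results

vertices = {
    "read": lambda: [3, 1, 2],
    "sort": lambda xs: sorted(xs),
    "sum":  lambda xs: sum(xs),
    "out":  lambda srt, total: (total, srt),   # inputs arrive in sorted vertex-name order
}
edges = [("read", "sort"), ("read", "sum"), ("sort", "out"), ("sum", "out")]
print(run_dag(vertices, edges)["out"])
# (6, [1, 2, 3])
```

Because the whole graph is known up front, a runtime can rewrite it before execution (e.g., fuse vertices or insert aggregation trees), which is exactly the runtime optimization the slide refers to.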
8. Limitations Lacks support for recursively spawning new tasks as the computation proceeds. Adaptive computations such as AI planning and branch-and-bound cannot be supported directly.
10. Different nodes executing in parallel need to communicate; this requires support for a shared communication model.
13. Real-world graphs may not be captured well by hash-based partitioning; alternate partitioning schemes are needed. Classes of Applications: AI planning, decision tree algorithms, association rule mining, recommender systems, data mining, graph algorithms, clustering algorithms.
14. Deformable Mesh Abstraction Focus: a new programming model targeted at the wider class of applications that cannot be modeled efficiently using existing frameworks, while at the same time supporting MapReduce-like computations efficiently. Brings out a clear separation between programmer-expressibility issues and runtime-environment issues.
34. Heuristic-guided Problem Solving General methodology (e.g., AI planning): a set of actions is evaluated in parallel on the problem state. Newly generated states are inserted into a priority queue, ordered by a heuristic value. The best state is selected from the queue for further processing. The iteration continues until the goal state is reached. Requirements: the state of the queue must be preserved across iterations; on-the-fly evaluation of the termination condition decides the number of iterations.
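The loop above is classic best-first search. A minimal sequential sketch on a toy problem (reach a goal integer via +1 / *2 actions; the problem and all names are illustrative) shows the two requirements: the priority queue persists across iterations, and the termination test runs on the fly.

```python
import heapq

def best_first(start, goal):
    heuristic = lambda s: abs(goal - s)
    # Priority queue of (heuristic value, state, path); persists across iterations.
    queue = [(heuristic(start), start, [start])]
    seen = {start}
    while queue:
        _, state, path = heapq.heappop(queue)   # dequeue the best state
        if state == goal:                       # on-the-fly termination check
            return path
        # Evaluate the set of actions on the state (in parallel in DMA).
        for nxt in (state + 1, state * 2):
            if nxt <= goal * 2 and nxt not in seen:
                seen.add(nxt)
                # Enqueue new states ordered by heuristic value.
                heapq.heappush(queue, (heuristic(nxt), nxt, path + [nxt]))
    return None

print(best_first(1, 10))
# [1, 2, 4, 8, 9, 10]
```

In the distributed setting the action evaluations become parallel Solve tasks and the heap becomes a distributed priority queue, which is exactly the structure of the Sapa case study that follows.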
35. Case Study 1: Sapa Planner* [Diagram: starting from the current state and all actions, a Split fans out one (current state, applicable action) pair per Solve task (Solve1 … Solven); each Solve evaluates its action on the current state and computes a heuristic; communication performs enqueue() into a distributed priority queue sorted by heuristic value; a Combine step performs dequeue() to select the next state, and the cycle repeats.] *M. B. Do and S. Kambhampati, “Sapa: a multi-objective metric temporal planner,” J. Artif. Int. Res., vol. 20, no. 1, pp. 155–194, 2003.
36. Modeling Sapa Planner using DMA Solve tasks are assigned to different machines, which evaluate actions on a particular state in parallel. Recursive split is facilitated through an invokeSplit() call from within Combine. Preliminary result: the Split, Solve, and Combine operations are modeled with minimal modification of the sequential planner code. The shared information required for the Split, Solve, and Combine operations is loaded only once on the different machines, thus avoiding recursive-split overheads.
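The Split/Solve/Combine cycle with a recursive split from within Combine can be sketched as follows. This is a sequential toy (same +1 / *2 search problem as before); `split`, `solve`, `combine`, and `invoke_split` are illustrative stand-ins for the DMA interfaces, not the actual API, and the local heap stands in for the distributed priority queue.

```python
import heapq

def split(state, actions):
    # Fan out one (state, action) pair per Solve task.
    return [(state, a) for a in actions]

def solve(state, action, heuristic):
    # Evaluate one action on the state and compute its heuristic value.
    nxt = action(state)
    return (heuristic(nxt), nxt)

def combine(results, queue, goal, invoke_split):
    # Enqueue the new states into the shared priority queue, then either
    # terminate or recursively split on the best dequeued state.
    for item in results:
        heapq.heappush(queue, item)
    _, best = heapq.heappop(queue)
    if best == goal:
        return best
    return invoke_split(best)

def plan(start, goal, actions, heuristic):
    queue = []  # distributed priority queue in DMA; a local heap here
    def invoke_split(state):
        tasks = split(state, actions)
        # Solve tasks would run on different machines; sequential here.
        results = [solve(s, a, heuristic) for s, a in tasks]
        return combine(results, queue, goal, invoke_split)
    return invoke_split(start)

actions = [lambda s: s + 1, lambda s: s * 2]
print(plan(1, 10, actions, lambda s: abs(10 - s)))
# 10
```

The point of the sketch is the control flow: Combine does not return to a driver between iterations but directly triggers the next Split, which is what lets the runtime keep the shared state resident on the worker machines.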
37. Case Study 2: SGPlan4 Planner* [Diagram: the set of Goals is split into Subgoal1 … Subgoaln via constraint-based subgoal partitioning, each handled by a Solve task; each subgoal is split further (e.g., Subgoal11 … Subgoal1n) based on landmark analysis, with Solve tasks evaluating the actions applicable to the current state; a final split based on path optimization follows; Combine-and-evaluate steps merge the subplans and update the penalty values of the global constraints, with communication based on the global constraints; the producible resources are checked and the cycle repeats.] *Chen Y. X., Wah B. W., and Hsu C. W., “Temporal planning using subgoal partitioning and resolution in SGPlan,” J. of Artificial Intelligence Research, 2006.
38. Edge Node File System (ENFS)* ENFS Architecture: metadata management is distributed amongst supernodes, rather than centrally managed at a single namenode. *K. Ponnavaikko and D. Janakiram, “The edge node file system: A distributed file system for high performance computing,” Scalable Computing: Practice and Experience, vol. 10, pp. 111–114, 2009.
44. A supernode is responsible for maintaining the shared storage’s metadata, while the shared storage itself is distributed across the cluster nodes.
46. Extending DMA on Hadoop The clear separation between expressibility issues and runtime issues facilitates extending DMA to the Hadoop environment. Advantages: the DMA interfaces can exploit the efficient runtime provided by Hadoop, while at the same time a wider class of applications can be captured.
Federation of clusters. Clusters: sets of geographically proximal Autonomous Systems (AS); O(10^3) nodes per cluster. A dynamic set of relatively capable nodes, Supernodes, manage the resources within a cluster (devices, users, etc.) and portions of the file system namespace. Clusters are connected by a system-wide structured overlay.