Outline: Background, Main Points, Conclusion, References
Flexible and Efficient
Data Flow for MapReduce
Adam Martini
University of Oregon
martini@cs.uoregon.edu
Thursday 5th June, 2014
Flexible and Efficient Data Flow for MapReduce UO
What is MapReduce?
A framework popularized by Google circa 2004
Provides transparent scalability and fault tolerance for “embarrassingly parallel” tasks
Why do we need it?
Initially designed to solve PageRank scalability issues
Big Data is everywhere!
Programmers who are not parallel-computing experts need a way to efficiently manipulate very large datasets
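The model itself fits in a few lines. As a minimal in-process sketch (not any particular framework's API; all names here are illustrative), the canonical word-count job, simulating the map, shuffle, and reduce phases a real framework would distribute:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit an intermediate (word, 1) pair for every word in one input record.
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one key.
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2
```

The framework's job is everything this sketch hides: partitioning the input, scheduling map/reduce tasks near their data, and re-running tasks that fail.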
Strengths
1) Massively scalable
2) Can operate on commodity hardware
3) Transparency:
Data locality
Fault tolerance
Weaknesses
Flexibility
Data flow (and algorithm design) is forced to conform to an oversimplified computational model
Efficiency (with respect to data flow)
Programs handle data non-optimally ⇒ performance loss
Take Away
Improved flexibility can improve efficiency, but the converse does not hold
A focus on performance gains leads to a wide variety of application-specific MapReduce derivatives
Data Flow Improvements
Extensions to existing frameworks focus on more flexible workflow, while new frameworks for specific applications focus on improved data flow.
3 Main Directions:
Iterative MapReduce
Graph Dependent Models
Time Aware Models
Iterative MapReduce
Primarily inspired by machine learning applications that iteratively update a set of dynamic variables using a set of static data instances.
Why
Data IO is slow!
How
Hold invariant data in memory
Hold reductions in memory for convergence checking
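A minimal single-process sketch of the idea (the damped mean-estimation update here is purely illustrative, not any framework's algorithm): the invariant data is loaded once and stays in memory, while the per-iteration reduction is also held in memory to drive the convergence check.

```python
# Invariant data: loaded once, cached across all iterations.
data = [2.0, 4.0, 6.0, 8.0]

estimate = 0.0                         # the dynamic variable being updated
for iteration in range(100):
    # map: per-record contribution toward the new estimate
    contributions = [x - estimate for x in data]
    # reduce: held in memory so convergence can be checked cheaply
    update = sum(contributions) / len(data)
    estimate += 0.5 * update           # damped update of the dynamic variable
    if abs(update) < 1e-9:             # convergence check on the reduction
        break
# estimate converges to the mean of the data, 5.0
```

In plain MapReduce each of these iterations would be a full job, re-reading the static data from disk every time; avoiding that IO is the entire point.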
Iterative MapReduce: Frameworks
Spark – A very popular framework that supports lazy evaluation, efficient fault tolerance, and iterative machine learning tasks... much faster than Hadoop!
HaLoop – The first framework to introduce convergence-checking data optimizations
Twister – Iterative MapReduce with an optional combine step before reduction (also available in Hadoop)
Shredder/Incoop – A GPU-accelerated iterative MapReduce framework with intelligent data-chunking algorithms
Graph Dependent Models
Two Flavors – Frameworks built to support graph updates (PageRank), and frameworks that build data-flow DAGs from execution requests. Both flavors rely on graph structure for optimization.
Why
Graphs contain information about dependencies!
How
Limit computation based on need
Recognize potential for optimization based on structure (cycles, vertical and horizontal fusion)
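A toy sketch of need-based recomputation over a dependency DAG (all names hypothetical; real frameworks track dependencies at far coarser granularity): when one input changes, only the nodes downstream of it are re-executed.

```python
# Dependency DAG: node -> nodes that depend on it.
graph = {
    "a": ["sum"],
    "b": ["sum"],
    "sum": ["out"],
    "out": [],
}
values = {"a": 1, "b": 2}
rules = {
    "sum": lambda v: v["a"] + v["b"],
    "out": lambda v: v["sum"] * 10,
}

def propagate(node):
    # Re-run only the nodes reachable from the changed node.
    for dep in graph[node]:
        values[dep] = rules[dep](values)
        recomputed.append(dep)
        propagate(dep)

# Full initial evaluation.
values["sum"] = rules["sum"](values)
values["out"] = rules["out"](values)

recomputed = []
values["a"] = 5          # one input changes...
propagate("a")           # ...so only "sum" and "out" are re-executed, never "b"
# recomputed == ["sum", "out"], values["out"] == 70
```

This is the essence of Percolator-style incremental updates: the dependency structure bounds the work a change can trigger.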
Graph Dependent Models: Frameworks
Percolator – Developed by Google to replace MapReduce. Updates only a portion of PageRank indices by propagating the influence of new or altered web pages.
Pregel – A general framework for superstep-based graph updates. Allows for dynamic changes to graph structure.
CIEL – Uses runtime analysis of dependency graphs to dynamically optimize execution. Runtime analysis allows for detection of iterative/recursive patterns.
FlumeJava – Preprocesses the execution plan to generate a DAG, which is used to perform fusion-based optimizations.
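The superstep style Pregel popularized can be sketched in a few lines. This is an illustrative single-process simulation of vertex-centric shortest paths, not Pregel's actual API: each active vertex processes its inbox, updates its state, and sends messages along outgoing edges; a barrier separates supersteps.

```python
# Toy directed graph: vertex -> outgoing neighbors.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
dist = {v: float("inf") for v in edges}

inbox = {"a": [0]}                       # seed the source vertex
while inbox:
    outbox = {}
    for vertex, messages in inbox.items():
        best = min(messages)
        if best < dist[vertex]:          # vertex stays active only on improvement
            dist[vertex] = best
            for nbr in edges[vertex]:
                outbox.setdefault(nbr, []).append(best + 1)
    inbox = outbox                       # barrier: next superstep begins
# dist == {"a": 0, "b": 1, "c": 1, "d": 2}
```

The loop terminates when no vertex sends a message, which is exactly Pregel's halting condition.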
Time Aware Models
Time aware models use logical timestamps in conjunction with graphical data-flow models to achieve greater flexibility in program design, while reducing computational latency.
Why
More flexible program design eliminates the need to combine several application-specific frameworks
How
A more expressive language to define complex parallel patterns
A communication-transparency layer to provide automated event coordination
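A toy sketch of timestamp-based coordination in this style (loosely inspired by timely dataflow's completion guarantees; all names are hypothetical): messages arrive asynchronously tagged with logical epochs, and an operator finalizes an epoch only once the system promises that no further messages can carry that timestamp.

```python
from collections import defaultdict

buffered = defaultdict(list)   # per-epoch messages, possibly out of order
finalized = {}                 # deterministic per-epoch results

def on_message(epoch, value):
    # Accept messages in any order; just buffer them under their epoch.
    buffered[epoch].append(value)

def on_frontier_advance(frontier):
    # Guarantee: no future message has epoch < frontier, so those epochs
    # are complete and can be reduced and emitted deterministically.
    for epoch in sorted(e for e in buffered if e < frontier):
        finalized[epoch] = sum(buffered.pop(epoch))

on_message(1, 10)
on_message(0, 3)
on_message(1, 7)
on_frontier_advance(2)   # epochs 0 and 1 are now complete
# finalized == {0: 3, 1: 17}
```

The completion guarantee is what lets stateful nodes exchange messages asynchronously yet still produce the same answer as a globally coordinated execution.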
Time Aware Models: Frameworks
Naiad – Represents computation as a graph with stateful nodes that can send data-flow messages without global coordination. Asynchronous events are coordinated by a guarantee that messages will be delivered at a given logical time, or a notification that no further messages can arrive for that time. The result is a system that outperforms popular frameworks in their target domains.
Take Away
MapReduce is a simple model that provides useful transparency, but it was not meant to do everything!
With increased flexibility comes programmer responsibility...
The Ultimate Goal
A single framework that provides transparent scalability/fault tolerance, a friendly interface, automatic optimization, and expressivity.
References I
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and
Bongki Moon,
“Parallel data processing with mapreduce: a survey,”
ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
References II
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin
Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica,
“Resilient distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing,”
in Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation. USENIX Association, 2012, pp. 2–2.
Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert,
Ilan Horn, Naty Leiser, and Grzegorz Czajkowski,
“Pregel: a system for large-scale graph processing,”
in Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data. ACM, 2010, pp. 135–146.
References III
Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams,
Robert R Henry, Robert Bradshaw, and Nathan Weizenbaum,
“FlumeJava: easy, efficient data-parallel pipelines,”
in ACM Sigplan Notices. ACM, 2010, vol. 45, pp. 363–375.
Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul
Barham, and Martin Abadi,
“Naiad: a timely dataflow system,”
in Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles. ACM, 2013, pp. 439–455.
References IV
Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A Thekkath,
Yuan Yu, and Li Zhuang,
“Nectar: Automatic management of data and computation in datacenters,”
in OSDI, 2010, pp. 75–88.
The End