Outline: Background, Main Points, Conclusion, References
Flexible and Efficient
Data Flow for MapReduce
Adam Martini
University of Oregon
martini@cs.uoregon.edu
Thursday 5th June, 2014
Flexible and Efficient Data Flow for MapReduce UO
What is MapReduce?
A framework popularized by Google circa 2004
Provides transparent scalability and fault tolerance for “embarrassingly parallel” tasks
Why do we need it?
Initially designed to solve PageRank scalability issues
Big Data is everywhere!
Programmers who are not parallel-computing experts need a way to efficiently manipulate very large datasets
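The model itself fits in a few lines. As a minimal in-process sketch (not any particular framework's API; all names here are illustrative), the canonical word-count job, simulating the map, shuffle, and reduce phases a real framework would distribute:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit an intermediate (word, 1) pair for every word in one input record.
    for word in doc.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one key.
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["the"] == 2
```

The framework's job is everything this sketch hides: partitioning the input, scheduling map/reduce tasks near their data, and re-running tasks that fail.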
Strengths
1) Massively scalable
2) Can operate on commodity hardware
3) Transparency:
Data locality
Fault tolerance
Weaknesses
Flexibility
Data flow (and algorithm design) is forced to conform to an oversimplified computational model
Efficiency (with respect to data flow)
Programs handle data non-optimally ⇒ performance loss
Take Away
Improved flexibility can improve efficiency, but the converse does not hold
A focus on performance gains leads to a wide variety of application-specific MapReduce derivatives
Data Flow Improvements
Extensions to existing frameworks focus on more flexible workflow, while new frameworks for specific applications focus on improved data flow.
3 Main Directions:
Iterative MapReduce
Graph Dependent Models
Time Aware Models
Iterative MapReduce
Primarily inspired by machine learning applications that iteratively update a set of dynamic variables using a set of static data instances.
Why
Data IO is slow!
How
Hold invariant data in memory
Hold reductions in memory for convergence checking
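A minimal single-process sketch of the idea (the damped mean-estimation update here is purely illustrative, not any framework's algorithm): the invariant data is loaded once and stays in memory, while the per-iteration reduction is also held in memory to drive the convergence check.

```python
# Invariant data: loaded once, cached across all iterations.
data = [2.0, 4.0, 6.0, 8.0]

estimate = 0.0                         # the dynamic variable being updated
for iteration in range(100):
    # map: per-record contribution toward the new estimate
    contributions = [x - estimate for x in data]
    # reduce: held in memory so convergence can be checked cheaply
    update = sum(contributions) / len(data)
    estimate += 0.5 * update           # damped update of the dynamic variable
    if abs(update) < 1e-9:             # convergence check on the reduction
        break
# estimate converges to the mean of the data, 5.0
```

In plain MapReduce each of these iterations would be a full job, re-reading the static data from disk every time; avoiding that IO is the entire point.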
Iterative MapReduce: Frameworks
Spark – A very popular framework that supports lazy evaluation, efficient fault tolerance, and iterative machine learning tasks... much faster than Hadoop!
HaLoop – The first framework to introduce convergence-checking data optimizations
Twister – Iterative MapReduce with an optional combine step before reduction (also available in Hadoop)
Shredder/Incoop – A GPU-accelerated iterative MapReduce framework with intelligent data-chunking algorithms
Graph Dependent Models
Two Flavors – Frameworks built to support graph updates (PageRank), and frameworks that build data-flow DAGs from execution requests. Both flavors rely on graph structure for optimization.
Why
Graphs contain information about dependencies!
How
Limit computation based on need
Recognize potential for optimization based on structure (cycles, vertical and horizontal fusion)
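A toy sketch of need-based recomputation over a dependency DAG (all names hypothetical; real frameworks track dependencies at far coarser granularity): when one input changes, only the nodes downstream of it are re-executed.

```python
# Dependency DAG: node -> nodes that depend on it.
graph = {
    "a": ["sum"],
    "b": ["sum"],
    "sum": ["out"],
    "out": [],
}
values = {"a": 1, "b": 2}
rules = {
    "sum": lambda v: v["a"] + v["b"],
    "out": lambda v: v["sum"] * 10,
}

def propagate(node):
    # Re-run only the nodes reachable from the changed node.
    for dep in graph[node]:
        values[dep] = rules[dep](values)
        recomputed.append(dep)
        propagate(dep)

# Full initial evaluation.
values["sum"] = rules["sum"](values)
values["out"] = rules["out"](values)

recomputed = []
values["a"] = 5          # one input changes...
propagate("a")           # ...so only "sum" and "out" are re-executed, never "b"
# recomputed == ["sum", "out"], values["out"] == 70
```

This is the essence of Percolator-style incremental updates: the dependency structure bounds the work a change can trigger.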
Graph Dependent Models: Frameworks
Percolator – Developed by Google to replace MapReduce. Updates only a portion of PageRank indices by propagating the influence of new or altered web pages.
Pregel – A general framework for superstep-based graph updates. Allows for dynamic changes to graph structure.
CIEL – Uses runtime analysis of dependency graphs to dynamically optimize execution. Runtime analysis allows for detection of iterative/recursive patterns.
FlumeJava – Preprocesses the execution plan to generate a DAG, which is used to perform fusion-based optimizations.
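The superstep style Pregel popularized can be sketched in a few lines. This is an illustrative single-process simulation of vertex-centric shortest paths, not Pregel's actual API: each active vertex processes its inbox, updates its state, and sends messages along outgoing edges; a barrier separates supersteps.

```python
# Toy directed graph: vertex -> outgoing neighbors.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
dist = {v: float("inf") for v in edges}

inbox = {"a": [0]}                       # seed the source vertex
while inbox:
    outbox = {}
    for vertex, messages in inbox.items():
        best = min(messages)
        if best < dist[vertex]:          # vertex stays active only on improvement
            dist[vertex] = best
            for nbr in edges[vertex]:
                outbox.setdefault(nbr, []).append(best + 1)
    inbox = outbox                       # barrier: next superstep begins
# dist == {"a": 0, "b": 1, "c": 1, "d": 2}
```

The loop terminates when no vertex sends a message, which is exactly Pregel's halting condition.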
Time Aware Models
Time aware models use logical timestamps in conjunction with graphical data-flow models to achieve greater flexibility in program design, while reducing computational latency.
Why
More flexible program design eliminates the need to combine several application-specific frameworks
How
A more expressive language to define complex parallel patterns
A communication-transparency layer to provide automated event coordination
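A toy sketch of timestamp-based coordination in this style (loosely inspired by timely dataflow's completion guarantees; all names are hypothetical): messages arrive asynchronously tagged with logical epochs, and an operator finalizes an epoch only once the system promises that no further messages can carry that timestamp.

```python
from collections import defaultdict

buffered = defaultdict(list)   # per-epoch messages, possibly out of order
finalized = {}                 # deterministic per-epoch results

def on_message(epoch, value):
    # Accept messages in any order; just buffer them under their epoch.
    buffered[epoch].append(value)

def on_frontier_advance(frontier):
    # Guarantee: no future message has epoch < frontier, so those epochs
    # are complete and can be reduced and emitted deterministically.
    for epoch in sorted(e for e in buffered if e < frontier):
        finalized[epoch] = sum(buffered.pop(epoch))

on_message(1, 10)
on_message(0, 3)
on_message(1, 7)
on_frontier_advance(2)   # epochs 0 and 1 are now complete
# finalized == {0: 3, 1: 17}
```

The completion guarantee is what lets stateful nodes exchange messages asynchronously yet still produce the same answer as a globally coordinated execution.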
Time Aware Models: Frameworks
Naiad – Represents computation as a graph with stateful nodes that can send data-flow messages without global coordination. Asynchronous events are coordinated by a guarantee that messages will be delivered at a given logical time, or a notification that no further messages can arrive for that time. The result is a system that outperforms popular frameworks in their target domains.
Take Away
MapReduce is a simple model that provides useful transparency, but it was not meant to do everything!
With increased flexibility comes programmer responsibility...
The Ultimate Goal
A single framework that provides transparent scalability/fault tolerance, a friendly interface, automatic optimization, and expressivity.
References I
Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: simplified data processing on large clusters,”
Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and
Bongki Moon,
“Parallel data processing with mapreduce: a survey,”
ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
References II
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin
Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica,
“Resilient distributed datasets: A fault-tolerant abstraction for in-memory
cluster computing,”
in Proceedings of the 9th USENIX conference on Networked Systems
Design and Implementation. USENIX Association, 2012, pp. 2–2.
Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert,
Ilan Horn, Naty Leiser, and Grzegorz Czajkowski,
“Pregel: a system for large-scale graph processing,”
in Proceedings of the 2010 ACM SIGMOD International Conference on
Management of data. ACM, 2010, pp. 135–146.
References III
Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams,
Robert R Henry, Robert Bradshaw, and Nathan Weizenbaum,
“FlumeJava: easy, efficient data-parallel pipelines,”
in ACM Sigplan Notices. ACM, 2010, vol. 45, pp. 363–375.
Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul
Barham, and Martin Abadi,
“Naiad: a timely dataflow system,”
in Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles. ACM, 2013, pp. 439–455.
References IV
Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A Thekkath,
Yuan Yu, and Li Zhuang,
“Nectar: Automatic management of data and computation in datacenters,”
in OSDI, 2010, pp. 75–88.
The End