The document surveys several shortcomings of the MapReduce paradigm, as outlined by its critics. It notes that while Hadoop has become a popular solution for large-scale data processing, MapReduce has limitations: it exposes a low-level programming interface, supports only batch processing, handles uneven data distribution (skew) poorly, and lacks support for iterative/recursive applications and incremental computation. Critics argue it is a step backwards from database technologies and is missing functionality such as updates, transactions, and tools for data visualization.
The document provides an abstract for a paper on the Hadoop framework. Hadoop is a software framework that supports data-intensive distributed applications under an open-source license; it was inspired by Google's MapReduce and Google File System papers. The paper presents the history, development, and current state of Hadoop technology, which is now maintained by the Apache Software Foundation, with commercial distributions available from vendors such as Cloudera. It includes chapters on an introduction to Hadoop, its history, key technologies such as MapReduce and HDFS, other related Apache projects, and instructions for setting up a single-node Hadoop cluster.
Apache Hadoop is a software framework that supports distributed applications under a free license. It lets applications work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, written in the Java programming language. Yahoo! has been the largest contributor to the project and uses Hadoop extensively in its business.
Understanding Big Data summarizes big data and popular big data technologies. It discusses how big data is generated from various sources and is too large to be processed by traditional databases. Popular technologies like Hadoop, HDFS, MapReduce, Hive, Pig, HBase, and Mahout are able to collect, store, process, and analyze big data. Companies are using big data to gain insights from customer data, optimize operations, prevent fraud, and make recommendations.
This document provides an overview of a Hadoop session that will cover:
1. An introduction to big data including the history and evolution of Hadoop and how it addresses challenges with traditional databases.
2. The Hadoop architecture and ecosystem including components like HDFS, MapReduce, HBase and how they address issues with scalability, flexibility and cost compared to traditional databases.
3. Hands-on analysis of a soccer dataset using Hadoop to perform tasks like data classification, prediction and player analysis.
re:Introduce Big Data and Hadoop Eco-system (Shakir Ali)
This document provides an overview of big data and the Hadoop ecosystem. It defines big data as large and complex datasets that are difficult to process using traditional data management tools. Characteristics of big data include volume, variety, velocity and veracity. The document discusses challenges of managing big data and how Hadoop provides solutions through its distributed architecture. It also summarizes some prominent Apache projects in the Hadoop ecosystem like Pig, Hive, Spark and Hbase.
Is It A Right Time For Me To Learn Hadoop? Find Out (Edureka!)
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has led to better careers, salaries, and job opportunities for many professionals.
LinkedIn is a large professional social network with 50 million users from around the world. It faces big data challenges at scale, such as caching a user's third-degree network of up to 20 million connections and performing searches across 50 million user profiles. LinkedIn uses Hadoop and other scalable architectures, such as distributed search engines and custom graph engines, to solve these problems. Hadoop provides a scalable framework to process massive amounts of user data across thousands of nodes through its MapReduce programming model and HDFS distributed file system.
This document discusses big data tools and management at large scales. It introduces Hadoop, an open-source software framework for distributed storage and processing of large datasets using MapReduce. Hadoop allows parallel processing of data across thousands of nodes and has been adopted by large companies like Yahoo!, Facebook, and Baidu to manage petabytes of data and perform tasks like sorting terabytes of data in hours.
BreizhJUG - January 2014 - Big Data - Dataiku - Pages Jaunes (Dataiku)
This document provides an overview of big data and various big data tools including Pig, Hive, and Cascading. It discusses the history and motivation for each tool, how they work by mapping operations to MapReduce jobs, and compares key aspects of their data models, typing, and procedural vs declarative styles. The document is intended as a training presentation on these popular big data frameworks.
The document discusses big data and its applications. It defines big data as large and complex data sets that are difficult to process using traditional data management tools. It outlines the three V's of big data - volume, variety, and velocity. Various types of structured, semi-structured, and unstructured data are described. Examples are given of how big data is used in various industries like automotive, finance, manufacturing, policing, and utilities to improve products, detect fraud, perform simulations, track suspects, and monitor assets. Popular big data software like Hadoop and MongoDB are also mentioned.
The document discusses big data, including what it is, sources of big data like social media and stock exchange data, and the three Vs of big data - volume, velocity, and variety. It then discusses Hadoop, the open-source framework for distributed storage and processing of large datasets across clusters of computers. Key components of Hadoop include HDFS for distributed storage, MapReduce for distributed computation, and YARN which manages computing resources. The document also provides overviews of Pig and Jaql, programming languages used for analyzing data in Hadoop.
How to build and run a big data platform in the 21st century (Ali Dasdan)
The document provides an overview of big data platform architectures that have been built by various companies and organizations. It discusses self-built platforms from companies like Airbnb, Netflix, Facebook, Slack, and Uber. It also covers cloud-built platforms on IBM Cloud, Microsoft Azure, Google Cloud, and Amazon AWS. Consulting-built platforms from Cloudera and ThoughtWorks are presented. Finally, it introduces the NIST Big Data Reference Architecture as a standard reference model and discusses generic batch vs streaming architectures like Lambda and Kappa.
This document presents information on Big Data and its association with Hadoop. It discusses what Big Data is, defining it as data too large and complex for traditional databases. It also covers the 3 V's of Big Data: volume, variety, and velocity. The document then introduces Hadoop as a tool for Big Data analytics, describing what Hadoop is and its key components and features, such as being scalable, reliable, and economical. MapReduce is discussed as Hadoop's programming model using mappers and reducers. Finally, the document concludes that Hadoop enables distributed, parallel processing of large data across inexpensive servers.
The document is a seminar report on the Hadoop framework. It provides an introduction to Hadoop and describes its key technologies including MapReduce, HDFS, and programming model. MapReduce allows distributed processing of large datasets across clusters. HDFS is the distributed file system used by Hadoop to reliably store large amounts of data across commodity hardware.
The document discusses big data analytics using Hadoop. It provides an introduction to big data and the 5 V's of big data. It then discusses limitations of relational database management systems for big data. The document outlines the history and development of Hadoop. It describes the components, architecture and advantages of Hadoop. Some applications of Hadoop for big data analytics are also highlighted along with disadvantages. The conclusion reiterates that Hadoop is an open source tool for handling big data analytics.
Topic 5: MapReduce Theory and Implementation (Zubair Nabi)
The document discusses MapReduce theory and implementation. It describes how MapReduce was designed by Google engineers to abstract complex distributed computations by allowing programmers to specify map and reduce functions. The core of MapReduce is applying map tasks in parallel on data blocks and collecting the outputs by key before applying reduce tasks. Implementations involve a master coordinating work across a cluster of shared-nothing commodity machines. Common examples like word count are provided to illustrate the programming model.
This document discusses technologies used for big data. It describes big data as massive volumes of structured and unstructured data that is difficult to process using traditional databases. It then lists and describes 10 technologies used for big data, including column-oriented databases, schema-less databases, MapReduce, Hadoop, Hive, Pig, WibiData, Platfora, storage technologies, and SkyTree. These technologies allow for processing and analyzing large datasets.
Today's era is often called the era of data: every field of computing generates huge amounts of data. Society is increasingly dependent on computers, so vast quantities of data are produced every second in structured, unstructured, or semi-structured formats. Such data is generally referred to as big data, and analyzing it is one of the biggest challenges in the world today. Hadoop is an open-source framework for storing and processing big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale out from a single server to thousands of machines, each offering local computation and storage, and it generally follows horizontal processing. MapReduce programs typically run on the Hadoop framework to process large volumes of structured and unstructured data. This paper describes the different join strategies used in MapReduce programming to combine data from two files in the Hadoop framework and discusses the skew problem associated with them.
This document discusses MapReduce application scripting. It provides an overview of Pig Latin and Cascading, two frameworks for writing MapReduce applications in a declarative way. Pig Latin expresses data flows as a sequence of steps and allows custom user-defined functions. Cascading allows creating MapReduce pipelines in JVM languages using a source-pipe-sink paradigm. The document defines key terminology and provides examples of MapReduce jobs written in Pig Latin.
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
Performance Improvement of Heterogeneous Hadoop Cluster using Ranking Algorithm (IRJET Journal)
This document proposes using a ranking algorithm and sampling algorithm to improve the performance of a heterogeneous Hadoop cluster. The ranking algorithm prioritizes data distribution based on node frequency, so that higher frequency nodes are processed first. The sampling algorithm randomly selects nodes for data distribution instead of evenly distributing across all nodes. The proposed approach reduces computation time and improves overall cluster performance compared to the existing approach of evenly distributing data across nodes of varying sizes. Results show the proposed approach reduces execution time for various file sizes compared to the existing approach.
The document discusses Big Data, MapReduce, Hadoop, and Pydoop. It provides an overview of MapReduce and how it works, describing the map and reduce functions. It also describes Hadoop, the popular open-source implementation of MapReduce, including its architecture and core components like HDFS and how tasks are executed in a distributed manner. Finally, it briefly introduces Pydoop as a way to use Python with Hadoop.
There is a growing trend of applications that have to handle huge volumes of information, yet analyzing such information remains a very difficult problem today. Several techniques can be considered for such data: technologies like grid computing, volunteer computing, and RDBMSs are potential candidates, and the still-maturing Hadoop tool is another option. We survey all of these techniques to find a suitable approach for managing and working with Big Data.
This document discusses efficient analysis of big data using the MapReduce framework. It introduces the challenges of analyzing large and complex datasets, and describes how MapReduce addresses these challenges through its map and reduce functions. MapReduce allows distributed processing of big data across clusters of computers using a simple programming model.
An overview of big data and Hadoop: the architecture Hadoop uses and the way it works on data sets. The slides also show the various fields where these technologies are most often used and implemented.
We are in the age of big data, which involves the collection of large datasets. Managing and processing large datasets is difficult with existing traditional database systems. Hadoop and MapReduce have become among the most powerful and popular tools for big data processing. Hadoop MapReduce, a powerful programming model, is used for analyzing large datasets with parallelization, fault tolerance, and load balancing; it is also elastic, scalable, and efficient. MapReduce is combined with the cloud to form a framework for the storage, processing, and analysis of massive machine-maintenance data in a cloud computing environment.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
This document provides an overview and agenda for a presentation on how Google handles big data. The presentation covers Google Cloud Platform and how it can be used to run Hadoop clusters on Google Compute Engine and leverage BigQuery for analytics. It also discusses how Google processes big data internally using technologies like MapReduce, BigTable and Dremel and how these concepts apply to customer use cases.
This document discusses topics related to NoSQL data management and distribution models in big data analytics. It covers key-value and document data models, as well as graph databases and schema-less databases. It then describes several distribution models including single server, sharding, master-slave replication, peer-to-peer replication, and combining sharding and replication. Specific examples of these models in MongoDB and Cassandra are provided. The next session will cover Cassandra's data model.
Mankind has stored more than 295 billion gigabytes (295 exabytes) of data since 1986, according to a report by the University of Southern California. Storing and monitoring this data around the clock in widely distributed environments is a huge task for global service organizations. These datasets require high processing power that traditional databases cannot offer, as the data is stored in an unstructured format. Although the MapReduce paradigm, via Java-based Hadoop, can be used to attack this problem, it does not provide maximum functionality on its own. These drawbacks can be overcome using Hadoop Streaming techniques, which allow users to supply non-Java executables for processing these datasets. This paper proposes a THESAURUS model that enables faster and easier business analysis.
This talk was for a GDG Fresno meeting. The demo used Google Compute Engine and Google Cloud Storage. The actual talk differed from the slides; there were a lot of good questions from the audience, and it diverted to side topics many times.
This document discusses network communication in Unix systems. It describes how the networking infrastructure abstracts different network architectures and consists of network protocols, address families, and additional facilities. It also summarizes the network subsystem layers, memory management using mbufs, data flow between sockets and the network, common network protocols, network interfaces, routing, and protocol control blocks.
This document discusses the background and advantages of virtualization. It describes how IBM originally solved the problem of running multiple operating systems on the same machine by adding a virtual memory monitor or hypervisor. The hypervisor sits between operating systems and hardware, giving each OS the illusion of full hardware control while actually multiplexing hardware access. This allows server consolidation by running multiple OSes on fewer physical servers. The document then discusses challenges of virtualizing privileged operations, system calls, and virtual memory that require interception and emulation by the hypervisor.
AOS Lab 10: File system -- Inodes and beyond (Zubair Nabi)
This document provides a summary of file system concepts in the xv6 operating system, including:
1) Inodes are data structures that represent files and provide metadata and pointers to file data blocks. On-disk inodes are read into memory inodes when files are accessed.
2) Directories are represented by special directory inodes containing directory entries with names and pointers to other inodes.
3) The file system layout divides the disk into sections for the boot sector, superblock, inodes, bitmap, data blocks, and log for atomic transactions.
AOS Lab 9: File system -- Of buffers, logs, and blocks (Zubair Nabi)
The document describes the file system layers in xv6, including the buffer cache, logging, and on-disk layout. The buffer cache synchronizes access to disk blocks and caches popular blocks in memory. The logging layer ensures atomicity by wrapping file system updates in transactions written to a log on disk before writing to the file system structures. The on-disk layout divides the disk into sections for the boot sector, superblock, inodes, bitmap, data blocks, and log blocks.
AOS Lab 8: Interrupts and Device Drivers (Zubair Nabi)
This document discusses interrupts, device drivers, and the xv6 operating system. It provides recaps of previous labs on extraordinary events like interrupts, exceptions, and system calls. It explains how interrupts are handled on multi-processor systems using the I/O APIC to route interrupts and the LAPIC as a per-CPU interrupt controller. An example is given of how timer interrupts are used to track time and scheduling. Device drivers are introduced as code that manages devices by providing interrupt handlers and controlling device operations. The disk driver is given as an example to copy data between disk and memory in 512-byte sectors.
Page tables allow the OS to multiplex process address spaces onto physical memory, protect memory between processes, and map kernel memory in user address spaces. Page tables are stored as a two-level tree structure with a page directory and page table pages. Virtual addresses are translated to physical addresses by indexing the page directory and table to obtain the physical page number in the page table entry.
The document discusses process scheduling in an operating system. It describes how an OS runs more processes than it has processors by providing each process with a virtual processor and multiplexing these across physical processors. When a process performs I/O or its time quantum expires, the scheduler selects another process to run using a timer interrupt. Context switching involves saving the context of the current process and restoring the next process using the swtch function. The scheduler runs in a loop, acquiring the process table lock to select a RUNNABLE process and releasing it to allow other CPUs access between iterations.
The document discusses system calls and how they are handled in operating systems. It explains that system calls allow user processes to request services from the kernel by generating an interrupt that switches the processor into kernel mode. On x86 processors, the interrupt handler saves process state and routes the call to the appropriate kernel code based on an interrupt descriptor table with 256 entries. The document provides details on how Linux/x86 implements system calls, exceptions, and interrupts using the IDT, and switches between user and kernel mode to maintain isolation.
AOS Lab 4: If you liked it, then you should have put a “lock” on it (Zubair Nabi)
The document discusses concurrency issues that arise in operating systems and how xv6 handles them using locks. It begins by explaining how multiple CPUs can interfere with each other when sharing kernel data structures, and notes that even on single-CPU systems, interrupt handlers can interfere with non-interrupt code; xv6 uses locks to address both situations. The document then provides examples of race conditions that can occur without locks, such as when multiple processors concurrently add to a shared linked list, and shows how xv6 implements locks and uses them to make operations like inserting into a linked list atomic. It also discusses challenges like lock ordering, handling locks in interrupt handlers, and when to use coarse-grained locking.
The document describes how the first process is started on a PC. When a PC boots, the BIOS starts executing and loads the boot loader from the boot sector of the boot disk. The boot loader then loads the kernel into memory and jumps to it. The kernel initializes devices and creates the first process by setting up its page table and memory space. The first process's state is set to runnable, and the scheduler runs it, switching to its address space. The first process makes a system call to load the /init program, which sets up the console and runs the shell as the main process.
1) xv6 is a reimplementation of the Unix Version 6 operating system (V6) in ANSI C. It is used at MIT for teaching operating systems concepts.
2) The document discusses installing xv6 on a system by cloning its source code from GitHub and compiling it. Key steps include installing dependencies, QEMU, and cloning the xv6 source code.
3) An overview of xv6's structure is provided, noting it is a monolithic kernel that provides services to user processes via system calls, allowing processes to alternate between user and kernel space.
This document provides an introduction to Linux and common Linux commands. It discusses key facts about Unix, how Linux is based on Unix, popular Linux distributions like Ubuntu, and common file system layout and commands for manipulating files and directories. The document concludes with an assignment to write a Bash script to analyze and compare British and American English dictionaries.
The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
Raabta: Low-cost Video Conferencing for the Developing World (Zubair Nabi)
This document proposes Raabta, a low-cost video conferencing system for developing regions. Raabta leverages existing analog cable TV networks and uses inexpensive Raspberry Pi devices as endpoints. It was designed with principles of low cost, low power usage, tolerance of failure-prone environments, and a simple interface. The system avoids reliance on internet connectivity by using the cable networks for both upstream and downstream video streams encoded for robust transmission. This approach could enable affordable, widespread communication tools for communities with limited infrastructure and resources.
The Anatomy of Web Censorship in Pakistan (Zubair Nabi)
This document summarizes a study on internet censorship in Pakistan. It found that censorship mechanisms in Pakistan were upgraded in mid-2013 from ISP-level blocking to centralized blocking at the internet exchange point (IXP) level. Most websites were blocked through DNS redirection, while some used HTTP redirection. After the upgrade, blocking was done through 200 response packets injected at the IXP level. Public VPNs and web proxies were popular ways for citizens to circumvent restrictions.
This document discusses Hive, an open source data warehousing system built on top of Hadoop. Hive allows users to query data stored in Hadoop using a SQL-like language called HiveQL. Queries are compiled into MapReduce jobs for execution. The document describes Hive's data model, data types, HiveQL language, and metastore. It provides an example of using Hive to analyze Facebook status updates.
Topic 15: Datacenter Design and Networking (Zubair Nabi)
The document discusses datacenter network design and transport protocols. It begins with an introduction to traditional datacenter network topologies, which use a 2-3 level tree structure. It then covers fat-tree and DCell topologies as alternatives. The document also discusses how TCP, while commonly used, is not optimal for datacenter networks due to design assumptions like round-trip time that differ from wide-area networks. It suggests transport protocols designed for datacenter characteristics could improve performance.
Topic 14: Operating Systems and Virtualization (Zubair Nabi)
The document discusses operating systems and virtualization. It provides an overview of several Linux distributions including their key features and use cases. It also describes Xen, a hypervisor used to run multiple virtual machines on a single physical machine. Xen uses a dom0 domain to control hardware access and export virtual devices to domU guest virtual machines. I/O is handled through backend and frontend device drivers in the dom0 and domUs respectively.
The document discusses different cloud computing stacks, including CloudStack and OpenStack. It provides details on the components and features of each stack. CloudStack is presented as a console for managing data center resources like virtual machines, networking, and storage. It enables IaaS capabilities. OpenStack is described as an open source software for building public and private clouds, with components that manage compute, storage, networking, identity, and dashboards. It supports multiple hypervisors and is used by many large companies.
Lab 5: Interconnecting a Datacenter using Mininet (Zubair Nabi)
This document discusses using Mininet, an emulator for real-world networks that uses real kernel, switch, and application code on a single machine. It describes how Mininet uses Linux containers to emulate hosts, switches, and links. It also explains that Mininet creates a container and network namespace for each virtual host, with virtual interfaces connecting hosts to software switches via veth links. Finally, it briefly outlines Mininet's command line and Python interfaces.
Topic 7: Shortcomings in the MapReduce Paradigm
1. 7: Shortcomings in the MapReduce Paradigm
Zubair Nabi
zubair.nabi@itu.edu.pk
April 19, 2013
2. Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single-output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
9. Users [1]
Adobe: Several areas, from social services to unstructured data storage and processing
eBay: A 532-node cluster storing 5.3PB of data
Facebook: Used for reporting/analytics; one cluster with 1,100 nodes (12PB) and another with 300 nodes (3PB)
LinkedIn: 3 clusters with 4,000 nodes collectively
Twitter: To store and process Tweets and log files
Yahoo!: Multiple clusters with 40,000 nodes collectively; the largest cluster has 4,500 nodes!
[1] http://wiki.apache.org/hadoop/PoweredBy
16. But all is not well
Over the years, Hadoop has become a one-size-fits-all solution to data
intensive computing
As early as 2008, David DeWitt and Michael Stonebraker asserted that
MapReduce was a “major step backwards” for data intensive
computing
They opined:
MapReduce is a major step backwards in database access because it
negates schema and is too low-level
It has a sub-optimal implementation as it, makes use of brute force
instead of indexing, does not handle skew, and uses data pull instead of
push
It is just rehashing old database concepts
It is missing most DBMS functionalities, such as updates, transactions,
etc.
It is incompatible with DBMS tools, such as human visualization, data
replication from one DBMS to another, etc.
Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 5 / 31
17. Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
20. Introduction
Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work
Such reducers become “stragglers”
A large number of real-world applications follow long-tailed (Zipf-like) distributions
23. Wordcount and skew
Text corpora have a Zipfian skew, i.e. a very small number of words account for most occurrences
For instance, of the 242,758 words in the dataset used to generate the original figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set
Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
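To make the effect concrete, here is a minimal sketch (not from the original deck) that samples words from a Zipf distribution and hash-partitions them across reducers, the way Hadoop's default HashPartitioner does. The exponent 1.1 and the reducer count are illustrative assumptions, not values from the slide's dataset:

```python
# Minimal sketch (not from the deck): Zipfian word frequencies turn
# into uneven reducer load under hash partitioning.
import numpy as np

rng = np.random.default_rng(42)
NUM_PAIRS, NUM_REDUCERS = 50_000, 10

# Draw word ids from a Zipf distribution; exponent 1.1 is an assumed
# value in the range commonly reported for text corpora.
word_ids = rng.zipf(1.1, size=NUM_PAIRS)

# HashPartitioner-style assignment: reducer = hash(key) % R.
load = [0] * NUM_REDUCERS
for w in word_ids:
    load[hash(int(w)) % NUM_REDUCERS] += 1

print("key/value pairs per reducer:", load)
print("max/mean imbalance: %.1fx" % (max(load) * NUM_REDUCERS / NUM_PAIRS))
```

The few hottest keys all land on a handful of reducers, so those reducers finish long after the rest: exactly the straggler effect the slide describes.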
30. Page rank and skew
Even Google’s implementation of its core PageRank algorithm is plagued by the skew problem
Google uses PageRank to calculate a webpage’s relevance for a given search query
Map: Emit the outlinks for each page
Reduce: Calculate the rank per page
The skew in intermediate data exists due to the huge disparity in the number of incoming links across pages on the Internet
The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
For instance, Facebook has 49,376,609 incoming links (at the time of writing) while the presenter’s personal webpage has only 4
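For reference, a minimal single-machine sketch of the map/reduce pair described above. This is illustrative only: it runs one PageRank iteration in memory, ignores dangling pages, and assumes a damping factor of 0.85; the three-page graph is made up:

```python
# Minimal single-machine sketch of one PageRank iteration as a
# map/reduce pair (illustrative only; ignores dangling pages).
from collections import defaultdict

DAMPING = 0.85

def map_page(page, rank, outlinks):
    # Emit the page's link structure plus its rank contributions.
    yield page, ("links", outlinks)
    for target in outlinks:
        yield target, ("contrib", rank / len(outlinks))

def reduce_page(page, values, num_pages):
    # Sum the incoming contributions to compute the new rank.
    contribs = sum(v for tag, v in values if tag == "contrib")
    return (1 - DAMPING) / num_pages + DAMPING * contribs

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {p: 1 / len(graph) for p in graph}

grouped = defaultdict(list)                  # stands in for the shuffle
for page, rank in ranks.items():
    for key, value in map_page(page, rank, graph[page]):
        grouped[key].append(value)

ranks = {p: reduce_page(p, vals, len(graph)) for p, vals in grouped.items()}
print(ranks)
```

A reducer for a page like Facebook would receive tens of millions of contribution values, while a reducer for an obscure page receives a handful: the intermediate key/value distribution mirrors the in-link skew.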
33. Zipf distributions are everywhere
Zipf distributions are also followed by inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
P2P systems exhibit Zipf distributions too, both in terms of users and content
The same holds for web caching schemes as well as email and social networks
34. Outline (next section: 3 Heterogeneous Environment)
38. Introduction
In the MapReduce model, tasks which take exceptionally long are labelled “stragglers”
The framework launches a speculative copy of each straggler on another machine, expecting it to finish more quickly
Without this, the overall job completion time is dictated by the slowest straggler
On Google clusters, speculative execution can reduce job completion time by 44%
44. Hadoop’s assumptions regarding speculation
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
6 Tasks within the same phase require roughly the same amount of work
47. Assumptions 1 and 2
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
Both break down in heterogeneous environments, which consist of multiple generations of hardware
49. Assumption 3
3 There is no cost to launching a speculative copy on an otherwise idle slot/node
Breaks down due to shared resources
51. Assumption 4
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
Breaks down due to the fact that in reduce tasks the shuffle phase takes the longest time, as opposed to the other two phases
53. Assumption 5
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
Breaks down because task completion is spread across time owing to uneven workload
55. Assumption 6
6 Tasks within the same phase require roughly the same amount of work
Breaks down due to data skew
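The LATE scheduler of Zaharia et al. (reference 2) addresses several of these broken assumptions by ranking candidate tasks on their estimated time left rather than their raw progress score. A minimal sketch of that estimate (the numbers are made up for illustration):

```python
# Sketch of the progress-rate idea from Zaharia et al. (OSDI '08):
# instead of speculating on the lowest *progress score* (Hadoop's
# heuristic), estimate each task's remaining time from its *rate*.
def time_left(progress_score, elapsed_seconds):
    rate = progress_score / elapsed_seconds   # work done per second so far
    return (1.0 - progress_score) / rate      # estimated seconds remaining

# Two tasks with the same score but very different prospects:
print(time_left(0.3, elapsed_seconds=30))    # recently started, fast: ~70s left
print(time_left(0.3, elapsed_seconds=300))   # genuine straggler: ~700s left
```

Under Hadoop's score-only heuristic these two tasks look identical; the rate-based estimate correctly flags only the second one for speculation.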
56. Outline (next section: 4 Low-level Programming Interface)
60. Introduction
The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
Hacks need to be put into place for different data flows, such as joins or multiple stages
Custom code has to be written for common DB operations, such as projection and filtering
The opaque nature of the map and reduce functions makes it impossible to perform optimizations, such as operator reordering
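As a rough illustration of the "custom code" point, the sketch below hand-codes a filter plus a projection as a Hadoop Streaming-style mapper; the tab-separated field layout and the date cutoff are assumptions invented for the example. In SQL, or in Pig Latin (reference 3), this would be a one-liner:

```python
#!/usr/bin/env python3
# Hadoop Streaming-style mapper hand-coding the equivalent of
#   SELECT user, url FROM visits WHERE time > '2013-01-01'
# (field positions and the cutoff date are assumed for illustration).
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                          # skip malformed records
    user, url, time = fields[0], fields[1], fields[2]
    if time > "2013-01-01":               # filtering
        print(f"{user}\t{url}")           # projection
```

Because the filter and projection are buried inside an opaque user function, the framework cannot see them, let alone reorder them with other operators the way a query optimizer would.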
61. Outline (next section: 5 Strictly Batch-processing)
65. Introduction
In the case of MapReduce, the entire output of a map or reduce task needs to be materialized to local storage before the next stage can commence
This simplifies fault-tolerance
Reducers have to pull their input instead of the mappers pushing it
This negates pipelining, result estimation, and continuous queries (stream processing)
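A toy illustration of the resulting barrier (a simulation, not Hadoop code): no reduce output can be produced until the entire map output has been materialized and sorted, which is precisely what rules out streaming early results:

```python
# Toy simulation of the batch barrier: every map output must be fully
# materialized (here: collected into a list) before the first reduce
# call can run, so no reduce output can be emitted early.
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    intermediate = []                     # stands in for local disk
    for rec in records:
        intermediate.extend(mapper(rec))  # map phase runs to completion
    intermediate.sort(key=itemgetter(0))  # shuffle/sort barrier
    for key, group in groupby(intermediate, key=itemgetter(0)):
        yield reducer(key, [v for _, v in group])

counts = run_job(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
print(list(counts))   # nothing is available until the barrier clears
```

Systems like MapReduce Online (reference 4) relax exactly this barrier by having mappers push partial output to reducers as it is produced.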
66. Outline (next section: 6 Single-input/single output and Two-phase)
68. Introduction
1 Not all applications can be broken down into just two phases, such as complex SQL-like queries
2 Tasks take in just one input and produce one output (see the join sketch after this list)
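The standard workaround for two-input operations is the reduce-side join: tag every record with its source relation, push the union of both relations through a single job, and re-split them in the reducer. A minimal in-memory sketch, with made-up relations:

```python
# Sketch of the classic workaround for joins under a single-input
# model: tag each record with its source relation, feed the union of
# both relations through one job, and re-split them in the reducer.
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]             # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

def mapper():
    for uid, name in users:
        yield uid, ("U", name)                  # tag with relation
    for uid, item in orders:
        yield uid, ("O", item)

grouped = defaultdict(list)                     # stands in for the shuffle
for key, value in mapper():
    grouped[key].append(value)

for uid, values in grouped.items():             # reduce-side join
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    for name in names:
        for item in items:
            print(uid, name, item)
```

Dataflow systems such as Dryad and CIEL (references 5 and 6) remove the need for this tagging trick by supporting arbitrary multi-input, multi-stage graphs natively.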
69. Outline (next section: 7 Iterative and Recursive Applications)
74. Introduction
1 Hadoop is widely employed for iterative computations
2 For machine learning applications, the Apache Mahout library is used atop Hadoop
3 Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test (see the sketch after this list)
4 The driver approach provides no fault-tolerance and incurs job-submission overhead on every iteration
5 Loop-invariant data is materialized to storage in each iteration
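A sketch of that external-driver pattern; the "job" here is stubbed with a simple damped-averaging step so the loop is actually runnable, whereas a real driver such as Mahout's submits a full Hadoop job per pass:

```python
# Sketch of the external-driver pattern: one full job submission per
# iteration, with convergence tested outside the framework. The job
# body is a stand-in; only the driver structure matters.

def run_job(state):
    # Stand-in for a full MapReduce submission: one damped averaging
    # step, chosen so the loop converges.
    mean = sum(state) / len(state)
    return [0.5 * x + 0.5 * mean for x in state]

def converged(old, new, eps=1e-6):
    return max(abs(a - b) for a, b in zip(old, new)) < eps

state = [1.0, 4.0, 7.0]
for i in range(100):                 # each pass = a separate Hadoop job:
    new_state = run_job(state)       # job-submission overhead every time,
    if converged(state, new_state):  # driver-side convergence test,
        break                        # no fault-tolerance across iterations
    state = new_state
print(f"converged after {i + 1} iterations: {state}")
```

Note that in the real setting the entire input, including the part that never changes between iterations (e.g. the link graph in PageRank), is re-read from storage on every pass.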
75. Outline (next section: 8 Incremental Computation)
78. Introduction
1 Most workloads processed by MapReduce are incremental in nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
2 For instance, most iterations of PageRank run with very small modifications
3 Unfortunately, even with a small change in input, MapReduce re-performs the entire computation from scratch (see the sketch after this list)
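Systems such as Incoop (reference 7) attack this by memoizing task results at the granularity of input chunks, so a re-run with a small input change only recomputes the changed chunks. A minimal sketch of the idea, using a content hash of each chunk as the cache key:

```python
# Sketch of the memoization idea behind systems like Incoop (ref. 7):
# cache per-chunk map results keyed by a content hash, so a re-run
# with a small input change only recomputes the changed chunks.
import hashlib
from collections import Counter

cache = {}

def map_chunk(chunk):
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in cache:                        # only new or changed
        cache[key] = Counter(chunk.split())     # chunks are re-mapped
    return cache[key]

def word_count(chunks):
    total = Counter()
    for c in chunks:
        total.update(map_chunk(c))
    return total

v1 = ["a rose is a rose", "is a rose"]
print(word_count(v1))
v2 = ["a rose is a rose", "is a daisy"]   # only one chunk changed
print(word_count(v2))                     # only the second chunk re-runs
```

Plain MapReduce has no such cache: the second run would re-map both chunks even though only one of them changed.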
79. References
1 MapReduce: A major step backwards. http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
2 Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ’08). USENIX Association, Berkeley, CA, USA, 29-42.
3 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1099-1110.
80. References (2)
4 Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI ’10). USENIX Association, Berkeley, CA, USA.
5 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys ’07). ACM, New York, NY, USA, 59-72.
6 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
81. References (3)
7 Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11). ACM, New York, NY, USA.