Topic 7: Shortcomings in the MapReduce Paradigm
 

Cloud Computing Workshop 2013, ITU

Presentation Transcript

• 7: Shortcomings in the MapReduce Paradigm
  Zubair Nabi (zubair.nabi@itu.edu.pk)
  April 19, 2013
• Outline
  1. Hadoop everywhere!
  2. Skew
  3. Heterogeneous Environment
  4. Low-level Programming Interface
  5. Strictly Batch-processing
  6. Single-input/single-output and Two-phase
  7. Iterative and Recursive Applications
  8. Incremental Computation
• Users¹
  – Adobe: several areas, from social services to unstructured data storage and processing
  – eBay: a 532-node cluster storing 5.3 PB of data
  – Facebook: used for reporting/analytics; one cluster with 1100 nodes (12 PB) and another with 300 nodes (3 PB)
  – LinkedIn: 3 clusters with 4000 nodes collectively
  – Twitter: to store and process Tweets and log files
  – Yahoo!: multiple clusters with 40000 nodes collectively; the largest cluster has 4500 nodes
  ¹ http://wiki.apache.org/hadoop/PoweredBy
• But all is not well
  – Over the years, Hadoop has become a one-size-fits-all solution to data-intensive computing
  – As early as 2008, David DeWitt and Michael Stonebraker asserted that MapReduce was a “major step backwards” for data-intensive computing
  – They opined:
    – MapReduce is a major step backwards in database access because it negates schema and is too low-level
    – Its implementation is sub-optimal: it uses brute force instead of indexing, does not handle skew, and pulls data instead of pushing it
    – It merely rehashes old database concepts
    – It is missing most DBMS functionality, such as updates and transactions
    – It is incompatible with DBMS tools, such as human visualization and data replication from one DBMS to another
• Section 2: Skew
• Introduction
  – Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work than others
  – Such reducers become “stragglers”
  – A large number of real-world applications follow long-tailed (Zipf-like) distributions
• Wordcount and skew
  – Text corpora have a Zipfian skew, i.e. a very small number of words account for most occurrences
  – For instance, of the 242,758 words in the dataset used to generate the figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set
  – Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
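The effect described above can be simulated in a few lines of plain Python (a sketch with illustrative names, not Hadoop code): draw word IDs from a Zipf-like distribution and assign keys to reducers the way Hadoop's default partitioner does (hash mod R), then compare reducer loads.

```python
import random
from collections import Counter

def zipf_sample(n_words, vocab_size, s=1.0, seed=0):
    """Draw word IDs from a Zipf-like distribution: P(rank r) is proportional to 1/r^s."""
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, vocab_size + 1)]
    return rng.choices(range(vocab_size), weights=weights, k=n_words)

def reducer_load(words, n_reducers):
    """Count how many intermediate values each reducer receives under
    hash partitioning (here, ID mod R stands in for hash(key) mod R)."""
    load = Counter()
    for w in words:
        load[w % n_reducers] += 1
    return load

words = zipf_sample(n_words=100_000, vocab_size=10_000)
load = reducer_load(words, n_reducers=10)
# The reducer that happens to own the most frequent keys receives far
# more values than the others, so it finishes last.
print(sorted(load.values(), reverse=True))
```

Hash partitioning balances the number of *distinct keys* per reducer, but under a Zipfian key distribution it cannot balance the number of *values*, which is what determines reduce-side work.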
• PageRank and skew
  – Even Google’s implementation of its core PageRank algorithm is plagued by the skew problem
  – Google uses PageRank to calculate a webpage’s relevance for a given search query
  – Map: emit the outlinks for each page
  – Reduce: calculate the rank of each page
  – The skew in intermediate data exists because of the huge disparity in the number of incoming links across pages on the Internet
  – The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
  – For instance, Facebook has 49,376,609 incoming links (at the time of writing) while the personal webpage of the presenter has only 4
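The map/reduce split above can be sketched as one PageRank iteration over a toy link graph (illustrative names; the damping factor 0.85 is the conventional choice, not taken from the slides). A heavily linked "hub" page shows where the intermediate skew comes from: its reduce key receives one contribution per inlink.

```python
from collections import defaultdict

# Toy link graph: page -> outlinks. "hub" has many inlinks, mimicking skew.
graph = {
    "hub": ["a"],
    "a":   ["hub"],
    "b":   ["hub"],
    "c":   ["hub"],
    "d":   ["hub", "a"],
}
ranks = {page: 1.0 / len(graph) for page in graph}

def map_phase(graph, ranks):
    """Map: each page distributes its current rank evenly over its outlinks."""
    for page, outlinks in graph.items():
        for target in outlinks:
            yield target, ranks[page] / len(outlinks)

def reduce_phase(pairs, damping=0.85, n=len(graph)):
    """Reduce: sum the contributions per page and apply the damping factor.
    Note: pages with no inlinks produce no reduce key in this sketch."""
    sums = defaultdict(float)
    for page, contrib in pairs:
        sums[page] += contrib
    return {p: (1 - damping) / n + damping * s for p, s in sums.items()}

new_ranks = reduce_phase(map_phase(graph, ranks))
```

Here the reducer for "hub" must aggregate four contributions while the reducer for "a" aggregates two; at web scale that disparity spans many orders of magnitude.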
• Zipf distributions are everywhere
  – Also followed by inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
  – P2P systems have Zipf distributions too, both in terms of users and content
  – So do web caching schemes, as well as email and social networks
• Section 3: Heterogeneous Environment
• Introduction
  – In the MapReduce model, tasks which take exceptionally long are labelled “stragglers”
  – The framework launches a speculative copy of each straggler on another machine, expecting it to finish more quickly
  – Without this, the overall job completion time is dictated by the slowest straggler
  – On Google clusters, speculative execution can reduce job completion time by 44%
• Hadoop’s assumptions regarding speculation
  1. All nodes are equal, i.e. they can perform work at more or less the same rate
  2. Tasks make progress at a constant rate throughout their lifetime
  3. There is no cost to launching a speculative copy on an otherwise idle slot/node
  4. The progress score of a task captures the fraction of its total work that it has done; specifically, the shuffle, merge, and reduce phases each take roughly 1/3 of the total time
  5. As tasks finish in waves, a task with a low progress score is most likely a straggler
  6. Tasks within the same phase require roughly the same amount of work
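To make these assumptions concrete, here is a sketch of the progress-score-based speculation rule described in the Zaharia et al. paper cited in the references: pre-LATE Hadoop flags a task whose progress score falls a fixed threshold (0.2) below the average of its category. The function name and the sample scores are illustrative, not Hadoop API.

```python
def find_stragglers(progress_scores, threshold=0.2):
    """Flag tasks whose progress score is more than `threshold` below the
    category average. Under assumptions 1-6 this picks out genuinely slow
    tasks; under heterogeneity or skew it can mislabel a healthy task on
    slow hardware, or a task that simply has more work to do."""
    avg = sum(progress_scores.values()) / len(progress_scores)
    return [task for task, p in progress_scores.items() if p < avg - threshold]

scores = {"task-0": 0.9, "task-1": 0.85, "task-2": 0.3, "task-3": 0.88}
print(find_stragglers(scores))  # ['task-2']
```

Each of the breakdowns on the following slides corresponds to a way this rule misfires: the score stops tracking remaining *time* whenever rates, phases, or per-task workloads differ.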
• Assumptions 1 and 2
  1. All nodes are equal, i.e. they can perform work at more or less the same rate
  2. Tasks make progress at a constant rate throughout their lifetime
  – Both break down in heterogeneous environments, which consist of multiple generations of hardware
• Assumption 3
  3. There is no cost to launching a speculative copy on an otherwise idle slot/node
  – Breaks down due to shared resources
• Assumption 4
  4. The progress score of a task captures the fraction of its total work that it has done; specifically, the shuffle, merge, and reduce phases each take roughly 1/3 of the total time
  – Breaks down because in reduce tasks the shuffle phase takes far longer than the other two
• Assumption 5
  5. As tasks finish in waves, a task with a low progress score is most likely a straggler
  – Breaks down because uneven workloads spread task completion out over time instead of in waves
• Assumption 6
  6. Tasks within the same phase require roughly the same amount of work
  – Breaks down due to data skew
• Section 4: Low-level Programming Interface
• Introduction
  – The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
  – Hacks need to be put in place for different data flows, such as joins or multiple stages
  – Custom code has to be written for common DB operations, such as projection and filtering
  – The opaque nature of the map and reduce functions makes it impossible to perform optimizations such as operator reordering
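As an example of such a hack, a relational join has no direct expression in the model; the standard workaround is a "reduce-side join", where map tags each record with its source relation and reduce pairs the tagged records up per key. A minimal sketch in plain Python (table contents and function names are illustrative):

```python
from collections import defaultdict

# Two "tables" keyed by user ID.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

def map_join(users, orders):
    """Map: tag each record with its source relation ('U' or 'O')."""
    for uid, name in users:
        yield uid, ("U", name)
    for uid, item in orders:
        yield uid, ("O", item)

def reduce_join(pairs):
    """Reduce: group tagged records by key, then cross the two sides."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    for uid, tagged in groups.items():
        names = [v for tag, v in tagged if tag == "U"]
        items = [v for tag, v in tagged if tag == "O"]
        for name in names:
            for item in items:
                yield uid, name, item

result = sorted(reduce_join(map_join(users, orders)))
# [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

What a database expresses as one declarative `JOIN` (and can reorder, index, or push filters through) becomes opaque user code here, which is exactly the optimization barrier the slide describes. Systems such as Pig (reference 3) were built to restore the declarative layer on top of MapReduce.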
• Section 5: Strictly Batch-processing
• Introduction
  – In MapReduce, the entire output of a map or reduce task has to be materialized to local storage before the next stage can commence
  – This simplifies fault tolerance
  – Reducers have to pull their input instead of the mappers pushing it
  – This negates pipelining, result estimation, and continuous queries (stream processing)
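The materialization barrier can be illustrated with generators (a toy sketch with illustrative names; real systems materialize to local disk, not a list). In the batch version the reducer sees nothing until the full map output exists; in the pipelined version, roughly the MapReduce Online idea from reference 4, it consumes pairs as they are emitted.

```python
def mapper(words):
    for w in words:
        yield (w, 1)

def batch_run(words):
    """MapReduce-style: the full map output is materialized (a list standing
    in for local disk) before the reducer pulls any of it."""
    materialized = list(mapper(words))   # barrier: blocks until map finishes
    counts = {}
    for w, n in materialized:
        counts[w] = counts.get(w, 0) + n
    return counts

def pipelined_run(words):
    """Pipelined alternative: the reducer consumes pairs as the mapper emits
    them, which is what enables early estimates and stream processing."""
    counts = {}
    for w, n in mapper(words):           # no barrier; lazy generator
        counts[w] = counts.get(w, 0) + n
    return counts

final = batch_run(["a", "b", "a"])
```

Both versions produce the same final answer; the difference is *when* partial results become visible, and whether an unbounded (streaming) input is possible at all.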
• Section 6: Single-input/single-output and Two-phase
• Introduction
  1. Not all applications can be broken down into just two phases, for example complex SQL-like queries
  2. Tasks take in just one input and produce one output
• Section 7: Iterative and Recursive Applications
• Introduction
  1. Hadoop is widely employed for iterative computations
  2. For machine learning applications, the Apache Mahout library is used atop Hadoop
  3. Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test
  4. There is no fault tolerance across iterations, and every iteration pays the overhead of job submission
  5. Loop-invariant data is materialized to storage on every iteration
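The external-driver pattern in point 3 can be sketched as follows (a toy fixed-point update stands in for a real Hadoop job; all names are illustrative, not Mahout's API). The point is structural: the convergence loop lives *outside* the framework, so each submission is an unrelated job to Hadoop.

```python
def run_job(state):
    """Stand-in for one full MapReduce job submission. A real job would pay
    JVM/scheduling startup costs and re-read loop-invariant data from HDFS
    on every call. Here: a toy contraction with fixed point 1.0."""
    return {k: 0.5 * v + 0.5 for k, v in state.items()}

def driver(state, tol=1e-6, max_iters=100):
    """External driver, Mahout-style: resubmit the job until a convergence
    test passes. No fault tolerance spans iterations -- if the driver dies,
    the loop's progress is lost even though each job's output survives."""
    for i in range(max_iters):
        new_state = run_job(state)
        delta = max(abs(new_state[k] - state[k]) for k in state)
        state = new_state
        if delta < tol:
            return state, i + 1
    return state, max_iters

state, iters = driver({"a": 0.0, "b": 2.0})
```

Dedicated iterative frameworks (e.g. CIEL in reference 6) move this loop inside the system, so invariant data can be cached and the iteration itself is fault tolerant.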
• Section 8: Incremental Computation
• Introduction
  1. Most workloads processed by MapReduce are incremental in nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
  2. For instance, most iterations of PageRank run with very small modifications
  3. Unfortunately, even with a small change in the input, MapReduce re-performs the entire computation
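The alternative, memoizing map outputs by input-chunk content so that only changed chunks are reprocessed, is the core idea behind Incoop (reference 7). A minimal sketch of that idea in plain Python (illustrative names; not Incoop's actual implementation, which works at the HDFS and task level):

```python
import hashlib

cache = {}  # content hash of an input chunk -> its memoized map output

def map_chunk(chunk):
    """Toy map function: word counts for one input chunk."""
    counts = {}
    for w in chunk.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def incremental_map(chunks):
    """Re-run map only on chunks whose content changed since the last run;
    reuse cached output for everything else."""
    recomputed = 0
    outputs = []
    for chunk in chunks:
        key = hashlib.sha256(chunk.encode()).hexdigest()
        if key not in cache:
            cache[key] = map_chunk(chunk)
            recomputed += 1
        outputs.append(cache[key])
    return outputs, recomputed

_, first = incremental_map(["a b a", "c d"])   # cold run: both chunks computed
_, second = incremental_map(["a b a", "c e"])  # only the changed chunk re-runs
```

With plain MapReduce, both runs would cost the same; with content-based memoization, the second run's cost is proportional to the size of the change.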
• References
  1. MapReduce: A major step backwards: http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
  2. Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI ’08). USENIX Association, Berkeley, CA, USA, 29–42.
  3. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1099–1110.
  4. Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI ’10). USENIX Association, Berkeley, CA, USA.
  5. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys ’07). ACM, New York, NY, USA, 59–72.
  6. Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
  7. Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC ’11). ACM, New York, NY, USA.