Distributed Processing Frameworks
Author: Antonios Katsarakis
Literature
• MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI '04.
• Spark: Cluster Computing with Working Sets. Matei Zaharia et al. HotCloud '10.
Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with the increasing amount of data to process
• This leads us to the Big Data era
- Big data: data sets so large that the processing power of a single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
MapReduce
• Proposed by Jeffrey Dean and Sanjay Ghemawat (Google) in 2004
- Cited more than 18,000 times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce Program:
- Read Data
- Map: filtering of the data
- Shuffle and sort
- Reduce: summary operation on data
- Write the Results
[Figure: MapReduce dataflow. The input data is split into chunks, each processed by a Map task; the intermediate data is shuffled and sorted to the Reduce tasks, which write the output data.]
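To make the steps above concrete, here is a minimal single-machine sketch of word count in the MapReduce style, written in plain Scala (the object name and sample input are illustrative; it emulates the map, shuffle-and-sort, and reduce phases with local collections, not Google's actual framework):

object WordCountSketch {
  // Map: emit a (word, 1) pair for every word in a line.
  def mapFn(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Reduce: sum all counts emitted for one word.
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("the quick brown fox", "the lazy dog") // stand-in for the input chunks
    val result = input
      .flatMap(mapFn)   // map phase
      .groupBy(_._1)    // shuffle and sort: group pairs by key
      .map { case (word, pairs) => reduceFn(word, pairs.map(_._2)) } // reduce phase
    result.foreach(println) // write the results
  }
}

In a real deployment the map and reduce calls run on different machines and the grouping happens over the network, which is exactly where the shuffle overhead discussed below comes from.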
Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of computations
• Limitations:
- Restricted to batch processing
- It only supports map and reduce operations
- The shuffling phase introduces overheads
Spark
• Proposed by Matei Zaharia et al. in 2010
- Cited about 1,500 times
• Another programming model, based on higher-order functions that execute user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
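Both ideas can be sketched with the RDD API introduced on the next slide (a minimal sketch in Scala; the file name and object name are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    // Transformations are lazy: these lines only extend the DAG, nothing runs yet.
    val nums = sc.textFile("numbers.txt") // hypothetical input file
      .map(_.trim.toDouble)
      .cache()                            // keep the parsed dataset in memory

    // Three actions launch three jobs, but the file is read and parsed only once.
    Seq(0.25, 0.5, 0.75).foreach { t =>
      println(s"above $t: " + nums.filter(_ > t).count())
    }

    sc.stop()
  }
}

Without cache(), every count() would re-read and re-parse the file, which is essentially what an equivalent chain of MapReduce jobs would do.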
Spark Model
• Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
• Operations over RDDs:
1. Transformations: lazy operators that create new RDDs
2. Actions: launch a computation on an RDD
Example pipeline (from the slide):

var count = readFile(…)
  .map(…)
  .filter(…)
  .reduceByKey()
  .count()

[Figure: Job (RDD) graph. The input file is split into chunks (RDD0); map, filter, and reduceByKey create RDD1 through RDD4; the pipelined transformations form Stage 1, the shuffle for reduceByKey starts Stage 2, and count() produces the result.]
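For comparison, here is a concrete, runnable version of that pipeline as a word count (the input path and object name are hypothetical, and the exact operator chain is one possible reading of the slide's snippet):

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

    val count = sc.textFile("input.txt")  // RDD0: file split into chunks
      .flatMap(_.split("\\s+"))           // transformation (lazy)
      .filter(_.nonEmpty)                 // transformation, pipelined into Stage 1
      .map(word => (word, 1))             // transformation, still Stage 1
      .reduceByKey(_ + _)                 // shuffle boundary: starts Stage 2
      .count()                            // action: triggers the whole job

    println(s"distinct words: $count")
    sc.stop()
  }
}

Only count() starts any work; the four transformations before it just build the RDD graph, and Spark pipelines the map-side ones into a single stage.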
Critical Reflection
• Benefits:
- High-level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis at the thread level is hard
- Multipurpose application support makes performance improvement and tuning really challenging
- The shuffling phase introduces overheads
Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance

Editor's Notes

• #9 (Critical Reflection on Spark): high-level API (in Scala, Java, Python), usable by non-computer scientists; supports more application types (streaming, iterative, and interactive); performance optimizations (memory caching, transformation pipelining, etc.)
• #10 (Conclusion): Spark tries to overcome the limitations of MapReduce in three respects: performance, application support, and user friendliness