Distributed Processing Frameworks
Author: Antonios Katsarakis
Literature
• MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI '04.
• Spark: Cluster Computing with Working Sets. Matei Zaharia et al. HotCloud '10.
Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with the increasing amount of data to process
• This leads us to the Big Data era
- Big data: data sets so large that the processing power of a single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
MapReduce
• Proposed by Jeffrey Dean and Sanjay Ghemawat (Google) in 2004
- Cited more than 18,000 times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce Program:
- Read Data
- Map: filtering of the data
- Shuffle and sort
- Reduce: summary operation on data
- Write the Results
[Figure: MapReduce dataflow. The input data is split into chunks, each processed by a Map task; the intermediate data is shuffled and sorted to the Reduce tasks, which write the output data.]
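To make the steps above concrete, here is a minimal single-machine sketch of word count in the MapReduce style, written in plain Scala (the object name and sample input are illustrative; it emulates the map, shuffle-and-sort, and reduce phases with local collections, not Google's actual framework):

object WordCountSketch {
  // Map: emit a (word, 1) pair for every word in a line.
  def mapFn(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Reduce: sum all counts emitted for one word.
  def reduceFn(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("the quick brown fox", "the lazy dog") // stand-in for the input chunks
    val result = input
      .flatMap(mapFn)   // map phase
      .groupBy(_._1)    // shuffle and sort: group pairs by key
      .map { case (word, pairs) => reduceFn(word, pairs.map(_._2)) } // reduce phase
    result.foreach(println) // write the results
  }
}

In a real deployment the map and reduce calls run on different machines and the grouping happens over the network, which is exactly where the shuffle overhead discussed below comes from.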
Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of computations
• Limitations:
- Restricted to batch processing
- It only supports map and reduce operations
- The shuffling phase introduces overheads
Spark
• Proposed by Matei Zaharia et al. in 2010
- Cited about 1,500 times
• Another programming model, based on higher-order functions that execute user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
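Both ideas can be sketched with the RDD API introduced on the next slide (a minimal sketch in Scala; the file name and object name are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("caching-sketch").setMaster("local[*]"))

    // Transformations are lazy: these lines only extend the DAG, nothing runs yet.
    val nums = sc.textFile("numbers.txt") // hypothetical input file
      .map(_.trim.toDouble)
      .cache()                            // keep the parsed dataset in memory

    // Three actions launch three jobs, but the file is read and parsed only once.
    Seq(0.25, 0.5, 0.75).foreach { t =>
      println(s"above $t: " + nums.filter(_ > t).count())
    }

    sc.stop()
  }
}

Without cache(), every count() would re-read and re-parse the file, which is essentially what an equivalent chain of MapReduce jobs would do.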
Spark Model
• Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
• Operations over RDDs:
1. Transformations: lazy operators that create new RDDs
2. Actions: launch a computation on an RDD
Example pipeline (from the slide):

var count = readFile(…)
  .map(…)
  .filter(…)
  .reduceByKey()
  .count()

[Figure: Job (RDD) graph. The input file is split into chunks (RDD0); map, filter, and reduceByKey create RDD1 through RDD4; the pipelined transformations form Stage 1, the shuffle for reduceByKey starts Stage 2, and count() produces the result.]
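For comparison, here is a concrete, runnable version of that pipeline as a word count (the input path and object name are hypothetical, and the exact operator chain is one possible reading of the slide's snippet):

import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

    val count = sc.textFile("input.txt")  // RDD0: file split into chunks
      .flatMap(_.split("\\s+"))           // transformation (lazy)
      .filter(_.nonEmpty)                 // transformation, pipelined into Stage 1
      .map(word => (word, 1))             // transformation, still Stage 1
      .reduceByKey(_ + _)                 // shuffle boundary: starts Stage 2
      .count()                            // action: triggers the whole job

    println(s"distinct words: $count")
    sc.stop()
  }
}

Only count() starts any work; the four transformations before it just build the RDD graph, and Spark pipelines the map-side ones into a single stage.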
Critical Reflection
• Benefits:
- High-level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis at the thread level is hard
- Multipurpose application support makes performance improvement and tuning really challenging
- The shuffling phase introduces overheads
Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance

Editor's Notes

• #9 (Critical Reflection on Spark): high-level API (in Scala, Java, Python), usable by non-computer scientists; supports more application types (streaming, iterative, and interactive); performance optimizations (memory caching, transformation pipelining, etc.)
• #10 (Conclusion): Spark tries to overcome the limitations of MapReduce in three respects: performance, application support, and user friendliness