Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Mahdi Esmailoghli
Dr. Bagheri
Amirkabir University of Technology
SPARK
A big-data processing framework.
http://BigData.ceit.aut.ac.ir
“What is Big Data?”
Dealing with Big Data
Sampling
Hashing
Approximation methods
The Map-Reduce model
…
Map-Reduce Model
A big-data programming model
Offers high potential for parallelism
Can be executed on a cluster
Map-Reduce Model
[Diagram: Input → Map tasks → Reduce tasks → Output]
Problems with current computing frameworks, especially Map-Reduce
They provide an abstraction for accessing a cluster's computational resources,
but lack an abstraction for distributed memory.
Problems with current computing frameworks, especially Map-Reduce
This makes them inefficient for applications that reuse intermediate
results across multiple computations.
SPARK Motivation
Current frameworks (e.g., Map-Reduce) handle two kinds of workloads poorly:
Iterative algorithms
Interactive data mining tools
Data reuse examples
Iterative machine learning and graph algorithms:
PageRank
K-means clustering
Logistic regression
Data reuse examples
Interactive data mining (running multiple queries on the same subset of the data):
Statistical queries
Fraud detection
Stream queries
Current Solution
With current frameworks, the only way to reuse data between computations
is to write it to an external stable storage system.
Map-Reduce Model
[Diagram: Input → Map tasks → Reduce tasks → Output, with intermediate results written to stable storage between jobs]
Systems developed for reusing intermediate data
Pregel: iterative graph computation
HaLoop: an iterative MapReduce interface
Systems developed for reusing intermediate data
These support only specific computation patterns;
we need an abstraction for more general reuse.
RDD
Resilient Distributed Dataset
RDD
A read-only, partitioned collection of records.
Can be created only from data in stable storage or from
other RDDs (through transformations).
Users can control persistence and partitioning.
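A minimal sketch of these creation paths in Scala, assuming a SparkContext named sc; the HDFS paths are placeholders:

// Create an RDD from stable storage (path is a placeholder):
val lines = sc.textFile("hdfs://...")

// Derive a new RDD from an existing one through a transformation:
val upper = lines.map(_.toUpperCase)

// User-controlled partitioning and persistence:
val repartitioned = upper.repartition(8) // control the number of partitions
repartitioned.persist()                  // ask SPARK to keep it for reuse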
RDD
Enables efficient data reuse
A parallel data structure
Lets users explicitly persist intermediate results
Supports in-memory computation on large clusters in a fault-tolerant manner
Existing fault-tolerance approaches
Replicating data across machines
Logging updates across machines
Both are expensive for data-intensive workloads; RDDs instead recover through their lineage of transformations.
Two main interfaces for RDDs
[Diagram: an RDD is manipulated through Transformations and Actions]

Transformations
e.g., map, filter, and join
The interface used for fault tolerance in RDDs
Actions
Return a value to the program, e.g., count, collect, and save
SPARK computes RDDs lazily (which helps pipelining); work runs only when an action is called
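A small sketch of this laziness in Scala, assuming a SparkContext named sc and placeholder HDFS paths:

// Transformation: only lineage is recorded here; no cluster work runs yet.
val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR"))

// Actions trigger the (pipelined) computation and return values:
val total = errors.count()               // count the matching lines
val sample = errors.take(10)             // collect a small sample
errors.saveAsTextFile("hdfs://.../out")  // save (output path is a placeholder)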
RDDs can express other models
Map-Reduce
SQL
Pregel
HaLoop
…
For example, a Map-Reduce word count maps directly onto RDD operators, as sketched below.
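A minimal word-count sketch, again assuming a SparkContext named sc and placeholder paths:

val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))       // "map" phase: emit one word per record
  .map(word => (word, 1))      // key each word with a count of 1
  .reduceByKey(_ + _)          // "reduce" phase: sum the counts per word
counts.saveAsTextFile("hdfs://.../counts")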
rdd1.join(rdd2)
  .groupBy(…)
  .filter(…)

[Diagram: the resulting join → groupBy → filter lineage is handed to the task scheduler; tasks are executed by the workers, which return the results]
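A runnable sketch of such a chain, with hypothetical pair RDDs (join operates on key-value RDDs; sc is an assumed SparkContext):

// join works on RDDs of (key, value) pairs:
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y")))

val result = rdd1.join(rdd2)                 // (key, (left, right)) pairs
  .groupBy { case (k, _) => k % 2 }          // group by an example key function
  .filter { case (group, _) => group == 1 }  // keep one group

result.collect().foreach(println)            // action: runs the whole pipeline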
SPARK Runtime
[Diagram: the Driver sends Tasks to the Workers; each Worker reads Input Data into RAM and returns Results to the Driver]
An Example

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).count()
An Example
[Lineage graph: lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → HDFS errors → map(_.split('\t')(3)) → field 3]
The persist function
Indicates which RDDs we want to reuse in future actions.
Other persistence strategies are available, such as:
Storing the RDD only on disk
Replicating it across machines
Users can also set persistence priorities on RDDs.
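A sketch of these strategies using the storage-level flags from the Spark API (sc is an assumed SparkContext; paths are placeholders):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://...").filter(_.startsWith("ERROR"))
errors.persist(StorageLevel.DISK_ONLY)       // store the RDD only on disk

// A storage level can be assigned to an RDD only once, so a second
// strategy is shown on a second RDD:
val warnings = sc.textFile("hdfs://...").filter(_.startsWith("WARN"))
warnings.persist(StorageLevel.MEMORY_ONLY_2) // in memory, replicated on 2 nodes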
SPARK
RDDs have been implemented in a system called SPARK, written in the Scala language.
What benchmarks show about SPARK
Up to 20× faster than Hadoop for iterative applications
Can scan a 1 TB dataset with 5-7 s latency
Evaluation: Logistic Regression
100 GB of data on 100 nodes; iteration time in seconds:

                   HADOOP   HADOOPBM   SPARK
First iteration      80       139        46
Later iterations     76        62         3
Evaluation: K-means
100 GB of data on 100 nodes; iteration time in seconds:

                   HADOOP   HADOOPBM   SPARK
First iteration     115       182        82
Later iterations    106        87        33
SPARK STACK
[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX run on the Apache Spark core, which runs over a distributed file system, e.g., HDFS or GlusterFS]
SPARK won the Daytona GraySort Contest 2014
Spark officially set a new record in large-scale sorting.
Spark: the fastest open-source engine for sorting

                          HADOOP MR             SPARK              SPARK (1 PB)*
Data size                 102.5 TB              100 TB             1000 TB
Elapsed time              72 min                23 min             234 min
# Nodes                   2,100                 206                190
# Cores                   50,400 physical       6,592 virtualized  6,080 virtualized
Cluster disk throughput   3,150 GB/s            618 GB/s           570 GB/s
Environment               dedicated datacenter  EC2 (i2.8xlarge)   EC2 (i2.8xlarge)
Sort rate                 1.42 TB/min           4.27 TB/min        4.27 TB/min

* Without using Spark's in-memory cache
Current committers
SPARK > HADOOP MR
References

Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.

Zaharia, Matei, et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.

http://Spark.apache.org
https://databricks.com
Thank you
