Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012, University of California, Berkeley
OUTLINE 
• Introduction 
• Resilient Distributed Datasets (RDDs) 
• Representing RDDs 
• Evaluation 
• Conclusion
Introduction 
Cluster computing frameworks like MapReduce perform poorly on iterative machine learning and graph algorithms because of data replication, disk I/O, and serialization overheads.
Introduction 
Pregel is a system for iterative graph computations that 
keeps intermediate data in memory, while HaLoop 
offers an iterative MapReduce interface. 
However, these systems support only specific computation patterns; they do not provide abstractions for more general data reuse.
Introduction 
RDDs define a programming interface that provides fault tolerance efficiently.
RDDs vs. distributed shared memory: 
• RDDs: coarse-grained transformations (e.g., map, filter, and join), recovery via lineage 
• Distributed shared memory: fine-grained updates to mutable state
Resilient Distributed 
Datasets (RDDs) 
An RDD's transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
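The lazy-transformation vs. eager-action split can be sketched with a toy model (plain Python, not Spark's API; the ToyRDD class and its methods are invented here for illustration):

```python
# Toy model of lazy transformations vs. eager actions (not Spark itself).
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # deferred function producing the records

    def filter(self, pred):
        # Transformation: returns a new ToyRDD; computes nothing yet.
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def count(self):
        # Action: runs the whole deferred pipeline and returns a value.
        return len(self._compute())

lines = ToyRDD(lambda: ["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # nothing computed yet
print(errors.count())  # the action triggers the pipeline: prints 2
```

Only the call to count() touches the data; building errors just records the pending filter, mirroring how Spark defers work until an action runs.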
Resilient Distributed 
Datasets (RDDs)
Resilient Distributed 
Datasets (RDDs) 
An RDD is a read-only, partitioned collection of records that can be created only through (1) data in stable storage or (2) transformations on other RDDs.
lines = spark.textFile("hdfs://...") 
errors = lines.filter(_.startsWith("ERROR")) 
errors.count()
Resilient Distributed 
Datasets (RDDs) 
RDD1: lines = spark.textFile("hdfs://...") 
RDD2: errors = lines.filter(_.startsWith("ERROR")) 
Long: number = errors.count() 
RDD1 → RDD2 → Long 
transformation, then action
Resilient Distributed 
Datasets (RDDs) 
DEMO
Resilient Distributed 
Datasets (RDDs) 
RDD1: lines = spark.textFile("hdfs://...") 
RDD2: errors = lines.filter(_.startsWith("ERROR")) 
RDD3: error = errors.persist() // or errors.cache() 
RDD3 (error) will be kept in memory.
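The effect of persist()/cache() can be mimicked in the same toy spirit (a Python sketch, not Spark's implementation; all names here are invented): after the first action, a persisted RDD's results stay in memory and are not recomputed.

```python
# Toy sketch of persist(): after the first action, results stay in memory.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cached = None

    def persist(self):             # mark for caching; still lazy
        self._persisted = True
        return self

    def collect(self):             # action: compute, caching if requested
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result

calls = []
errors = ToyRDD(lambda: calls.append(1) or ["ERROR a", "ERROR b"]).persist()
errors.collect()
errors.collect()
print(len(calls))  # the underlying compute ran only once: prints 1
```

Without persist(), each action would rerun the compute function; with it, later actions reuse the in-memory copy.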
Resilient Distributed 
Datasets (RDDs) 
Lineage enables fault tolerance: 
if RDD2 is lost, 
RDD1 → RDD2 → Long (transformation, then action) 
Spark re-applies the transformation to RDD1 to recompute the lost RDD2.
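The recovery idea can be sketched as follows: each RDD records its parent and the transformation that produced it (its lineage), so lost data is rebuilt by re-applying that transformation (a toy Python model, not Spark's scheduler; class and field names are invented):

```python
# Toy lineage: an RDD remembers how it was derived, so lost data is recomputable.
class LineageRDD:
    def __init__(self, parent, transform):
        self.parent = parent        # None for an RDD read from stable storage
        self.transform = transform  # function applied to the parent's records
        self.data = None            # in-memory copy; may be lost on failure

    def materialize(self):
        if self.data is None:       # lost (or never computed): rebuild via lineage
            source = self.parent.materialize() if self.parent else []
            self.data = self.transform(source)
        return self.data

rdd1 = LineageRDD(None, lambda _: ["ERROR x", "INFO y"])
rdd2 = LineageRDD(rdd1, lambda rs: [r for r in rs if r.startswith("ERROR")])
rdd2.materialize()
rdd2.data = None                    # simulate losing RDD2 on a failed node
print(rdd2.materialize())           # rebuilt from RDD1: prints ['ERROR x']
```

Because the lineage is small (a graph of coarse-grained transformations, not the data itself), this is much cheaper to log than replicating the data.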
Resilient Distributed 
Datasets (RDDs) 
Spark provides the RDD abstraction through a language-integrated API in Scala, a functional programming language for the Java VM.
Representing RDDs 
Dependencies between RDDs: 
• Narrow dependencies: allow for pipelined execution on one cluster node 
• Wide dependencies: require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation
Representing RDDs 
(Figure: narrow dependencies are computed within the same node; wide dependencies move data between different nodes.)
Representing RDDs 
How Spark computes job stages. 
(Figure legend: partition; RDD; RDD already in memory.)
Representing RDDs 
Each stage contains as many pipelined transformations with narrow dependencies as possible, because this avoids shuffling data across the nodes.
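Stage formation can be sketched as: walk the operations in order and cut a new stage at each wide (shuffle) dependency, pipelining consecutive narrow ones. The toy below handles only a linear chain, not the general DAG Spark's scheduler works on, and the operation names are just examples:

```python
# Toy stage builder: consecutive narrow dependencies are pipelined into one
# stage; each wide (shuffle) dependency starts a new stage.
def build_stages(ops):
    """ops: list of (name, kind) pairs in execution order, kind in {'narrow', 'wide'}."""
    stages, current = [], []
    for name, kind in ops:
        if kind == "wide" and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", "narrow"), ("filter", "narrow"),
       ("groupByKey", "wide"), ("mapValues", "narrow")]
print(build_stages(ops))  # [['map', 'filter'], ['groupByKey', 'mapValues']]
```

The map and filter steps run back-to-back on each partition without materializing intermediate data; only the shuffle forces a stage boundary.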
Evaluation 
Setup: Amazon EC2 m1.xlarge nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks.
Evaluation 
Iterative ML workloads: logistic regression and k-means, 10 iterations on 100 GB datasets using 25–100 machines. 
Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Evaluation 
HadoopBinMem: a Hadoop baseline that converts the input data to a binary format and stores it in memory.
Evaluation 
PageRank: 54 GB Wikipedia dump, 4 million articles, 10 iterations.
Evaluation 
PageRank, 10 iterations.
Evaluation 
Fault recovery: k-means on 100 GB of data across 75 nodes, 10 iterations, with one node failing at the start of the 6th iteration.
Evaluation 
k-means, 100 GB of data, 75 nodes, 10 iterations.
Evaluation 
Behavior with Insufficient Memory 
logistic regression, 100 GB of data, 25 machines.
Evaluation 
k-means, 100 GB of data, 25 machines.
Conclusion 
RDDs: an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications. 
RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage. 
Spark vs. Hadoop: up to 20× faster in iterative applications, and it can be used interactively to query hundreds of gigabytes of data.