Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012, University of California, Berkeley
OUTLINE 
• Introduction 
• Resilient Distributed Datasets (RDDs) 
• Representing RDDs 
• Evaluation 
• Conclusion
Introduction 
Cluster computing frameworks like MapReduce perform poorly on iterative machine learning and graph algorithms because of data replication, disk I/O, and serialization overheads.
Introduction 
Pregel is a system for iterative graph computations that 
keeps intermediate data in memory, while HaLoop 
offers an iterative MapReduce interface. 
However, these systems support only specific computation patterns; they do not provide abstractions for more general data reuse.
Introduction 
RDDs define a programming interface that provides fault tolerance efficiently.
RDDs vs. distributed shared memory: 
• RDDs: coarse-grained transformations (e.g., map, filter, and join), recovery via lineage 
• Distributed shared memory: fine-grained updates to mutable state
Resilient Distributed 
Datasets (RDDs) 
An RDD's transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.
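The lazy-transformation vs. eager-action split can be sketched with a toy model (plain Python, not Spark's API; the ToyRDD class and its methods are invented here for illustration):

```python
# Toy model of lazy transformations vs. eager actions (not Spark itself).
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # deferred function producing the records

    def filter(self, pred):
        # Transformation: returns a new ToyRDD; computes nothing yet.
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def count(self):
        # Action: runs the whole deferred pipeline and returns a value.
        return len(self._compute())

lines = ToyRDD(lambda: ["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # nothing computed yet
print(errors.count())  # the action triggers the pipeline: prints 2
```

Only the call to count() touches the data; building errors just records the pending filter, mirroring how Spark defers work until an action runs.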
Resilient Distributed 
Datasets (RDDs)
Resilient Distributed 
Datasets (RDDs) 
An RDD is a read-only, partitioned collection of records that can be created only through (1) data in stable storage or (2) transformations on other RDDs.
lines = spark.textFile("hdfs://...") 
errors = lines.filter(_.startsWith("ERROR")) 
errors.count()
Resilient Distributed 
Datasets (RDDs) 
RDD1: lines = spark.textFile("hdfs://...") 
RDD2: errors = lines.filter(_.startsWith("ERROR")) 
Long: number = errors.count() 
RDD1 → RDD2 → Long 
transformation, then action
Resilient Distributed 
Datasets (RDDs) 
DEMO
Resilient Distributed 
Datasets (RDDs) 
RDD1: lines = spark.textFile("hdfs://...") 
RDD2: errors = lines.filter(_.startsWith("ERROR")) 
RDD3: error = errors.persist() // or errors.cache() 
RDD3 (error) will be kept in memory.
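The effect of persist()/cache() can be mimicked in the same toy spirit (a Python sketch, not Spark's implementation; all names here are invented): after the first action, a persisted RDD's results stay in memory and are not recomputed.

```python
# Toy sketch of persist(): after the first action, results stay in memory.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute
        self._persisted = False
        self._cached = None

    def persist(self):             # mark for caching; still lazy
        self._persisted = True
        return self

    def collect(self):             # action: compute, caching if requested
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result

calls = []
errors = ToyRDD(lambda: calls.append(1) or ["ERROR a", "ERROR b"]).persist()
errors.collect()
errors.collect()
print(len(calls))  # the underlying compute ran only once: prints 1
```

Without persist(), each action would rerun the compute function; with it, later actions reuse the in-memory copy.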
Resilient Distributed 
Datasets (RDDs) 
Lineage enables fault tolerance: 
if RDD2 is lost, 
RDD1 → RDD2 → Long (transformation, then action) 
Spark re-applies the transformation to RDD1 to recompute the lost RDD2.
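The recovery idea can be sketched as follows: each RDD records its parent and the transformation that produced it (its lineage), so lost data is rebuilt by re-applying that transformation (a toy Python model, not Spark's scheduler; class and field names are invented):

```python
# Toy lineage: an RDD remembers how it was derived, so lost data is recomputable.
class LineageRDD:
    def __init__(self, parent, transform):
        self.parent = parent        # None for an RDD read from stable storage
        self.transform = transform  # function applied to the parent's records
        self.data = None            # in-memory copy; may be lost on failure

    def materialize(self):
        if self.data is None:       # lost (or never computed): rebuild via lineage
            source = self.parent.materialize() if self.parent else []
            self.data = self.transform(source)
        return self.data

rdd1 = LineageRDD(None, lambda _: ["ERROR x", "INFO y"])
rdd2 = LineageRDD(rdd1, lambda rs: [r for r in rs if r.startswith("ERROR")])
rdd2.materialize()
rdd2.data = None                    # simulate losing RDD2 on a failed node
print(rdd2.materialize())           # rebuilt from RDD1: prints ['ERROR x']
```

Because the lineage is small (a graph of coarse-grained transformations, not the data itself), this is much cheaper to log than replicating the data.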
Resilient Distributed 
Datasets (RDDs) 
Spark provides the RDD abstraction through a language-integrated API in Scala, a functional programming language for the Java VM.
Representing RDDs 
Dependencies between RDDs: 
• Narrow dependencies: allow for pipelined execution on one cluster node 
• Wide dependencies: require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation
Representing RDDs 
(Figure: narrow dependencies are computed within the same node; wide dependencies move data between different nodes.)
Representing RDDs 
How Spark computes job stages. 
(Figure legend: partition; RDD; RDD already in memory.)
Representing RDDs 
Each stage contains as many pipelined transformations with narrow dependencies as possible, because this avoids shuffling data across the nodes.
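Stage formation can be sketched as: walk the operations in order and cut a new stage at each wide (shuffle) dependency, pipelining consecutive narrow ones. The toy below handles only a linear chain, not the general DAG Spark's scheduler works on, and the operation names are just examples:

```python
# Toy stage builder: consecutive narrow dependencies are pipelined into one
# stage; each wide (shuffle) dependency starts a new stage.
def build_stages(ops):
    """ops: list of (name, kind) pairs in execution order, kind in {'narrow', 'wide'}."""
    stages, current = [], []
    for name, kind in ops:
        if kind == "wide" and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

ops = [("map", "narrow"), ("filter", "narrow"),
       ("groupByKey", "wide"), ("mapValues", "narrow")]
print(build_stages(ops))  # [['map', 'filter'], ['groupByKey', 'mapValues']]
```

The map and filter steps run back-to-back on each partition without materializing intermediate data; only the shuffle forces a stage boundary.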
Evaluation 
Setup: Amazon EC2 m1.xlarge nodes with 4 cores and 15 GB of RAM. We used HDFS for storage, with 256 MB blocks.
Evaluation 
Iterative ML workloads: logistic regression and k-means, 10 iterations on 100 GB datasets using 25–100 machines. 
Logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.
Evaluation 
HadoopBinMem: a Hadoop baseline that converts the input data to a binary format and stores it in memory.
Evaluation 
PageRank: 54 GB Wikipedia dump, 4 million articles, 10 iterations.
Evaluation 
PageRank, 10 iterations.
Evaluation 
Fault recovery: k-means on 100 GB of data across 75 nodes, 10 iterations, with one node failing at the start of the 6th iteration.
Evaluation 
k-means, 100 GB of data, 75 nodes, 10 iterations.
Evaluation 
Behavior with Insufficient Memory 
logistic regression, 100 GB of data, 25 machines.
Evaluation 
k-means, 100 GB of data, 25 machines.
Conclusion 
RDDs: an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications. 
RDDs offer an API based on coarse-grained transformations that lets them recover data efficiently using lineage. 
Spark vs. Hadoop: up to 20× faster in iterative applications, and it can be used interactively to query hundreds of gigabytes of data.