2. Motivation
• RDDs are motivated by two types of applications that current computing
frameworks handle inefficiently:
1. Iterative algorithms:
– iterative machine learning
– graph algorithms
2. Interactive data mining:
– ad hoc queries
• In MapReduce, the only way to share data across jobs is through stable storage,
which is slow!
5. Solution: Resilient Distributed Datasets (RDDs)
• Restricted form of distributed shared memory
– Immutable, partitioned collections of records
– Can only be built through coarse-grained deterministic
transformations (map, filter, join, …)
• Efficient fault recovery using lineage
– Log one operation to apply to many elements
– Recompute lost partitions on failure
– No cost if nothing fails
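The lineage idea above can be illustrated with a minimal, Spark-free sketch. All names here (`LineageSketch`, `Source`, `Mapped`, `Filtered`, `Cached`) are hypothetical and are not part of Spark's API; the point is only that each dataset remembers the single coarse-grained operation that built it, so a lost partition can be recomputed from its parent rather than restored from a replica.

```scala
// Hypothetical sketch of lineage-based recovery; not Spark's implementation.
object LineageSketch {
  // A dataset is either a stable source or one coarse-grained
  // transformation applied to a parent dataset.
  sealed trait Dataset[A] {
    def numPartitions: Int
    // Recompute a single partition from lineage alone.
    def compute(p: Int): Vector[A]
  }

  final case class Source[A](parts: Vector[Vector[A]]) extends Dataset[A] {
    def numPartitions: Int = parts.length
    def compute(p: Int): Vector[A] = parts(p)
  }

  // One logged operation (f) describes how to derive every element.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(p: Int): Vector[B] = parent.compute(p).map(f)
  }

  final case class Filtered[A](parent: Dataset[A], pred: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(p: Int): Vector[A] = parent.compute(p).filter(pred)
  }

  // In-memory copies of materialized partitions; entries may be lost.
  final class Cached[A](ds: Dataset[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    var recomputations = 0

    // Serve from memory when possible, otherwise rebuild from lineage.
    def get(p: Int): Vector[A] =
      cache.getOrElseUpdate(p, { recomputations += 1; ds.compute(p) })

    // Simulate losing one partition in a node failure.
    def lose(p: Int): Unit = { cache.remove(p); () }
  }
}
```

In this sketch, calling `lose(1)` and then `get(1)` recomputes only partition 1; partition 0 stays cached, and nothing extra is paid while no partition is lost.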
6. Solution: Resilient Distributed Datasets (RDDs)
• Allow apps to keep working sets in memory
for efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications
• Let users control each RDD’s partitioning (layout
across nodes) and persistence (storage in
RAM, on disk, etc.)
7. RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey,
flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program):
collect, reduce, count, save, lookupKey
8. Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))        // third tab-separated field
val cachedMsgs = messages.cache()                  // keep the messages in memory
cachedMsgs.filter(_.contains("foo")).count         // action
cachedMsgs.filter(_.contains("bar")).count         // action
. . .
[Figure: the driver sends tasks to three workers and collects their results; each worker reads one block of the file (Block 1 to 3) and keeps its part of cachedMsgs in memory (Cache 1 to 3).]
Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
9. Fault Recovery
• RDDs track the graph of transformations that
built them (their lineage) to rebuild lost data
11. Optimizing Placement
• links & ranks are repeatedly joined
• Can co-partition them (e.g., hash
both on URL) to avoid shuffles
• Can also use app knowledge,
e.g., hash on DNS name
links = links.partitionBy(new URLPartitioner())
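To show why co-partitioning avoids a shuffle, here is a small Spark-free sketch. `HashPart`, `partitionBy`, and `localJoin` are hypothetical stand-ins written for this illustration (they are not Spark's `Partitioner` API, and `HashPart` is not the `URLPartitioner` named on the slide). When both datasets are split with the same partitioner, a key always lands at the same partition index on both sides, so the join can proceed one partition at a time.

```scala
// Hypothetical sketch of co-partitioning; not Spark's implementation.
object CoPartitionSketch {
  // Stand-in for a hash partitioner: maps a key to a partition index.
  final case class HashPart(numPartitions: Int) {
    def partitionOf(key: String): Int = {
      val m = key.hashCode % numPartitions
      if (m < 0) m + numPartitions else m // keep the index non-negative
    }
  }

  // Split key/value pairs into partitions according to the partitioner.
  def partitionBy[V](data: Seq[(String, V)], part: HashPart): Vector[Vector[(String, V)]] =
    Vector.tabulate(part.numPartitions) { i =>
      data.filter { case (k, _) => part.partitionOf(k) == i }.toVector
    }

  // Join two co-partitioned datasets: partition i on the left only ever
  // needs partition i on the right, so no data moves between partitions.
  def localJoin[A, B](
      left: Vector[Vector[(String, A)]],
      right: Vector[Vector[(String, B)]]
  ): Vector[Vector[(String, (A, B))]] =
    left.zip(right).map { case (l, r) =>
      l.flatMap { case (k, a) =>
        r.collect { case (k2, b) if k2 == k => (k, (a, b)) }
      }
    }
}
```

The design choice mirrors the slide: the partitioning is done once up front, and every later join between the two datasets reuses it.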
13. Representing RDDs
• a set of partitions, which are atomic pieces of the
dataset
• a set of dependencies on parent RDDs
• a function for computing the dataset based on its
parents
• metadata about its partitioning scheme
• data placement
04/25/14
14. Representing RDDs
Operation                  Meaning
partitions()               Return a list of Partition objects
preferredLocations(p)      List nodes where partition p can be accessed faster due to data locality
dependencies()             Return a list of dependencies
iterator(p, parentIters)   Compute the elements of partition p given iterators for its parent partitions
partitioner()              Return metadata specifying whether the RDD is hash/range partitioned
Table: Interface used to represent RDDs in Spark
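The interface above could be written down roughly as the following trait. This is a simplified sketch under stated assumptions, not Spark's actual class hierarchy: `SimpleRDD`, `LocalSource`, `MappedRDD`, and `computePartition` are hypothetical names, the partitioner is reduced to an optional label, and only one-to-one (narrow) dependencies are handled.

```scala
// Hypothetical sketch of the five-part RDD interface; not Spark's API.
object RddInterfaceSketch {
  final case class Partition(index: Int)

  trait SimpleRDD[+A] {
    def partitions: Seq[Partition]
    def preferredLocations(p: Partition): Seq[String]
    def dependencies: Seq[SimpleRDD[Any]]
    def iterator(p: Partition, parentIters: Seq[Iterator[Any]]): Iterator[A]
    def partitioner: Option[String] // e.g. Some("hash"), Some("range"), or None
  }

  // A source held in local memory: no parents, no locality hints.
  final class LocalSource[A](chunks: Vector[Vector[A]]) extends SimpleRDD[A] {
    def partitions: Seq[Partition] = chunks.indices.map(i => Partition(i))
    def preferredLocations(p: Partition): Seq[String] = Nil
    def dependencies: Seq[SimpleRDD[Any]] = Nil
    def iterator(p: Partition, parentIters: Seq[Iterator[Any]]): Iterator[A] =
      chunks(p.index).iterator
    def partitioner: Option[String] = None
  }

  // A map over one parent: same partitions, same locality as the parent.
  final class MappedRDD[A, B](parent: SimpleRDD[A], f: A => B) extends SimpleRDD[B] {
    def partitions: Seq[Partition] = parent.partitions
    def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
    def dependencies: Seq[SimpleRDD[Any]] = Seq(parent)
    // The parent iterator is untyped in this sketch, hence the cast.
    def iterator(p: Partition, parentIters: Seq[Iterator[Any]]): Iterator[B] =
      parentIters.head.map(x => f(x.asInstanceOf[A]))
    def partitioner: Option[String] = None
  }

  // Materialize one partition by walking the dependencies. Assumes every
  // dependency is one-to-one, so the parent partition has the same index.
  def computePartition[A](rdd: SimpleRDD[A], p: Partition): Vector[A] = {
    val parentIters: Seq[Iterator[Any]] =
      rdd.dependencies.map(d => computePartition(d, p).iterator)
    rdd.iterator(p, parentIters).toVector
  }
}
```

A new transformation in this sketch only has to say what its partitions and parents are and how to compute one partition from the parents' iterators; it does not need to know how it will be scheduled.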
15. Dependencies
• Narrow dependencies
--- each partition of the parent RDD is used by
at most one partition of the child RDD
• Wide dependencies
--- multiple child partitions may depend on a single parent partition
• For example
--- map leads to a narrow dependency,
--- while join leads to wide dependencies (unless the parents are
hash-partitioned)
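The definition above can be checked mechanically. In this hypothetical sketch (none of these names come from Spark), a dependency is described by listing, for each child partition, the parent partitions it reads; the dependency is narrow when no parent partition appears under more than one child.

```scala
// Hypothetical sketch for classifying a dependency; not Spark's API.
object DependencySketch {
  // childToParents(i) = indices of the parent partitions read by child partition i.
  // Narrow means every parent partition is used by at most one child partition.
  def isNarrow(childToParents: Vector[Set[Int]]): Boolean =
    childToParents.flatten.groupBy(identity).values.forall(_.size <= 1)

  // map-like dependency: child partition i reads only parent partition i.
  def mapDependency(n: Int): Vector[Set[Int]] =
    Vector.tabulate(n)(i => Set(i))

  // shuffle-like dependency: every child partition reads every parent partition.
  def shuffleDependency(numParents: Int, numChildren: Int): Vector[Set[Int]] =
    Vector.fill(numChildren)((0 until numParents).toSet)
}
```

Note that a child partition may read several parent partitions and still be narrow, as long as those parents are not shared with another child.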
17. Narrow vs. Wide Dependencies
• Narrow dependencies
--- allow pipelined execution on one cluster node, which can compute all the
parent partitions
--- make recovery after a node failure more efficient: only the lost parent partitions
need to be recomputed, and they can be recomputed in parallel on different nodes
• Wide dependencies
--- require data from all parent partitions to be available and to be shuffled across
the nodes using a MapReduce-like operation
--- in a lineage graph, a single failed node might cause the loss of some partition
from all the ancestors of an RDD, requiring a complete re-execution
18. Job Scheduler
• Similar to Dryad’s, but takes into account which partitions of persistent
RDDs are available in memory
• When the user runs an action (e.g., count or save) on an RDD, the scheduler
examines that RDD’s lineage graph to build a DAG of stages to execute
• Each stage contains as many pipelined transformations with narrow
dependencies as possible
• The boundaries of the stages are
--- the shuffle operations required for wide dependencies
--- any already computed partitions (which can short-circuit the computation of a
parent RDD)
• The scheduler then launches tasks to compute missing partitions from
each stage until it has computed the target RDD
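The stage-building rule described above can be sketched as a walk over the lineage graph. This is a simplified illustration with hypothetical names (`Node`, `Narrow`, `Wide`, `buildStages`); it tracks whole RDDs rather than individual partitions and treats a cached RDD as fully available in memory, which is coarser than what the scheduler on the slide does.

```scala
// Hypothetical sketch of cutting a lineage graph into stages; not Spark's scheduler.
object StageSketch {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil, cached: Boolean = false)

  // Collect the RDDs pipelined into the same stage as `target`, plus the
  // parents behind a wide dependency, which need stages of their own.
  private def stageOf(target: Node): (Set[String], List[Node]) = {
    var members = Set.empty[String]
    var shuffleParents = List.empty[Node]
    def visit(n: Node): Unit =
      if (!members(n.name)) {
        members += n.name
        // A cached RDD is read from memory, so its parents are not visited.
        if (!n.cached) n.deps.foreach {
          case Narrow(p) => visit(p)
          case Wide(p)   => shuffleParents = p :: shuffleParents
        }
      }
    visit(target)
    (members, shuffleParents.reverse)
  }

  // Stages in execution order: parents first, the target's stage last.
  def buildStages(target: Node): List[Set[String]] = {
    val (members, parents) = stageOf(target)
    val parentStages = parents.filterNot(_.cached).flatMap(p => buildStages(p))
    (parentStages :+ members).distinct
  }
}
```

As a usage example, take a graph in which G joins B (narrow) and F (wide), B is a cached result of a wide operation on A, and F is a union of E and a map of C into D. `buildStages` on G then yields one stage containing C, D, E, F followed by one containing B and G; the stage that would compute B from A is skipped because B is already in memory.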
20. Task Assignment
• The scheduler assigns tasks to machines based on data locality,
using delay scheduling
--- if a task needs to process a partition that is available in
memory on a node, the scheduler sends it to that node
--- otherwise, if a task processes a partition for which the
containing RDD provides preferred locations (e.g., an HDFS
file), the scheduler sends it to those
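The two locality rules above amount to a simple preference order, sketched below. `chooseNode` and its parameters are hypothetical names for this illustration; the sketch does not model delay scheduling itself (waiting briefly for a preferred node to free up), and the final fallback to any available node is an assumption added so the function always has an answer when nodes exist.

```scala
// Hypothetical sketch of locality-based task placement; not Spark's scheduler.
object PlacementSketch {
  // cachedOn:  the node holding the partition in memory, if any
  // preferred: preferred locations reported by the RDD (e.g. HDFS block hosts)
  // allNodes:  nodes currently available to run tasks
  def chooseNode(
      cachedOn: Option[String],
      preferred: Seq[String],
      allNodes: Seq[String]
  ): Option[String] =
    cachedOn                                                 // 1. in-memory copy wins
      .orElse(preferred.find(n => allNodes.contains(n)))     // 2. then a preferred location
      .orElse(allNodes.headOption)                           // 3. assumed fallback: any node
}
```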
21. Memory Management
• in-memory storage as deserialized Java objects
---The first option provides the fastest performance, because the Java
VM can access each RDD element natively
• in-memory storage as serialized data
---The second option lets users choose a more memory-efficient
representation than Java object graphs when space is limited, at the
cost of lower performance
• on-disk storage
---The third option is useful for RDDs that are too large to keep in RAM
but costly to recompute on each use.
22. Not Suitable for RDDs
• RDDs are best suited for batch applications that apply the same
operation to all elements of a dataset
• RDDs would be less suitable for applications that make asynchronous
fine-grained updates to shared state, such as a storage system for a web
application or an incremental web crawler
24. Open Source Community
15 contributors, 5+ companies using Spark,
3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
25. Conclusion
• RDDs offer a simple and efficient programming model for a broad range of
applications (their immutable nature and coarse-grained transformations suit
a wide class of applications)
• They leverage the coarse-grained nature of many parallel algorithms for
low-overhead recovery
• They let users control each RDD’s partitioning (layout across nodes) and
persistence (storage in RAM, on disk, etc.)
Editor's Notes
Key idea: add “variables” to the “functions” in functional programming
Pipelined execution: for example, one can apply a map followed by a filter on an element-by-element basis
Example of how Spark computes job stages.
Boxes with solid outlines are RDDs. Partitions are shaded rectangles,
in black if they are already in memory.
To run an action on RDD G, we build stages at wide dependencies and pipeline narrow
transformations inside each stage. In this case, stage 1’s
output RDD is already in RAM, so we run stage 2 and then 3.
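My own summary:
1. Simple, efficient, and applicable to a fairly wide range of applications.
2. Lowers the cost of fault recovery for coarse-grained parallel algorithms.
3. The user decides which data will be reused and should therefore be persisted, and with which storage strategy. The user can also control how data is partitioned in order to avoid shuffles and improve efficiency (e.g., by co-partitioning; a shuffle is a slow, time-consuming operation).
4. More general than typical models. Most existing models are specialized designs that target domains where MapReduce performs poorly, such as Google's Pregel. Pregel's data sharing is implicit and tailored to graph computation, whereas the RDD model provides a more general data sharing abstraction (it can express Pregel's computation model and can also be used in other application scenarios, so it is more general and more flexible).
Differences from Pregel: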
A third class of systems provide high-level interfaces
for specific classes of applications requiring data sharing.
For example, Pregel [22] supports iterative graph applications,
while Twister [11] and HaLoop [7] are iterative
MapReduce runtimes. However, these frameworks perform
data sharing implicitly for the pattern of computation
they support, and do not provide a general abstraction
that the user can employ to share data of her choice
among operations of her choice. For example, a user cannot
use Pregel or Twister to load a dataset into memory
and then decide what query to run on it. RDDs provide
a distributed storage abstraction explicitly and can thus
support applications that these specialized systems do
not capture, such as interactive data mining.
Differences from MapReduce (summarized from the Shark paper):
1. Like Dryad and Tenzing [17, 9], it supports general computation
DAGs, not just the two-stage MapReduce topology.
2. It provides an in-memory storage abstraction called Resilient
Distributed Datasets (RDDs) that lets applications keep data
in memory across queries, and automatically reconstructs it
after failures [33].
3. The engine is optimized for low latency. It can efficiently
manage tasks as short as 100 milliseconds on clusters of
thousands of cores, while engines like Hadoop incur a latency
of 5–10 seconds to launch each task.
Four characteristics of RDDs (summarized from the Shark paper):
The RDD model offers several key benefits in our large-scale in-memory
computing setting.
First, RDDs can be written at the speed
of DRAM instead of the speed of the network, because there is no
need to replicate each byte written to another machine for fault tolerance.
DRAM in a modern server is over 10× faster than even a
10-Gigabit network.
Second, Spark can keep just one copy of each
RDD partition in memory, saving precious memory over a replicated
system, since it can always recover lost data using lineage.
Third, when a node fails, its lost RDD partitions can be rebuilt in
parallel across the other nodes, allowing speedy recovery.
Fourth, even if a node is just slow (a “straggler”), we can recompute necessary
partitions on other nodes because RDDs are immutable so
there are no consistency concerns with having two copies of a partition.