RESILIENT DISTRIBUTED DATASETS: A
FAULT-TOLERANT ABSTRACTION FOR
IN-MEMORY CLUSTER COMPUTING
MATEI ZAHARIA, MOSHARAF CHOWDHURY, TATHAGATA DAS, ANKUR DAVE, JUSTIN MA, MURPHY MCCAULEY,
MICHAEL J. FRANKLIN, SCOTT SHENKER, ION STOICA.
NSDI'12 PROCEEDINGS OF THE 9TH USENIX CONFERENCE ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION
PAPERS WE LOVE AMSTERDAM
AUGUST 13, 2015
@gabriele_modena
(C) PRESENTATION BY GABRIELE MODENA, 2015
About me
• CS.ML
• Data science & predictive modelling
• with a sprinkle of systems work
• Hadoop & c. for data wrangling & crunching numbers
• … and Spark
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
How
• Review (concepts from) key related work
• RDD + Spark
• Some critiques
Related work
• MapReduce
• Dryad
• Hadoop Distributed FileSystem (HDFS)
• Mesos
What’s an iterative algorithm anyway?
data = input data
w = <target vector>
for i in num_iterations:
    for item in data:
        update(w)

Multiple input scans; at each iteration, do something that updates a shared data structure (see the sketch below).
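A minimal, hypothetical Scala rendering of the same pattern (the names and the update rule are made up, no cluster framework involved): every iteration re-scans the full input and mutates a shared weight vector, which is exactly what repeated MapReduce jobs make expensive.

object IterativeSketch {
  def update(w: Array[Double], item: Array[Double]): Unit =
    for (j <- w.indices) w(j) += 0.01 * item(j)       // placeholder update rule

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 2.0), Array(0.5, -1.0)) // "input data"
    val w    = Array.fill(2)(0.0)                     // "<target vector>", shared state
    val numIterations = 10
    for (_ <- 1 to numIterations; item <- data)       // multiple scans of the input
      update(w, item)                                 // update the shared data structure
    println(w.mkString(", "))
  }
}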
HDFS
• GFS paper (2003)
• Distributed storage (with replication)
• Block ops
• NameNode keeps the file-to-block location mapping (see the sketch below)
[Diagram: one NameNode coordinating three DataNodes]
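To make the NameNode's role concrete, here is a small, hedged Scala sketch that asks HDFS for the block locations of a hypothetical file through Hadoop's FileSystem API; the block-to-host mapping it prints is the metadata the NameNode serves.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockLocations {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())               // reads core-site.xml / hdfs-site.xml
    val status = fs.getFileStatus(new Path("/data/input.txt")) // hypothetical file
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>                                      // one entry per block, with its replica hosts
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
    fs.close()
  }
}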
MapReduce
• Google paper (2004)
• Apache Hadoop (~2007)
• Divide and conquer functional model
• Goes hand-in-hand with HDFS
• Structure data as (key, value)
1. Map(): filter and project
emit (k, v) pairs
2. Reduce(): aggregate and summarise
group by key and count
[Diagram: HDFS blocks feed three Map tasks, whose output is shuffled to two Reduce tasks and written back to HDFS]
Input: "This is a test", "Yes it is a test", …
Map output: (This, 1), (is, 1), (a, 1), (test., 1), (Yes, 1), (it, 1), (is, 1)
Reduce output: (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)
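As an illustration (plain Scala, not Hadoop code), a self-contained sketch of the same word-count flow: a map phase that emits (word, 1) pairs and a reduce phase that groups by key and sums the counts.

object WordCountSketch {
  // Map(): filter/project each input line into (word, 1) pairs
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split("\\s+").map(word => (word, 1)).toSeq

  // Reduce(): group the pairs by key and sum the counts
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val input = Seq("This is a test", "Yes it is a test")
    val counts = reducePhase(input.flatMap(mapPhase))
    counts.foreach(println)   // e.g. (is,2), (a,2), (test,2), ...
  }
}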
[Figure omitted: (c) image from Apache Tez, http://tez.apache.org]
Critiques of MR and HDFS
• Great when records (and jobs) are independent
• In reality, expect data to be shuffled across the network
• Latency measured in minutes
• Performance hit for iterative methods
• Composability monsters
• Meant for batch workflows
Dryad
• Microsoft paper (2007)
• Inspired Apache Tez
• Generalisation of MapReduce via I/O pipelining
• Applications are (directed acyclic) graphs of tasks
Dryad
// word-count DAG expressed with the Apache Tez Java API (a Dryad-inspired engine)
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex,
                     summerVertex,
                     edgeConf.createDefaultEdgeProperty()));
MapReduce and Dryad
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
[Figure omitted: (c) image from Apache Tez, http://tez.apache.org, modified]
Critiques of Dryad
• No explicit abstraction for data sharing
• Must express data reps as DAG
• Partial solution: DryadLINQ
• No notion of a distributed filesystem
• How to handle large inputs?
• Local writes / remote reads?
Resilient Distributed Datasets
Read-only, partitioned collection of records
=> a distributed immutable array
Accessed via coarse-grained transformations
=> apply a function (a Scala closure) to all elements of the array
[Diagram: an array of objects split into partitions]
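A minimal Scala sketch of the idea, assuming a local Spark installation (the app name, master and data are made up): build a partitioned, read-only dataset and apply one coarse-grained transformation to every element.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 12, 4)       // read-only, partitioned collection of records
    val doubled = nums.map(_ * 2)               // coarse-grained transformation: closure applied to every element
    println(doubled.collect().mkString(", "))   // materialise the result on the driver
    sc.stop()
  }
}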
Spark: runtime and API
• Transformations - lazily create RDDs
  wc = dataset.flatMap(tokenize)
              .reduceByKey(add)
• Actions - execute computation
  wc.collect()
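For reference, a hedged, self-contained Scala version of the word count above (the input path is hypothetical): the flatMap and reduceByKey calls only describe RDDs, and nothing executes until collect() is called.

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[*]"))
    val dataset = sc.textFile("wordcount-input.txt")        // hypothetical input file
    val wc = dataset
      .flatMap(_.split("\\s+").map(word => (word, 1)))      // transformation: lazily defines an RDD
      .reduceByKey(_ + _)                                    // transformation: still nothing has executed
    wc.collect().foreach(println)                            // action: triggers the actual computation
    sc.stop()
  }
}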
Applications
• Driver code defines RDDs and invokes actions
• Submitted to long-lived workers, which store partitions in memory
• Scala closures are serialised as Java objects and passed across the network over HTTP (see the sketch below)
• Variables bound to the closure are saved in the serialised object
• Closures are deserialised on each worker and applied to the RDD (partition)
• Mesos takes care of resource management
[Diagram: the driver ships tasks to three workers; each worker keeps input data partitions in RAM and sends results back to the driver]
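The sketch referenced above: a hypothetical driver program whose filter closure captures a local variable. Spark serialises the closure (and the captured value) and ships it to the workers; the count() action brings the result back to the driver.

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-sketch").setMaster("local[*]"))
    val threshold = 4                                      // driver-side variable captured by the closure below
    val words = sc.parallelize(Seq("spark", "rdd", "lineage", "dag"))
    val longWords = words.filter(_.length > threshold)     // closure (and `threshold`) serialised and shipped to workers
    println(longWords.count())                             // action: tasks run on the workers, the count returns to the driver
    sc.stop()
  }
}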
Data persistence
1. in memory, as deserialised Java objects
2. in memory, as serialised data
3. on disk
RDD checkpointing
Memory management via an LRU eviction policy
.persist() an RDD for future reuse (see the sketch below)
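A hedged sketch of these options, assuming a SparkContext sc and an RDD named errors are already defined (the checkpoint directory is made up); a given RDD can only be assigned one storage level.

import org.apache.spark.storage.StorageLevel

// pick exactly one storage level per RDD:
errors.persist(StorageLevel.MEMORY_ONLY)        // 1. in memory, as deserialised Java objects (the default)
// errors.persist(StorageLevel.MEMORY_ONLY_SER) // 2. in memory, as serialised data
// errors.persist(StorageLevel.DISK_ONLY)       // 3. on disk

// RDD checkpointing: materialise the data to reliable storage so a lost
// partition does not have to be rebuilt from a long lineage chain.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // made-up directory
errors.checkpoint()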
Lineage
Fault recovery: if a partition is lost, derive it back from the lineage.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
[Lineage graph: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> hdfs errors -> map(_.split('\t')(3)) -> time fields]
Representation
Challenge: track lineage across transformations (see the sketch after this list)
1. Partitions
2. Data locality for partition p
3. List of dependencies
4. Iterator function to compute a dataset based on its parents
5. Metadata for the partitioner scheme
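A hedged Scala sketch of that five-part interface; the names are illustrative only and merely loosely mirror Spark's internal RDD API.

trait RddRepresentation[T] {
  def partitions: Seq[Int]                              // 1. the set of partitions
  def preferredLocations(partition: Int): Seq[String]   // 2. data locality hints for a partition
  def dependencies: Seq[RddRepresentation[_]]           // 3. dependencies on parent datasets
  def compute(partition: Int): Iterator[T]              // 4. iterator over a partition, computed from its parents
  def partitioner: Option[String]                       // 5. metadata about the partitioning scheme
}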
Narrow dependencies
pipelined execution on one cluster node (e.g. map, filter, union)
Wide dependencies
require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation (e.g. groupByKey, join with inputs not co-partitioned); see the sketch below
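A small, hedged sketch (assuming a SparkContext sc is in scope) contrasting the two kinds of dependency: the map stays within a partition, while groupByKey needs every parent partition and forces a shuffle.

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val narrow = pairs.map { case (k, v) => (k, v * 10) }  // narrow: each output partition depends on a single parent partition
val wide   = pairs.groupByKey()                        // wide: needs all parent partitions and forces a shuffle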
Scheduling
Tasks are allocated based on data locality (delay scheduling); see the sketch after this list.
1. An action is triggered => compute the RDD
2. Based on lineage, build a graph of stages to execute
3. Each stage contains as many pipelined transformations with narrow dependencies as possible
4. Launch tasks to compute missing partitions from each stage until the target RDD has been computed
5. If a task fails => re-run it on another node, as long as its stage's parents are still available.
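A hedged fragment building on the `wide` RDD from the narrow/wide sketch above: toDebugString prints the lineage the scheduler walks when building stages, with shuffle dependencies marking the stage boundaries.

println(wide.toDebugString)  // prints the lineage; shuffle dependencies mark the stage boundaries
wide.collect()               // the action makes the scheduler build stages and launch tasks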
(C) PRESENTATION BY GABRIELE MODENA, 2015
Job execution
union
map
groupBy
join
B
C D
E
F
G
Stage 3Stage 2
A
Stage 1
B = A.groupBy
D = C.map
F = D.union(E)
G = B.join(F)
G.collect()
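A hedged, runnable counterpart to the A..G example above, with made-up data and groupByKey standing in for the slide's groupBy; the two shuffles (groupByKey and join) are where the scheduler cuts the stages.

import org.apache.spark.{SparkConf, SparkContext}

object JobExecutionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stages").setMaster("local[*]"))
    val A = sc.parallelize(Seq(("x", 1), ("y", 2), ("x", 3)))
    val C = sc.parallelize(Seq(("x", "c1"), ("z", "c2")))
    val E = sc.parallelize(Seq(("y", "e1")))

    val B = A.groupByKey()        // wide dependency => stage boundary (Stage 1)
    val D = C.map(identity)       // narrow dependency, pipelined within its stage
    val F = D.union(E)            // narrow dependency, pipelined (Stage 2)
    val G = B.join(F)             // wide dependency => final stage (Stage 3)

    G.collect().foreach(println)  // the action triggers stage construction and execution
    sc.stop()
  }
}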
Evaluation
Some critiques (of the paper)
• How general is this approach?
• We are still doing MapReduce
• Concerns wrt iterative algorithms still stand
• CPU-bound workloads?
• Linear algebra?
• How much tuning is required?
• How does the partitioner work?
• What is the cost of reconstructing an RDD from lineage?
• Performance when data does not fit in memory
• E.g. a join between two very large non-co-partitioned RDDs
References (Theory)
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Zaharia et al., Proceedings of NSDI'12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark: cluster computing with working sets. Zaharia et al., Proceedings of HotCloud'10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
The Google File System. Ghemawat, Gobioff, Leung. 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat. OSDI'04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Mesos: a platform for fine-grained resource sharing in the data center. Hindman et al., Proceedings of NSDI'11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
References (Practice)
• An overview of the pyspark API through pictures https://github.com/jkthompson/pyspark-pictures
• Barry Brumitt's presentation on MapReduce design patterns (UW CSE490) http://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf
• The Dryad Project http://research.microsoft.com/en-us/projects/dryad/
• Apache Spark http://spark.apache.org
• Apache Hadoop https://hadoop.apache.org
• Apache Tez https://tez.apache.org
• Apache Mesos http://mesos.apache.org
