Spark’s distributed programming model
Martin Zapletal Cake Solutions
Apache Spark
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
Table of Contents
● Distributed programming introduction
● Programming models
● Dataflow systems and DAGs
● RDD
● Transformations, Actions, Persistence, Shared variables
Distributed programming
● reminder
○ unreliable network
○ ubiquitous failures
○ everything asynchronous
○ consistency, ordering and synchronisation expensive
○ local time
○ correctness properties safety and liveness
○ ...
Two armies (generals)
● two armies, A (Red) and B (Blue)
● separated parts A1 and A2 of army A must synchronize their attack to win
● consensus over an unreliable communication channel
● no node failures, no Byzantine failures, …
● designated leader
Parallel programming models
● Parallel computing models
○ Different parallel computing problems
■ Easily parallelizable or communication needed
○ Shared memory
■ On one machine
● Multiple CPUs/GPUs share memory
■ On multiple machines
● Shared memory accessed via network
● Still much slower compared to memory
■ OpenMP, Global Arrays, …
○ Share nothing
■ Processes communicate by sending messages
■ Send(), Receive()
■ MPI
○ usually no fault tolerance
Dataflow system
● a term used to describe a general parallel programming approach
● in the traditional von Neumann architecture, instructions are executed sequentially by a
worker (CPU) and the data does not move
● in Dataflow workers have different tasks assigned to them and form an assembly
line
● program represented by connections and black box operations - directed graph
● data moves between tasks
● task executed by worker as soon as inputs available
● inherently parallel
● no shared state
● closer to functional programming
● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
MapReduce
● shows that Dataflow can be expressed in terms of map and reduce
operations
● simple to parallelize
● but each map-reduce is separate from the rest
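A minimal word count sketch of the map + reduce pattern on Spark (assuming a SparkContext sc and an input file "input.txt", both illustrative):

// the map phase emits (word, 1) pairs, the reduce phase sums the counts per key
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)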
Directed acyclic graph
● Spark is a Dataflow execution engine; iterative (cyclic) computations are expressed by
unrolling them into an acyclic DAG
● whole DAG is formed lazily
● allows global optimizations
● has expressiveness of MPI
● lineage tracking
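A small sketch of how the lazily built DAG and its lineage can be inspected (assuming a SparkContext sc):

val rdd = sc.parallelize(1 to 1000)
  .map(_ * 2)           // transformation, only recorded in the DAG
  .filter(_ % 3 == 0)   // transformation, only recorded in the DAG

// prints the chain of parent RDDs, i.e. the lineage used for recomputation
println(rdd.toDebugString)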
Optimizations
● similar to optimizations of RDBMS (operation reordering, bushy
join-order enumeration, aggregation push-down)
● however DAGs less restrictive than database queries and it is
difficult to optimize UDFs (higher order functions used in Spark,
Flink)
● potentially major performance improvement
● partial support for incremental algorithm optimization (local changes) with
sparse computational dependencies (GraphX)
Optimizations
sc
.parallelize(people)
.map(p => Person(p.age, p.height * 2.54))
.filter(_.age < 35)
sc
.parallelize(people)
.filter(_.age < 35)
.map(p => Person(p.age, p.height * 2.54))
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))
Optimizations
sc
.parallelize(people)
.map(p => Person(p.age, p.height * 2.54))
.filter(_.height < 170)
sc
.parallelize(people)
.filter(_.height < 170)
.map(p => Person(p.age, p.height * 2.54))
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))
???
Optimizations
1. logical rewriting applying rules to trees of operators (e.g. filter push down)
○ static code analysis (bytecode of each UDF) to check reordering rules
○ emits all valid reordered data flow alternatives
2. logical representation translated to physical representation
○ chooses physical execution strategies for each alternative (partitioning,
broadcasting, external sorts, merge and hash joins, …)
○ uses a cost based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
Stream optimizations
● similar, because in Spark streams are just mini-batches
● a few extra window and state operations
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
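A rough Spark Streaming sketch of the same running count (the source on localhost:9999 and the checkpoint directory are assumptions; updateStateByKey plays the role of runningReduce above):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("/tmp/checkpoint")                        // required by updateStateByKey

val pageViews = ssc.socketTextStream("localhost", 9999)  // one URL per line
val ones      = pageViews.map(url => (url, 1))

// running count per URL across batches
val counts = ones.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + values.sum)
}

counts.print()
ssc.start()
ssc.awaitTermination()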
Performance
                     Hadoop                  Spark     Spark
Data size            102.5 TB                100 TB    1000 TB
Time [min]           72                      23        234
Nodes                2100                    206       190
Cores                50400                   6592      6080
Rate/node [GB/min]   0.67                    20.7      22.5
Environment          dedicated data center   EC2       EC2
● fastest open source solution to sort 100 TB of data in the Daytona Gray Sort Benchmark
(http://sortbenchmark.org/)
● required some improvements in the shuffle approach
● highly optimized sorting algorithm (cache locality, unsafe off-heap memory structures, GC, …)
● Databricks blog + presentation
Spark programming model
● RDD
● parallelizing collections
● loading external datasets
● operations
○ transformations
○ actions
● persistence
● shared variables
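The two basic ways to create an RDD, sketched below (assuming a SparkContext sc and an existing file "data.txt"):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))  // parallelizing a local collection
val fromFile       = sc.textFile("data.txt")             // loading an external dataset, one record per line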
RDD
● transformations
○ lazy, form the DAG
○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union,
intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup,
repartition, cartesian, glom, ...
● actions
○ execute DAG
○ retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● different categories of transformations with different complexity, performance and
semantics
● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning
● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
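A short sketch of the lazy transformation / eager action split (assuming a SparkContext sc):

val words  = sc.parallelize(Seq("spark", "dataflow", "rdd", "spark"))
val paired = words.map(w => (w, 1))     // transformation - lazy, only extends the DAG
val counts = paired.reduceByKey(_ + _)  // transformation - lazy, only extends the DAG

counts.collect().foreach(println)       // action - triggers execution of the DAG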
Transformations with narrow deps
● map
● union
● join with copartitioned inputs
Transformations with wide deps
● groupBy
● join without copartitioned inputs
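A sketch contrasting a narrow and a wide dependency (assuming a SparkContext sc); toDebugString shows the shuffle boundary that the wide dependency introduces:

val nums = sc.parallelize(1 to 100, 4)

val narrow = nums.map(_ + 1)                          // narrow: one input partition per output partition
val wide   = nums.map(n => (n % 10, n)).groupByKey()  // wide: requires a shuffle

println(wide.toDebugString)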
Actions - collect
● retrieves result to driver program
● no longer distributed
Actions - reduction
● associative, commutative operation
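A minimal sketch of both actions (assuming a SparkContext sc):

val nums = sc.parallelize(1 to 10)

val local = nums.collect()      // Array(1, 2, ..., 10) on the driver, no longer distributed
val sum   = nums.reduce(_ + _)  // 55, associative and commutative operation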
Cache
● cache partitions to be reused in subsequent actions on the dataset or on datasets
derived from it
● the snapshot is used instead of recomputation from lineage
● fault tolerant
● cache(), persist()
● levels
○ memory
○ disk
○ both
○ serialized
○ replicated
○ off-heap
● automatic cache after shuffle
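A caching sketch (assuming a SparkContext sc and an input file "logs.txt"):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("logs.txt").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)   // or errors.cache() for the default MEMORY_ONLY level

val total  = errors.count()                    // first action computes and caches the partitions
val sample = errors.take(5)                    // subsequent actions reuse the cached partitions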
Shared variables - broadcast
● usually all variables used in a UDF are copied to each node
● shared read/write variables would be very inefficient
● broadcast
○ read only variables
○ efficient broadcast algorithm, can deliver data cheaply to all nodes
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
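A sketch of using the broadcast value inside a transformation (the lookup table below is illustrative); each node receives the broadcast once instead of once per task:

val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

sc.parallelize(Seq(1, 2, 3, 4))
  .map(id => lookup.value.getOrElse(id, "unknown"))
  .collect()   // Array("one", "two", "three", "unknown")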
Shared variables - accumulators
● accumulators
○ add only
○ use an associative operation, so they are efficient in parallel
○ only the driver program can read the value
○ exactly-once semantics only guaranteed for updates inside actions (in case of
failure and recomputation)
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Shared variables - accumulators
// custom accumulator for an assumed mathematical Vector class (providing
// Vector.zeros(n) and an in-place += operator), as in the Spark 1.x docs
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  // zero value of the same dimension as the initial value
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  // merge two partial results; += is assumed to mutate and return v1
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}
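A hypothetical usage sketch of the custom accumulator (Vector remains the assumed class above; the AccumulatorParam is passed to sc.accumulator):

val vecAccum = sc.accumulator(Vector.zeros(3))(VectorAccumulatorParam)
sc.parallelize(Seq(Vector.zeros(3), Vector.zeros(3))).foreach(v => vecAccum += v)
vecAccum.value   // readable only on the driver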
Conclusion
● expressive and abstract programming model
● user defined functions
● based on research
● optimizations
● constraining in certain cases (spanning partition boundaries, functions of
multiple variables, ...)
Questions

Editor's Notes

  • #5 anything can fail (network, nodes, lost or damaged packets, …). Liveness properties: assert that something ‘good’ will eventually happen during execution. Safety properties: assert that nothing ‘bad’ will ever happen during an execution (that is, that the program will never enter a ‘bad’ state).
  • #7 HPC shared memory may or may not be good; depends on communication patterns; locks may be needed
  • #23 describe each level - e.g. serialized, off-heap, replicated