Spark’s distributed programming model
Martin Zapletal Cake Solutions
Apache Spark
Apache Spark and Big Data
1) History and market overview
2) Installation
3) MLlib and machine learning on Spark
4) Porting R code to Scala and Spark
5) Concepts - Core, SQL, GraphX, Streaming
6) Spark’s distributed programming model
Table of Contents
● Distributed programming introduction
● Programming models
● Dataflow systems and DAGs
● RDD
● Transformations, Actions, Persistence, Shared variables
Distributed programming
● reminder
○ unreliable network
○ ubiquitous failures
○ everything asynchronous
○ consistency, ordering and synchronisation expensive
○ local time
○ correctness properties safety and liveness
○ ...
Two armies (generals)
● two armies, A (Red) and B (Blue)
● separated parts A1 and A2 of army A must synchronize their attack to win
● consensus over an unreliable communication channel
● no node failures, no Byzantine failures, …
● designated leader
Parallel programming models
● Parallel computing models
○ Different parallel computing problems
■ Easily parallelizable or communication needed
○ Shared memory
■ On one machine
● Multiple CPUs/GPUs share memory
■ On multiple machines
● Shared memory accessed via network
● Still much slower compared to memory
■ OpenMP, Global Arrays, …
○ Share nothing
■ Processes communicate by sending messages
■ Send(), Receive()
■ MPI
○ usually no fault tolerance
Dataflow system
● a term used to describe a general parallel programming approach
● in the traditional von Neumann architecture, instructions are executed sequentially by a
worker (CPU) and the data does not move
● in Dataflow workers have different tasks assigned to them and form an assembly
line
● program represented by connections and black box operations - directed graph
● data moves between tasks
● task executed by worker as soon as inputs available
● inherently parallel
● no shared state
● closer to functional programming
● not Spark specific (Stratosphere, MapReduce, Pregel, Giraph, Storm, ...)
MapReduce
● shows that Dataflow can be expressed in terms of map and reduce
operations
● simple to parallelize
● but each map-reduce is separate from the rest
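A minimal word count sketch of the map + reduce pattern on Spark (assuming a SparkContext sc and an input file "input.txt", both illustrative):

// the map phase emits (word, 1) pairs, the reduce phase sums the counts per key
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)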
Directed acyclic graph
● Spark is a Dataflow execution engine; iterative (cyclic) computations are expressed by
unrolling them into an acyclic DAG
● whole DAG is formed lazily
● allows global optimizations
● has expressiveness of MPI
● lineage tracking
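A small sketch of how the lazily built DAG and its lineage can be inspected (assuming a SparkContext sc):

val rdd = sc.parallelize(1 to 1000)
  .map(_ * 2)           // transformation, only recorded in the DAG
  .filter(_ % 3 == 0)   // transformation, only recorded in the DAG

// prints the chain of parent RDDs, i.e. the lineage used for recomputation
println(rdd.toDebugString)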
Optimizations
● similar to optimizations of RDBMS (operation reordering, bushy
join-order enumeration, aggregation push-down)
● however DAGs less restrictive than database queries and it is
difficult to optimize UDFs (higher order functions used in Spark,
Flink)
● potentially major performance improvement
● partial support for incremental algorithm optimization (local changes) with
sparse computational dependencies (GraphX)
Optimizations
sc
.parallelize(people)
.map(p => Person(p.age, p.height * 2.54))
.filter(_.age < 35)
sc
.parallelize(people)
.filter(_.age < 35)
.map(p => Person(p.age, p.height * 2.54))
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))
Optimizations
sc
.parallelize(people)
.map(p => Person(p.age, p.height * 2.54))
.filter(_.height < 170)
sc
.parallelize(people)
.filter(_.height < 170)
.map(p => Person(p.age, p.height * 2.54))
case class Person(age: Int, height: Double)
val people = (0 to 100).map(x => Person(x, x))
???
Optimizations
1. logical rewriting applying rules to trees of operators (e.g. filter push down)
○ static code analysis (bytecode of each UDF) to check reordering rules
○ emits all valid reordered data flow alternatives
2. logical representation translated to physical representation
○ chooses physical execution strategies for each alternative (partitioning,
broadcasting, external sorts, merge and hash joins, …)
○ uses a cost based optimizer (I/O, disk I/O, CPU costs, UDF costs, network)
Stream optimizations
● similar, because in Spark streams are just mini-batches
● a few extra window and state operations
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
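A rough Spark Streaming sketch of the same running count (the source on localhost:9999 and the checkpoint directory are assumptions; updateStateByKey plays the role of runningReduce above):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("/tmp/checkpoint")                        // required by updateStateByKey

val pageViews = ssc.socketTextStream("localhost", 9999)  // one URL per line
val ones      = pageViews.map(url => (url, 1))

// running count per URL across batches
val counts = ones.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + values.sum)
}

counts.print()
ssc.start()
ssc.awaitTermination()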
Performance
                     Hadoop                  Spark     Spark
Data size            102.5 TB                100 TB    1000 TB
Time [min]           72                      23        234
Nodes                2100                    206       190
Cores                50400                   6592      6080
Rate/node [GB/min]   0.67                    20.7      22.5
Environment          dedicated data center   EC2       EC2
● fastest open source solution to sort 100 TB of data in the Daytona Gray Sort Benchmark
(http://sortbenchmark.org/)
● required some improvements in the shuffle approach
● highly optimized sorting algorithm (cache locality, unsafe off-heap memory structures, GC, …)
● Databricks blog + presentation
Spark programming model
● RDD
● parallelizing collections
● loading external datasets
● operations
○ transformations
○ actions
● persistence
● shared variables
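The two basic ways to create an RDD, sketched below (assuming a SparkContext sc and an existing file "data.txt"):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))  // parallelizing a local collection
val fromFile       = sc.textFile("data.txt")             // loading an external dataset, one record per line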
RDD
● transformations
○ lazy, form the DAG
○ map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, sample, union,
intersection, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup,
repartition, cartesian, glom, ...
● actions
○ execute DAG
○ retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● different categories of transformations with different complexity, performance and
semantics
● e.g. mapping, filtering, grouping, set operations, sorting, reducing, partitioning
● full list https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
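A short sketch of the lazy transformation / eager action split (assuming a SparkContext sc):

val words  = sc.parallelize(Seq("spark", "dataflow", "rdd", "spark"))
val paired = words.map(w => (w, 1))     // transformation - lazy, only extends the DAG
val counts = paired.reduceByKey(_ + _)  // transformation - lazy, only extends the DAG

counts.collect().foreach(println)       // action - triggers execution of the DAG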
Transformations with narrow deps
● map
● union
● join with copartitioned inputs
Transformations with wide deps
● groupBy
● join without copartitioned inputs
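A sketch contrasting a narrow and a wide dependency (assuming a SparkContext sc); toDebugString shows the shuffle boundary that the wide dependency introduces:

val nums = sc.parallelize(1 to 100, 4)

val narrow = nums.map(_ + 1)                          // narrow: one input partition per output partition
val wide   = nums.map(n => (n % 10, n)).groupByKey()  // wide: requires a shuffle

println(wide.toDebugString)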
Actions - collect
● retrieves result to driver program
● no longer distributed
Actions - reduction
● associative, commutative operation
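A minimal sketch of both actions (assuming a SparkContext sc):

val nums = sc.parallelize(1 to 10)

val local = nums.collect()      // Array(1, 2, ..., 10) on the driver, no longer distributed
val sum   = nums.reduce(_ + _)  // 55, associative and commutative operation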
Cache
● cache partitions to be reused in subsequent actions on the dataset or on datasets
derived from it
● the snapshot is used instead of recomputation from lineage
● fault tolerant
● cache(), persist()
● levels
○ memory
○ disk
○ both
○ serialized
○ replicated
○ off-heap
● automatic cache after shuffle
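A caching sketch (assuming a SparkContext sc and an input file "logs.txt"):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("logs.txt").filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)   // or errors.cache() for the default MEMORY_ONLY level

val total  = errors.count()                    // first action computes and caches the partitions
val sample = errors.take(5)                    // subsequent actions reuse the cached partitions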
Shared variables - broadcast
● usually all variables used in a UDF are copied to each node
● shared read/write variables would be very inefficient
● broadcast
○ read only variables
○ efficient broadcast algorithm, can deliver data cheaply to all nodes
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
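A sketch of using the broadcast value inside a transformation (the lookup table below is illustrative); each node receives the broadcast once instead of once per task:

val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

sc.parallelize(Seq(1, 2, 3, 4))
  .map(id => lookup.value.getOrElse(id, "unknown"))
  .collect()   // Array("one", "two", "three", "unknown")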
Shared variables - accumulators
● accumulators
○ add only
○ use an associative operation, so they are efficient in parallel
○ only the driver program can read the value
○ exactly-once semantics only guaranteed for updates inside actions (in case of
failure and recomputation)
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Shared variables - accumulators
// custom accumulator for an assumed mathematical Vector class (providing
// Vector.zeros(n) and an in-place += operator), as in the Spark 1.x docs
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  // zero value of the same dimension as the initial value
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  // merge two partial results; += is assumed to mutate and return v1
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}
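A hypothetical usage sketch of the custom accumulator (Vector remains the assumed class above; the AccumulatorParam is passed to sc.accumulator):

val vecAccum = sc.accumulator(Vector.zeros(3))(VectorAccumulatorParam)
sc.parallelize(Seq(Vector.zeros(3), Vector.zeros(3))).foreach(v => vecAccum += v)
vecAccum.value   // readable only on the driver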
Conclusion
● expressive and abstract programming model
● user defined functions
● based on research
● optimizations
● constraining in certain cases (spanning partition boundaries, functions of
multiple variables, ...)
Questions

Editor's Notes

  • #5 anything can fail (network, nodes, lost or damaged packets, …). Liveness properties: assert that something ‘good’ will eventually happen during execution. Safety properties: assert that nothing ‘bad’ will ever happen during an execution (that is, that the program will never enter a ‘bad’ state).
  • #7 HPC shared memory may or may not be good; depends on communication patterns; locks may be needed
  • #23 describe each level - e.g. serialized, off-heap, replicated