This presentation was given at the Budapest Spark Meetup.
http://www.meetup.com/Budapest-Spark-Meetup/events/229462749/
The video of the presentation can be found here: http://ustre.am/1uMAT
Let's Spark together - Beginners Edition!
We will code our first Spark program together, and I will explain the process step by step. We will dig deep into what happens when we run our program, and see how to set up a Spark environment at home.
Bio:
Mate is Co-founder and CTO of enbrite.ly, a Budapest-based startup with the vision to create the next-generation decision support system in online advertising, one that covers the market needs of the future. Mate has many years of experience with Big Data architectures, data analytics pipelines, infrastructure operations, and growing organizations by focusing on culture. Besides enbrite.ly, he is Chief Architect at Dmlab, a leading data analytics company providing innovative data products and services. Mate teaches Big Data analytics at the Budapest University of Technology and Economics and runs courses for companies. He is a speaker at local and international conferences and meetups.
9. DRIVER PROGRAM
Your main function. This is what you write.
It launches parallel operations on the cluster.
The driver accesses Spark through a SparkContext, which represents
your connection to the computing cluster and lets you create RDDs.
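As a minimal PySpark sketch of a driver setting up its SparkContext (the app name and the local master URL are placeholder assumptions; point setMaster at your own cluster):

from pyspark import SparkConf, SparkContext

# Placeholder app name and master URL -- adjust for your environment.
conf = SparkConf().setAppName("first-spark-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Through the SparkContext the driver creates RDDs and launches work.
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())  # the action runs in parallel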
21. LIFECYCLE OF A SPARK PROGRAM
1. READ DATA FROM EXTERNAL SOURCE
2. CREATE LAZILY EVALUATED TRANSFORMATIONS
3. CACHE ANY INTERMEDIATE RDD TO REUSE
4. KICK IT OFF BY CALLING SOME ACTION
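A minimal sketch of these four steps in PySpark (the input file name is a placeholder):

lines = sc.textFile("input.txt")                # 1. read from external source
words = lines.flatMap(lambda l: l.split(" "))   # 2. lazy transformations --
pairs = words.map(lambda w: (w, 1))             #    nothing has run yet
pairs.cache()                                   # 3. cache intermediate RDD for reuse
counts = pairs.reduceByKey(lambda a, b: a + b)  #    still lazy
print(counts.take(5))                           # 4. the action kicks off the job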
23. RDD INTERNALS
RDD INTERFACE
➔ set of PARTITIONS
➔ list of DEPENDENCIES on PARENT RDDs
➔ functions to COMPUTE a partition given parents
➔ preferred LOCATIONS (optional)
➔ PARTITIONER for K/V pairs (optional)
24. MULTIPLE RDDs
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

/** Implemented by subclasses to return the set of partitions in this RDD. */
protected def getPartitions: Array[Partition]

/** Implemented by subclasses to return how this RDD depends on parent RDDs. */
protected def getDependencies: Seq[Dependency[_]] = deps

/** Optionally overridden by subclasses to specify placement preferences. */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None
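Spark's concrete RDDs implement exactly this contract: the RDD returned by map(), for example, computes a partition by applying the function to its parent's iterator, while an RDD reading from HDFS reports the hosts holding each block as its preferred locations.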
37. STAGE
❏ Unit of execution
❏ Named after the last transformation
(the one runJob was called on)
❏ Transformations pipelined together into
stages
❏ Stage boundary usually means shuffling
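To make the pipelining visible, a small PySpark sketch: the narrow flatMap and map run pipelined in one stage, while reduceByKey forces a shuffle and hence a stage boundary; toDebugString prints the lineage with these boundaries:

pairs = (sc.parallelize(["a a b", "b c"])
           .flatMap(lambda s: s.split(" "))     # narrow: pipelined in one stage
           .map(lambda w: (w, 1)))              # narrow: same stage
counts = pairs.reduceByKey(lambda a, b: a + b)  # wide: shuffle -> new stage
print(counts.toDebugString())                   # lineage, indented per stage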
41. Repartitioning
text = sc.textFile("twit1.txt")
nonempty = text.filter(lambda x: len(x) > 0)    # keep non-empty lines
words = nonempty.flatMap(lambda x: x.split(" "))
fwords = words.filter(lambda x: len(x) > 1)     # drop one-character tokens
ones = fwords.map(lambda x: (x, 1))
rp = ones.repartition(6)                        # reshuffle into 6 partitions
result = rp.reduceByKey(lambda l, r: l + r)
result.collect()
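Note that repartition(6) is itself a shuffle: it redistributes the pairs into 6 partitions, and reduceByKey then shuffles again to bring equal keys together, so this job runs as three stages separated by those two shuffle boundaries.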
42. TaskSet
THE PROCESS
RDD Objects ➔ DAG Scheduler ➔ Task Scheduler ➔ Executor
❏ RDD Objects: build the DAG of operators, e.g.
sc.textFile().map().groupBy().filter()
❏ DAG Scheduler: splits the DAG into stages of tasks and submits each
stage as a TaskSet; a stage is ready when all the tasks it depends on
have finished
❏ Task Scheduler: launches tasks, retries failed tasks
❏ Executor: task threads execute the tasks; the block manager stores
and serves blocks
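A PySpark sketch of the operator chain from the diagram (the input file name is a placeholder); calling an action such as collect() is what sends the DAG through this whole pipeline:

rdd = (sc.textFile("input.txt")            # RDD objects: build the DAG lazily
         .map(lambda line: line.lower())
         .groupBy(lambda line: line[:1])   # group lines by first character
         .filter(lambda kv: len(kv[1]) > 1))
# collect() submits a job: the DAG scheduler splits the DAG into stages,
# the task scheduler launches the tasks, and executors run them while the
# block manager stores and serves the blocks they need.
result = rdd.collect()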