Core Services behind Spark Job Execution
DAGScheduler and TaskScheduler
Agenda
● Spark Architecture
● RDD
● Job-Stage-Task
● Bird's-Eye View
● DAGScheduler
● TaskScheduler
Spark Architecture
Driver
● Hosts the SparkContext
● Cockpit of job and task execution
● Schedules tasks to run on executors
● Contains the DAGScheduler and TaskScheduler
Executor
● Static vs. dynamic allocation (sketch below)
● Sends heartbeats and metrics to the driver
● Provides in-memory storage for RDDs
● Communicates directly with the driver to execute tasks
RDD — Resilient Distributed Dataset
● RDD is the primary data abstraction in Spark and the core of its execution model
● Motivation for RDD
● Features of RDD
RDD — Resilient Distributed Dataset
● RDD creation
● RDD Lineage
● Lazy evaluation
● Partitions
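A minimal sketch, assuming a local run, that shows creation, lineage, lazy evaluation, and partitions together:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
val nums = sc.parallelize(1 to 100, numSlices = 4) // RDD creation with 4 partitions
val doubled = nums.map(_ * 2)                      // transformation is lazy: no job runs yet
println(doubled.getNumPartitions)                  // 4
println(doubled.toDebugString)                     // prints the RDD lineage
println(doubled.sum())                             // action: only now does a job execute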
RDD operations
● Transformations
● Actions
A sample Spark program
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")          // an RDD
val errors = file.filter(_.contains("ERROR")) // a transformation: returns a new RDD, runs nothing yet
val errorCount = errors.count()               // an action: triggers a job and returns a result
Job-Stage-Task
What is a Job?
● Top-level work item, triggered by an action
● Computing a job == computing the partitions of a target RDD
● Defined by the target RDD's lineage
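A minimal sketch, reusing the sc from the sketch above: every action submits a separate job:

val data = sc.parallelize(1 to 10)
data.count()   // action → job 0
data.collect() // action → job 1: same RDD, one job per action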
Job-Stage-Task
Job divided into stages
● Logical plan → physical plan (execution units)
● Set of parallel tasks
● Stage boundary: shuffle
● Computing a stage first triggers execution of its parent stages
Types of Stages
ShuffleMapStage:
Intermediate stage in execution DAG
Saves map output → fetched later
Pipelined operations before shuffle
Can be shared across jobs
ResultStage:
Final stage, executes the action
Works on one or many partitions
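A minimal sketch, again assuming sc: reduceByKey forces a shuffle, so this single job splits into a ShuffleMapStage and a ResultStage:

val words = sc.textFile("hdfs://...") // path elided as in the sample program above
val counts = words.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // the indented lineage section marks the shuffle (stage) boundary
counts.collect()              // one action → one job with two stages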
Job-Stage-Task
Smallest unit of execution
Comprises a function and a placement preference
A task operates on a single partition
Launched and run on an executor
Types of Task
ShuffleMapTask:
Runs in intermediate stages
Returns a MapStatus
ResultTask:
Runs in the last stage
Returns its output to the driver
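A minimal sketch: the number of tasks a stage launches equals the number of partitions it computes:

val nums8 = sc.parallelize(1 to 100, numSlices = 8)
nums8.count() // the single ResultStage runs 8 ResultTasks, one per partition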
DAGScheduler
● Initialization
● Stage-oriented scheduling (logical plan → physical plan)
● DAGSchedulerEvent (job and stage events)
● Stage submission
DAGScheduler Responsibilities
● Computes an execution DAG and submits its stages to the TaskScheduler
● Determines the preferred locations to run each task on, keeping track of cached RDDs
● Handles failures due to lost shuffle output files (FetchFailed, ExecutorLost)
● Retries failed stages
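A minimal sketch of the cached-RDD point: once the first job materializes the cache, the DAGScheduler prefers the executors holding the cached blocks when placing later tasks:

val logs = sc.textFile("hdfs://...").cache()
logs.count()                             // first job materializes and caches the partitions
logs.filter(_.contains("ERROR")).count() // later tasks are preferentially placed near the cached blocks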
TaskScheduler
● Lifecycle
TaskSet and TaskSetManager
● What is a TaskSet?
○ A fully independent set of tasks, typically the missing partitions of a stage
● Why a TaskSetManager?
● Responsibilities of a TaskSetManager
○ Scheduling the tasks in a TaskSet
○ Completion notification
○ Retry and abort
○ Locality preference
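A minimal sketch of the retry and locality knobs a TaskSetManager consults; the keys are standard Spark configs and the values shown are the defaults:

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")   // how long to hold out for a locality-preferred slot before degrading locality
  .set("spark.task.maxFailures", "4") // task attempts before the TaskSetManager aborts the stage's job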
TaskScheduler's Responsibilities
● Submits tasks for execution for every stage
● Works closely with the DAGScheduler to resubmit failed stages
● Tracks the executors in a Spark application (executorHeartbeatReceived, executorLost)
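A minimal sketch of the heartbeat side of executor tracking; the keys are standard configs and the values shown are the defaults:

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // how often each executor heartbeats to the driver
  .set("spark.network.timeout", "120s")           // no heartbeat within this window → the executor is treated as lost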
Thank You
