Core Services behind Spark Job Execution
DAGScheduler and TaskScheduler
Agenda
● Spark Architecture
● RDD
● Job-Stage-Task
● Bird's-Eye View
● DAGScheduler
● TaskScheduler
Spark Architecture
Driver
● Hosts the SparkContext
● Cockpit of job and task execution
● Schedules tasks to run on executors
● Contains the DAGScheduler and TaskScheduler
Executor
● Static vs. dynamic allocation (sketch below)
● Sends heartbeats and metrics to the driver
● Provides in-memory storage for RDDs
● Communicates directly with the driver to execute tasks
RDD — Resilient Distributed Dataset
● RDD is the primary data abstraction in Spark and the core of its execution model
● Motivation for RDD
● Features of RDD
RDD — Resilient Distributed Dataset
● RDD creation
● RDD Lineage
● Lazy evaluation
● Partitions
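A minimal sketch, assuming a local run, that shows creation, lineage, lazy evaluation, and partitions together:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))
val nums = sc.parallelize(1 to 100, numSlices = 4) // RDD creation with 4 partitions
val doubled = nums.map(_ * 2)                      // transformation is lazy: no job runs yet
println(doubled.getNumPartitions)                  // 4
println(doubled.toDebugString)                     // prints the RDD lineage
println(doubled.sum())                             // action: only now does a job execute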
RDD operations
● Transformations
● Actions
A sample Spark program
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")          // an RDD
val errors = file.filter(_.contains("ERROR")) // a transformation: returns a new RDD, runs nothing yet
val errorCount = errors.count()               // an action: triggers a job and returns a result
Job-Stage-Task
What is a Job?
● Top-level work item, triggered by an action
● Computing a job == computing the partitions of a target RDD
● Defined by the target RDD's lineage
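A minimal sketch, reusing the sc from the sketch above: every action submits a separate job:

val data = sc.parallelize(1 to 10)
data.count()   // action → job 0
data.collect() // action → job 1: same RDD, one job per action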
Job-Stage-Task
Job divided into stages
● Logical plan → physical plan (execution units)
● Set of parallel tasks
● Stage boundary: shuffle
● Computing a stage first triggers execution of its parent stages
Types of Stages
ShuffleMapStage:
Intermediate stage in execution DAG
Saves map output → fetched later
Pipelined operations before shuffle
Can be shared across jobs
ResultStage:
Final stage, executes the action
Works on one or many partitions
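A minimal sketch, again assuming sc: reduceByKey forces a shuffle, so this single job splits into a ShuffleMapStage and a ResultStage:

val words = sc.textFile("hdfs://...") // path elided as in the sample program above
val counts = words.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // the indented lineage section marks the shuffle (stage) boundary
counts.collect()              // one action → one job with two stages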
Job-Stage-Task
Smallest unit of execution
Comprises a function and a placement preference
A task operates on a single partition
Launched and run on an executor
Types of Task
ShuffleMapTask:
Runs in intermediate stages
Returns a MapStatus
ResultTask:
Runs in the last stage
Returns its output to the driver
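A minimal sketch: the number of tasks a stage launches equals the number of partitions it computes:

val nums8 = sc.parallelize(1 to 100, numSlices = 8)
nums8.count() // the single ResultStage runs 8 ResultTasks, one per partition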
DAGScheduler
● Initialization
● Stage-oriented scheduling (logical plan → physical plan)
● DAGSchedulerEvent (job and stage events)
● Stage submission
DAGScheduler Responsibilities
● Computes an execution DAG and submits its stages to the TaskScheduler
● Determines the preferred locations to run each task on, keeping track of cached RDDs
● Handles failures due to lost shuffle output files (FetchFailed, ExecutorLost)
● Retries failed stages
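A minimal sketch of the cached-RDD point: once the first job materializes the cache, the DAGScheduler prefers the executors holding the cached blocks when placing later tasks:

val logs = sc.textFile("hdfs://...").cache()
logs.count()                             // first job materializes and caches the partitions
logs.filter(_.contains("ERROR")).count() // later tasks are preferentially placed near the cached blocks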
TaskScheduler
● Lifecycle
TaskSet and TaskSetManager
● What is a TaskSet?
○ A fully independent set of tasks, typically the missing partitions of a stage
● Why a TaskSetManager?
● Responsibilities of a TaskSetManager
○ Scheduling the tasks in a TaskSet
○ Completion notification
○ Retry and abort
○ Locality preference
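A minimal sketch of the retry and locality knobs a TaskSetManager consults; the keys are standard Spark configs and the values shown are the defaults:

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")   // how long to hold out for a locality-preferred slot before degrading locality
  .set("spark.task.maxFailures", "4") // task attempts before the TaskSetManager aborts the stage's job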
TaskScheduler's Responsibilities
● Submits tasks for execution for every stage
● Works closely with the DAGScheduler to resubmit failed stages
● Tracks the executors in a Spark application (executorHeartbeatReceived, executorLost)
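A minimal sketch of the heartbeat side of executor tracking; the keys are standard configs and the values shown are the defaults:

val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // how often each executor heartbeats to the driver
  .set("spark.network.timeout", "120s")           // no heartbeat within this window → the executor is treated as lost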
Thank You
