This document discusses the internals of Apache Spark, including its architecture, execution workflow, and key concepts such as tasks, stages, and jobs. It begins with an overview of the Spark cluster architecture, consisting of a driver program, executors running on worker nodes, and a cluster manager. It then defines tasks as the individual units of execution, stages as collections of tasks that can run in parallel, and jobs as the work submitted when an action is invoked on an RDD. The document also explains how the DAG scheduler creates a DAG of stages to evaluate the final result and splits that graph across the workers.
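As a minimal sketch of how these concepts relate (assuming a local master and a hypothetical input file `input.txt`, neither of which is specified in the original document), the Scala snippet below builds an RDD lineage with transformations and then triggers a single job with an action; the `reduceByKey` shuffle introduces a stage boundary, and each stage runs as one task per partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStagesTasksSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions, not from the document.
    val conf = new SparkConf().setAppName("job-stages-tasks").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations only build the lineage (the DAG); nothing executes yet.
    val lines  = sc.textFile("input.txt")                  // hypothetical input path
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                   // shuffle => stage boundary

    // The action below submits one job to the DAG scheduler, which splits the
    // lineage into stages and schedules their tasks on the executors.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```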