4. ● Master Node: the node on which the driver program runs, i.e.
where the main() method of the application executes.
● Worker Node: a worker node hosts executors and their cache,
which are responsible for running tasks.
● Executor: executors are responsible for running the tasks and
also provide the memory to store RDDs.
● Driver Program: the driver program has two duties:
creating tasks and scheduling tasks.
● Cluster manager: the cluster manager is responsible for
monitoring the cluster and providing resources to the executors.
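The interaction between these roles can be illustrated with a minimal pure-Python sketch (not real Spark; all function names here are hypothetical): the driver creates one task per partition, the cluster manager assigns tasks to executors, and the executors run them.

```python
# Hypothetical sketch of the roles above (plain Python, not real Spark).

def driver_create_tasks(data, num_partitions):
    """Driver duty #1: split the work into one task per partition."""
    size = max(1, len(data) // num_partitions)
    return [data[i:i + size] for i in range(0, len(data), size)]

def cluster_manager_assign(tasks, executors):
    """Cluster manager: hand each task to an executor (round-robin)."""
    return [(executors[i % len(executors)], t) for i, t in enumerate(tasks)]

def executor_run(task):
    """Executor: actually run the task (here, just sum the partition)."""
    return sum(task)

data = list(range(10))
tasks = driver_create_tasks(data, num_partitions=2)
assignments = cluster_manager_assign(tasks, executors=["exec-1", "exec-2"])
results = [executor_run(task) for _, task in assignments]
print(sum(results))  # → 45, the same answer as summing the data directly
```

The point of the sketch is the division of labor, not the computation: the driver never touches the data itself, it only decides how to split and schedule the work.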
5. Tasks, Stages, Jobs
● Tasks: a task is the smallest individual unit of execution;
each task processes one partition of a dataset.
[Diagram: an RDD with Partition 1, Partition 2, and Partition 3, each mapped to Task 1, Task 2, and Task 3 within a stage]
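The one-task-per-partition rule can be sketched in plain Python (a toy illustration, not real Spark):

```python
# Minimal sketch: the number of tasks in a stage equals the number
# of partitions in the RDD, and every task runs the same function.

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # an "RDD" with 3 partitions

def run_task(partition):
    # Each task applies the same transformation to its own partition.
    return [x * 2 for x in partition]

tasks = [run_task(p) for p in partitions]
print(len(tasks))  # 3 tasks for 3 partitions
```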
6. ● Stages: a stage is a collection of tasks; whenever a
shuffle happens, the tasks that follow it belong to a
new stage.
[Diagram: Stage 1 ends at a transformation that creates a ShuffledRDD; Stage 2 contains the subsequent transformations or action]
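The stage-splitting rule can be sketched in plain Python (a hedged illustration, not Spark's actual DAGScheduler): "wide" transformations such as repartition or reduceByKey force a shuffle, so each one closes the current stage.

```python
# Toy stage splitter: walk the chain of transformations and start a
# new stage after every shuffle (wide) transformation.

WIDE = {"repartition", "reduceByKey", "groupByKey", "join"}  # shuffle ops

def split_into_stages(transformations):
    stages, current = [], []
    for op in transformations:
        current.append(op)
        if op in WIDE:          # a shuffle ends the current stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

pipeline = ["map", "filter", "reduceByKey", "map", "count"]
print(split_into_stages(pipeline))
# → [['map', 'filter', 'reduceByKey'], ['map', 'count']]
```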
7. ● Jobs: a job is submitted to the DAGScheduler by the
Spark driver whenever an action is called; the scheduler
then runs tasks using the RDD lineage graph.
[Diagram: RDD → DAGScheduler → executor]
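This action-triggers-a-job behavior comes from lazy evaluation, which can be mimicked with a toy class (hypothetical names, not the real RDD API): transformations only record lineage, and only an action submits a job.

```python
# Toy illustration of laziness: map() records lineage, collect()
# (an action) is what actually submits a job and runs the chain.

class ToyRDD:
    jobs_submitted = 0  # counts how many jobs the "driver" submitted

    def __init__(self, data, lineage=()):
        self.data = data
        self.lineage = lineage  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: lazy, returns a new RDD, submits no job.
        return ToyRDD(self.data, self.lineage + (fn,))

    def collect(self):
        # Action: runs the whole recorded lineage as one job.
        ToyRDD.jobs_submitted += 1
        out = self.data
        for fn in self.lineage:
            out = [fn(x) for x in out]
        return out

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
print(ToyRDD.jobs_submitted)  # 0: no action yet, so no job
print(rdd.collect())          # [20, 30, 40]
print(ToyRDD.jobs_submitted)  # 1: the action submitted one job
```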
8. DAG
● The Spark scheduler builds a DAG of the stages and
sends the DAG object to the workers, which evaluate it
to produce the final result.
[Diagram: example DAG of operations: map, filter, count, repartition]
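The lineage shown on this slide can be sketched as a tiny DAG of nodes with parent pointers (plain Python, hypothetical structure), which is roughly how the scheduler sees the graph before carving it into stages:

```python
# Toy lineage DAG for the operations on this slide:
# map -> filter -> repartition -> count

class Node:
    def __init__(self, name, parent=None, wide=False):
        self.name, self.parent, self.wide = name, parent, wide

m = Node("map")
f = Node("filter", parent=m)
r = Node("repartition", parent=f, wide=True)  # shuffle boundary
c = Node("count", parent=r)                   # the action

def lineage(node):
    """Walk parent pointers back to the source, returning root first."""
    chain = []
    while node:
        chain.append(node.name)
        node = node.parent
    return list(reversed(chain))

print(lineage(c))  # ['map', 'filter', 'repartition', 'count']
```

Because repartition is marked wide, a scheduler walking this graph would cut a stage boundary there, putting map and filter in one stage and count in the next.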