Apache Spark
Internals
Sandeep Purohit
Software Consultant
Knoldus Software LLP
Agenda
● Architecture of a Spark cluster
● Tasks, Stages, Jobs
● DAG
● Execution Workflow
● Demo
Architecture of Spark Cluster
Master node: Driver Program (SparkContext)
        |
Cluster Manager (Standalone / YARN / Mesos)
        |
Worker node                 Worker node
[Executor] [Executor]       [Executor] [Executor]
● Master Node: the node on which the driver program runs, i.e. where
the main() method of the application executes.
● Worker Node: a worker node hosts an executor and its cache, which
are responsible for running tasks.
● Executor: executors are responsible for running tasks and also
provide the memory to store RDDs.
● Driver Program: the driver program is responsible for two duties:
creating tasks and scheduling tasks.
● Cluster Manager: the cluster manager is responsible for monitoring
the cluster and providing resources to the executors.
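The roles above can be seen in a minimal driver program. This is a sketch, assuming local mode stands in for a real cluster manager; the app name and master URL are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver's main() creates a SparkContext, which registers with a
// cluster manager; local[2] stands in for spark://host:7077, yarn, or mesos://...
val conf = new SparkConf()
  .setAppName("ArchitectureDemo")   // placeholder app name
  .setMaster("local[2]")
val sc = new SparkContext(conf)

// The driver defines RDDs; executors run the tasks and hold the data.
val data = sc.parallelize(1 to 10)
val total = data.sum()              // work runs on executors, result returns to the driver
println(total)
sc.stop()
```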
Tasks, Stages, Jobs
● Tasks: the smallest individual unit of execution; each task
processes one partition of a dataset.
RDD (one stage):
  Partition 1 -> Task 1
  Partition 2 -> Task 2
  Partition 3 -> Task 3
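The one-task-per-partition mapping above can be checked directly. A sketch in local mode; the app name is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// An RDD with 3 partitions yields 3 tasks per stage when an action runs.
val sc = new SparkContext(new SparkConf().setAppName("TasksDemo").setMaster("local[3]"))
val rdd = sc.parallelize(1 to 90, numSlices = 3)  // explicitly 3 partitions
val n = rdd.getNumPartitions                      // -> 3 tasks per stage
println(n)
sc.stop()
```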
● Stages: a stage is a collection of tasks that can run without
shuffling data; whenever a shuffle happens, the tasks that follow
it belong to a new stage.
Stage 1 --[any transformation that creates a ShuffledRDD]--> Stage 2 --[any transformation or action]
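A shuffle boundary like the one above is introduced by operations such as reduceByKey. A sketch in local mode, with a placeholder app name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// reduceByKey creates a ShuffledRDD, splitting the computation into two
// stages: the map side (stage 1) and the reduce side (stage 2).
val sc = new SparkContext(new SparkConf().setAppName("StagesDemo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1)) // stage 1
val counts = pairs.reduceByKey(_ + _)             // shuffle -> stage boundary
val result = counts.collectAsMap()                // action runs both stages
println(result)                                   // a -> 3, b -> 2, c -> 1
sc.stop()
```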
● Jobs: a job is submitted to the DAGScheduler by the Spark driver
whenever an action is called; the scheduler then runs the tasks
using the RDD lineage graph.
RDD -> DAGScheduler -> Executor
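Laziness makes the job boundary visible: transformations build the lineage, and only an action submits a job. A sketch in local mode; the app name is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Transformations are lazy; only an action submits a job to the DAGScheduler.
val sc = new SparkContext(new SparkConf().setAppName("JobsDemo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100).filter(_ % 2 == 0) // no job yet: lazy
val evens = rdd.count()                               // action -> job submitted
println(evens)                                        // 50
sc.stop()
```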
DAG
● The Spark scheduler builds a DAG of stages from the RDD lineage
graph and ships the resulting task sets to the workers to compute
the final result.
Example operators in the DAG: map, filter, repartition, count
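The lineage graph the DAGScheduler works from can be inspected with toDebugString. A sketch in local mode, with a placeholder app name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// toDebugString prints the RDD lineage used to build the stage DAG;
// indentation changes mark shuffle (i.e. stage) boundaries.
val sc = new SparkContext(new SparkConf().setAppName("DagDemo").setMaster("local[2]"))
val dag = sc.parallelize(1 to 10)
  .map(_ * 2)
  .filter(_ > 5)
  .repartition(2)          // shuffle: introduces a new stage
val lineage = dag.toDebugString
println(lineage)           // shows a ShuffledRDD in the lineage
sc.stop()
```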
Execution workflow
RDD object (create DAG) --[DAG]--> DAG Scheduler (split graph into stages) --[TaskSet]--> Cluster Manager --[Task]--> Worker (run task)
Demo
Q&A
Thanks