
Core Services behind Spark Job Execution

DAGScheduler and TaskScheduler

Published in: Data & Analytics


  1. Core services behind Spark Job Execution: DAGScheduler and TaskScheduler
  2. Agenda
     ● Spark Architecture
     ● RDD
     ● Job-Stage-Task
     ● Bird's-Eye View
     ● DAGScheduler
     ● TaskScheduler
  3. Spark Architecture
  4. Spark Architecture
     Driver:
     ● Hosts the SparkContext; the cockpit of job and task execution
     ● Schedules tasks to run on executors
     ● Contains the DAGScheduler and the TaskScheduler
     Executor:
     ● Allocated statically or dynamically (static vs. dynamic allocation)
     ● Sends heartbeats and metrics to the driver
     ● Provides in-memory storage for RDDs
     ● Communicates directly with the driver to execute tasks
  5. RDD — Resilient Distributed Dataset
     ● RDD is the primary data abstraction in Spark and the core of Spark
     ● Motivation for RDD
     ● Features of RDD
  6. RDD — Resilient Distributed Dataset
     ● RDD creation
     ● RDD lineage
     ● Lazy execution
     ● Partitions
  7. RDD operations
     ● Transformations
     ● Actions
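The transformation/action split can be sketched in plain Scala. This is a minimal toy model, not Spark's implementation; ToyRDD and its methods are invented for illustration. Transformations only wrap the parent's compute function (the "lineage"); the first action forces the whole chain to run.

```scala
// Toy sketch (NOT Spark's API): transformations are lazy, actions are eager.
final case class ToyRDD[A](compute: () => Seq[A]) {
  // Transformations: return a new ToyRDD; nothing is evaluated yet.
  def map[B](f: A => B): ToyRDD[B] = ToyRDD(() => compute().map(f))
  def filter(p: A => Boolean): ToyRDD[A] = ToyRDD(() => compute().filter(p))
  // Action: eagerly evaluates the chained functions.
  def count(): Long = compute().size.toLong
}

object ToyRDDDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = false
    val file = ToyRDD(() => { evaluated = true; Seq("INFO ok", "ERROR boom") })
    val errors = file.filter(_.contains("ERROR")) // transformation: still lazy
    assert(!evaluated)                            // source not yet touched
    assert(errors.count() == 1L)                  // action: forces evaluation
    assert(evaluated)
  }
}
```

The same shape explains why the sample program on the next slide reads nothing from HDFS until `errors.count()` is called.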
  8. A sample Spark program
     val conf = new SparkConf().setAppName(appName).setMaster(master)
     val sc = new SparkContext(conf)
     val file = sc.textFile("hdfs://...") // This is an RDD
     val errors = file.filter(_.contains("ERROR")) // This is an RDD
     val errorCount = errors.count() // This is an "action"
  9. Job-Stage-Task
     What is a job?
     ● The top-level work item: a job is a computation triggered by an action
     ● Computes the partitions of a target RDD by following its lineage
  10. Job-Stage-Task
      ● A job is divided into stages: the logical plan becomes a physical plan of execution units
      ● A stage is a set of parallel tasks
      ● Stage boundaries are drawn at shuffles
      ● Computing a stage triggers execution of its parent stages first
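The stage-boundary rule above can be illustrated with a small sketch (Op, Narrow, Wide, and splitIntoStages are invented names, not Spark code): walk a linear lineage and start a new stage whenever a wide (shuffle) dependency appears, pipelining narrow operations into the current stage.

```scala
// Toy model of stage splitting: narrow ops pipeline, wide ops cut a boundary.
sealed trait Op { def name: String; def wide: Boolean }
final case class Narrow(name: String) extends Op { val wide = false } // e.g. map, filter
final case class Wide(name: String) extends Op { val wide = true }    // e.g. reduceByKey

object StageSplit {
  def splitIntoStages(lineage: List[Op]): List[List[String]] =
    lineage.foldLeft(List(List.empty[String])) { (stages, op) =>
      if (op.wide) List(op.name) :: stages         // shuffle: start a new stage
      else (op.name :: stages.head) :: stages.tail // narrow: pipeline into current stage
    }.map(_.reverse).reverse
}
```

For a lineage `map → filter → reduceByKey → map`, this yields two stages: the pipelined `map, filter` (the shuffle-map side) and the post-shuffle `reduceByKey, map` (the result side).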
  11. Types of stages
      ShuffleMapStage:
      ● Intermediate stage in the execution DAG
      ● Saves map output, which is fetched later by a downstream stage
      ● Pipelines the operations that precede the shuffle
      ● Can be shared across jobs
      ResultStage:
      ● Final stage, executing the action
      ● Works on one or many partitions
  12. Job-Stage-Task
      What is a task?
      ● The smallest unit of execution
      ● Comprises a function to run and a placement preference
      ● Operates on a single partition
      ● Launched on an executor and run there
  13. Types of tasks
      ShuffleMapTask:
      ● Runs in an intermediate stage
      ● Returns a MapStatus to the driver
      ResultTask:
      ● Runs in the last stage
      ● Returns its output to the driver
  14. DAGScheduler
      ● Initialization
      ● Stage-oriented scheduling (logical → physical)
      ● DAGSchedulerEvent (job or stage)
      ● Stage submissions
  15. DAGScheduler Responsibilities
      ● Computes an execution DAG and submits its stages to the TaskScheduler
  16. DAGScheduler Responsibilities
      ● Computes an execution DAG
      ● Determines the preferred locations to run each task on, keeping track of cached RDDs
  17. DAGScheduler Responsibilities
      ● Computes an execution DAG
      ● Determines the preferred locations to run each task on
      ● Handles failures due to lost shuffle output files (FetchFailed, ExecutorLost)
  18. DAGScheduler Responsibilities
      ● Computes an execution DAG
      ● Determines the preferred locations to run each task on
      ● Handles failures due to lost shuffle output files (FetchFailed, ExecutorLost)
      ● Retries failed stages
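The "preferred locations" bullet can be pictured with a toy helper (pickLocation and its parameters are invented for illustration): if some executor already caches a partition, schedule the task there; otherwise fall back to any available executor.

```scala
object Locality {
  // Toy sketch of data-locality-aware placement, not Spark's scheduler logic.
  // cacheLocations maps a partition index to the executor caching it.
  def pickLocation(partition: Int,
                   cacheLocations: Map[Int, String],
                   executors: Seq[String]): String =
    cacheLocations.getOrElse(partition, executors.head)
}
```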
  19. TaskScheduler
      ● Lifecycle
  20. TaskSet and TaskSetManager
      ● What is a TaskSet?
        ○ A fully independent set of tasks
      ● Why a TaskSetManager?
      ● Responsibilities of the TaskSetManager:
        ○ Scheduling the tasks in a TaskSet
        ○ Completion notification
        ○ Retry and abort
        ○ Locality preference
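The "retry and abort" responsibility can be sketched as follows (runWithRetry and TaskAborted are invented names; this is the idea, not Spark's TaskSetManager code): rerun a failing task up to a failure limit, then abort the task set.

```scala
// Raised when a task has exhausted its retry budget.
final case class TaskAborted(msg: String) extends RuntimeException(msg)

object Retry {
  // Toy retry loop: rerun the task until it succeeds or hits maxFailures.
  def runWithRetry[A](maxFailures: Int)(task: () => A): A = {
    var failures = 0
    while (true) {
      try {
        return task()
      } catch {
        case e: Exception =>
          failures += 1
          if (failures >= maxFailures)
            throw TaskAborted(s"aborting after $failures failures: ${e.getMessage}")
      }
    }
    throw TaskAborted("unreachable")
  }
}
```

In Spark itself the failure limit is a configuration knob rather than a function argument, and abort fails the whole stage.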
  21. TaskScheduler Responsibilities
      ● Responsible for submitting the tasks of every stage for execution
      ● Works closely with the DAGScheduler to resubmit stages
      ● Tracks the executors in a Spark application (executorHeartBeat and executorLost)
  22. Thank You
