Apache Spark
Internals
Sandeep Purohit
Software Consultant
Knoldus Software LLP
Agenda
● Architecture of a Spark cluster
● Tasks, Stages, Jobs
● DAG
● Execution Workflow
● Demo
Architecture of Spark Cluster
[Diagram: the Driver Program (SparkContext) on the master node talks to a Cluster Manager (Standalone, YARN, or Mesos), which allocates executors on the worker nodes.]
● Master Node: the node on which the driver program runs, i.e. where the application's main() method executes.
● Worker Node: a worker node hosts executors and their cache, and is responsible for running tasks.
● Executor: executors are responsible for running the tasks and also provide the memory to store RDDs.
● Driver Program: the driver program has two duties: creating tasks and scheduling tasks.
● Cluster Manager: the cluster manager is responsible for monitoring the cluster and providing resources to the executors.
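The division of labour above can be sketched as a toy simulation in plain Python (no Spark required). All class and method names here are illustrative stand-ins, not Spark's actual API: the driver creates one task per partition, the cluster manager hands tasks to executors, and each executor runs its tasks and keeps results in memory.

```python
# Toy model of the driver / cluster-manager / executor roles (illustrative only;
# the real Spark classes have different names and far more responsibilities).

class Executor:
    """Runs tasks and keeps an in-memory cache (stands in for RDD storage)."""
    def __init__(self, executor_id):
        self.executor_id = executor_id
        self.cache = {}

    def run(self, task):
        result = task()
        self.cache[id(task)] = result  # executors also provide memory for results
        return result

class ClusterManager:
    """Hands tasks to executors round-robin (naive resource allocation)."""
    def __init__(self, executors):
        self.executors = executors

    def submit(self, tasks):
        return [self.executors[i % len(self.executors)].run(task)
                for i, task in enumerate(tasks)]

class Driver:
    """Creates tasks from a function plus data partitions, then schedules them."""
    def __init__(self, cluster_manager):
        self.cluster_manager = cluster_manager

    def run_job(self, func, partitions):
        tasks = [lambda p=p: func(p) for p in partitions]  # one task per partition
        return self.cluster_manager.submit(tasks)

manager = ClusterManager([Executor(0), Executor(1)])
driver = Driver(manager)
print(driver.run_job(sum, [[1, 2], [3, 4], [5, 6]]))  # → [3, 7, 11]
```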
Tasks, Stages, Jobs
● Tasks: a task is the smallest individual unit of execution; each task processes one partition of a dataset.
[Diagram: within a stage, Partitions 1–3 of an RDD map to Tasks 1–3, one task per partition.]
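The one-task-per-partition picture can be made concrete with a small sketch (plain Python; `make_tasks` is an illustrative helper, not a Spark function):

```python
# One task per partition: applying a function across partitions yields exactly
# one task (and one result) per partition, mirroring the RDD → tasks diagram.

def make_tasks(partitions, func):
    """Return one zero-argument task per partition."""
    return [lambda p=p: func(p) for p in partitions]

partitions = [[1, 2, 3], [4, 5], [6]]
tasks = make_tasks(partitions, len)  # task i computes len(partition i)
print(len(tasks))                    # 3 tasks for 3 partitions
print([t() for t in tasks])          # → [3, 2, 1]
```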
● Stages: a stage is a collection of tasks; whenever a shuffle happens, the subsequent tasks belong to a new stage.
[Diagram: Stage 1 ends at a transformation that creates a ShuffledRDD; Stage 2 begins with the next transformation or action.]
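The stage-boundary rule can be sketched as follows (plain Python; the set of "wide" operations below is an illustrative subset of Spark's shuffle-producing transformations):

```python
# Stage boundaries fall at shuffles: split a linear chain of operations into
# stages, starting a new stage after each wide (shuffle-producing) operation.

WIDE_OPS = {"repartition", "groupByKey", "reduceByKey", "join"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE_OPS:      # shuffle → this stage ends here
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

print(split_into_stages(["map", "filter", "reduceByKey", "map", "count"]))
# → [['map', 'filter', 'reduceByKey'], ['map', 'count']]
```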
● Jobs: a job is created when an action is submitted to the DAGScheduler by the Spark driver, which then runs the tasks using the RDD lineage graph.
[Flow: RDD → DAGScheduler → executor]
DAG
● The Spark scheduler creates a DAG of the stages and sends the DAG object to the workers to evaluate the final result.
[Example DAG built from the operations map, filter, count, and repartition.]
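The scheduler discovers stages by walking the RDD lineage backwards from the final RDD, much as shown in this sketch (plain Python; `RDDNode` and the `wide` flag are illustrative stand-ins, not Spark internals):

```python
# Walk the lineage backwards from the final RDD: every wide (shuffle)
# dependency encountered adds one more stage to the DAG.

class RDDNode:
    def __init__(self, op, parent=None, wide=False):
        self.op, self.parent, self.wide = op, parent, wide

base = RDDNode("textFile")
mapped = RDDNode("map", parent=base)
filtered = RDDNode("filter", parent=mapped)
shuffled = RDDNode("repartition", parent=filtered, wide=True)  # shuffle boundary

def count_stages(rdd):
    """1 stage for the final chain, +1 per wide dependency in the lineage."""
    stages, node = 1, rdd
    while node is not None:
        if node.wide:
            stages += 1
        node = node.parent
    return stages

print(count_stages(shuffled))  # → 2: one stage before the shuffle, one after
```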
Execution Workflow
[Flow: the RDD object creates the DAG → the DAG Scheduler splits the graph into stages and submits a TaskSet → the Cluster Manager assigns it to a worker → the worker runs the tasks.]
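The whole workflow can be tied together in one end-to-end sketch (plain Python; the operation names, stand-in map function, and filter predicate are all illustrative): the operator chain is split into stages at shuffles, each stage becomes a TaskSet with one task per partition, and each stage's results feed the next.

```python
# End-to-end sketch: chain of operations → stages (split at shuffles) →
# one TaskSet per stage (one task per partition) → run tasks stage by stage.

def build_stages(ops, wide_ops=("repartition",)):
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in wide_ops:          # shuffle ends the stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

def make_taskset(stage_ops, partitions):
    """One task per partition; each task applies the stage's ops to its data."""
    def run_task(part):
        for op in stage_ops:
            if op == "map":
                part = [x * 2 for x in part]       # stand-in map function
            elif op == "filter":
                part = [x for x in part if x > 2]  # stand-in predicate
        return part
    return [lambda p=p: run_task(p) for p in partitions]

stages = build_stages(["map", "filter", "repartition", "map"])
partitions = [[1, 2], [3, 4]]
for stage in stages:
    taskset = make_taskset(stage, partitions)
    partitions = [t() for t in taskset]  # run the tasks, feed the next stage
print(partitions)  # → [[8], [12, 16]]
```

Note the shuffle itself is not simulated here; `repartition` only marks where one stage ends and the next begins.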
Demo
Q&A
Thanks